[
https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822984#comment-16822984
]
Hyukjin Kwon commented on SPARK-19860:
--------------------------------------
Does the size of the data matter for reproducing this issue, or does the query
need to be complicated? From what you have described, it doesn't sound too
difficult to post a reproducer.
> DataFrame join get conflict error if two frames has a same name column.
> -----------------------------------------------------------------------
>
> Key: SPARK-19860
> URL: https://issues.apache.org/jira/browse/SPARK-19860
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: wuchang
> Priority: Major
>
> {code}
> >>> print df1.collect()
> [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302',
> in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305),
> Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224',
> in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381),
> Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303',
> in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746),
> Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227',
> in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812),
> Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203',
> in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303),
> Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209',
> in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569),
> Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213',
> in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203),
> Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210',
> in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829),
> Row(fdate=u'20170301', in_amount1=10159653)]
> >>> print df2.collect()
> [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302',
> in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110),
> Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224',
> in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502),
> Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303',
> in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218),
> Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227',
> in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154),
> Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203',
> in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662),
> Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209',
> in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389),
> Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213',
> in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951),
> Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210',
> in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422),
> Row(fdate=u'20170301', in_amount2=9475418)]
> >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner')
> 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially
> true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
> jdf = self._jdf.join(other._jdf, on._jc, how)
> File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 933, in __call__
> File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"
> Failure when resolving conflicting references in Join:
> 'Join Inner, (fdate#42 = fdate#42)
> :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double))
> as int) AS in_amount1#97]
> : +- Filter (inorout#44 = A)
> : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47,
> fdate#42]
> : +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) &&
> (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
> : +- SubqueryAlias history_transfer_v
> : +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40,
> fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47,
> bankwaterid#48, waterid#49, waterstate#50, source#51]
> : +- SubqueryAlias history_transfer
> : +-
> Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
> parquet
> +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double))
> as int) AS in_amount2#145]
> +- Filter (inorout#44 = B)
> +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47,
> fdate#42]
> +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) &&
> (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
> +- SubqueryAlias history_transfer_v
> +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40,
> fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47,
> bankwaterid#48, waterid#49, waterstate#50, source#51]
> +- SubqueryAlias history_transfer
> +-
> Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
> parquet
>
> Conflicting attributes: fdate#42
> {code}
> Only when I use the .withColumnRenamed method to rename df1's fdate column
> to fdate1 and df2's fdate column to fdate2 does the join succeed.
> So my question is: why does this conflict happen?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]