[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543141#comment-15543141 ]
Dilip Biswal commented on SPARK-17709:
--------------------------------------
@ashrowty Hi Ashish, in your example, the column loyaltycardnumber is not in
the output set, which is why we see the exception. I tried using productid
instead and got the correct result.
{code}
scala> df1.join(df2, Seq("companyid","loyaltycardnumber"));
org.apache.spark.sql.AnalysisException: using columns ['companyid,'loyaltycardnumber] can not be resolved given input columns: [productid, companyid, avgprice, avgitemcount, companyid, productid] ;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2651)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:679)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:652)
  ... 48 elided
scala> df1.join(df2, Seq("companyid","productid"));
res1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 2 more fields]
scala> df1.join(df2, Seq("companyid","productid")).show
+---------+---------+--------+------------+
|companyid|productid|avgprice|avgitemcount|
+---------+---------+--------+------------+
|      101|        3|    13.0|        12.0|
|      100|        1|    10.0|        10.0|
+---------+---------+--------+------------+
{code}
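As a rough illustration of the check that fails above, the sketch below lists which of the requested USING columns are absent from either side before attempting the join. The helper name missingJoinColumns and the sample column lists are hypothetical and not part of Spark. The case-insensitive comparison assumes Spark's default setting (spark.sql.caseSensitive=false). This is only a pre-check for a misspelled or missing column, not a description of how the analyzer resolves columns internally.

```scala
// Hypothetical helper (not a Spark API): returns the requested join columns
// that are absent from at least one side's column list.
// Names are compared case-insensitively, assuming Spark's default
// spark.sql.caseSensitive=false.
def missingJoinColumns(
    left: Seq[String],
    right: Seq[String],
    using: Seq[String]): Seq[String] = {
  val leftNames = left.map(_.toLowerCase).toSet
  val rightNames = right.map(_.toLowerCase).toSet
  using.filterNot { c =>
    val name = c.toLowerCase
    leftNames.contains(name) && rightNames.contains(name)
  }
}

// Assumed column lists, for illustration only. In spark-shell one would
// pass df1.columns.toSeq and df2.columns.toSeq instead.
val leftCols = Seq("companyid", "productid", "avgprice")
val rightCols = Seq("companyid", "productid", "avgitemcount")

// The column that neither side provides is reported.
println(missingJoinColumns(leftCols, rightCols, Seq("companyid", "loyaltycardnumber")))
// prints List(loyaltycardnumber)

// Both requested columns exist on both sides, so nothing is reported.
println(missingJoinColumns(leftCols, rightCols, Seq("companyid", "productid")))
// prints List()
```

An empty result means every requested column is present on both sides, so a remaining AnalysisException would point to something other than a missing or misspelled column name.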
> spark 2.0 join - column resolution error
> ----------------------------------------
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Ashish Shrowty
> Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from <hivetable>")
> val df1 = d1.groupBy("key1","key2")
> .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
> .agg(avg("itemcount").as("avgqty"))
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code
> works. This same code above worked with Spark 1.6.2