[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

Dilip Biswal (JIRA) Mon, 03 Oct 2016 12:01:43 -0700

    [ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543141#comment-15543141 ]


Dilip Biswal commented on SPARK-17709:
--------------------------------------

@ashrowty Hi Ashish, in your example the column loyaltycardnumber is not in
the output set of the join inputs, which is why we see the exception. I tried
using productid instead and got the correct result.

{code}
scala> df1.join(df2, Seq("companyid","loyaltycardnumber"));
org.apache.spark.sql.AnalysisException: using columns ['companyid,'loyaltycardnumber] can not be resolved given input columns: [productid, companyid, avgprice, avgitemcount, companyid, productid] ;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2651)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:679)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:652)
  ... 48 elided

scala> df1.join(df2, Seq("companyid","productid"));
res1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 2 more fields]

scala> df1.join(df2, Seq("companyid","productid")).show
+---------+---------+--------+------------+
|companyid|productid|avgprice|avgitemcount|
+---------+---------+--------+------------+
|      101|        3|    13.0|        12.0|
|      100|        1|    10.0|        10.0|
+---------+---------+--------+------------+
{code}
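
One way to catch this before calling join is to compare the join keys against
the column names of both sides. The following is only a minimal sketch, not
anything Spark provides: missingJoinKeys and JoinKeyCheck are hypothetical
names, and the split of the six input columns between df1 and df2 is inferred
from the order in the exception message above, not confirmed. The comparison
is case-insensitive on the assumption that spark.sql.caseSensitive is left at
its default of false.

```scala
object JoinKeyCheck {
  // Hypothetical helper (not a Spark API): returns the join keys that are
  // missing from at least one side's column list.
  // Names are compared case-insensitively, assuming Spark's default
  // setting spark.sql.caseSensitive=false.
  def missingJoinKeys(left: Seq[String],
                      right: Seq[String],
                      keys: Seq[String]): Seq[String] = {
    def has(cols: Seq[String], key: String): Boolean =
      cols.exists(_.equalsIgnoreCase(key))
    keys.filterNot(k => has(left, k) && has(right, k))
  }

  def main(args: Array[String]): Unit = {
    // Assumed column lists, inferred from the "input columns" in the
    // exception message; with real DataFrames use df1.columns / df2.columns.
    val df1Cols = Seq("productid", "companyid", "avgprice")
    val df2Cols = Seq("avgitemcount", "companyid", "productid")

    // The failing join: loyaltycardnumber is on neither side.
    println(missingJoinKeys(df1Cols, df2Cols, Seq("companyid", "loyaltycardnumber")))
    // prints List(loyaltycardnumber)

    // The working join: both keys are present on both sides.
    println(missingJoinKeys(df1Cols, df2Cols, Seq("companyid", "productid")))
    // prints List()
  }
}
```

In the shell, the same check would be
JoinKeyCheck.missingJoinKeys(df1.columns, df2.columns, Seq("companyid", "loyaltycardnumber")),
since Dataset.columns returns the column names as an Array[String]. An empty
result means every using column is available on both sides.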

> spark 2.0 join - column resolution error
> ----------------------------------------
>
>                 Key: SPARK-17709
>                 URL: https://issues.apache.org/jira/browse/SPARK-17709
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ashish Shrowty
>              Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from <hivetable>")  
> val df1 = d1.groupBy("key1","key2")
>           .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>           .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same DataFrame is initialized via spark.read.parquet(), the above code 
> works. The same code also worked with Spark 1.6.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

