Alexis Seigneurin created SPARK-10925:
-----------------------------------------

             Summary: Exception when joining DataFrames
                 Key: SPARK-10925
                 URL: https://issues.apache.org/jira/browse/SPARK-10925
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.5.1, 1.5.0
         Environment: Tested with Spark 1.5.0 and Spark 1.5.1
            Reporter: Alexis Seigneurin
         Attachments: TestCase2.scala

I get an exception when joining a DataFrame with another DataFrame. The second 
DataFrame was created by performing an aggregation on the first DataFrame.

My complete workflow is:

# read the DataFrame
# apply an UDF on column "name"
# apply an UDF on column "surname"
# apply an UDF on column "birthDate"
# aggregate on "name" and re-join with the DF
# aggregate on "surname" and re-join with the DF

If I remove one step, the process completes normally.

Here is the exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
birthDate_cleaned#8];
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
        at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
        at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
        at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
        at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
        at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
        at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
        at TestCase2$.main(TestCase2.scala:51)
        at TestCase2.main(TestCase2.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
{code}

I'm attaching a test case that I tried with Spark 1.5.0 and 1.5.1. Please note 
it used to work with version 1.4.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to