[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597271#comment-14597271 ]
Animesh Baranawal edited comment on SPARK-8072 at 6/23/15 9:28 AM:
-------------------------------------------------------------------

[~rxin] Creating a rule that rejects duplicate columns causes three of the tests in DataFrameSuite to fail:

1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe

If we support only distinct column names in a DataFrame, the first test is no longer needed and the second can easily be made to pass. In the third test, however, the describe function uses the agg function, which produces a DataFrame with multiple identically named columns. This can also be resolved with a minor tweak to the code of 'describe' (a rough illustration appears at the end of this message). Should the first test then be removed?

> Better AnalysisException for writing DataFrame with identically named columns
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-8072
>                 URL: https://issues.apache.org/jira/browse/SPARK-8072
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Blocker
>
> We should check whether there are duplicate columns and, if so, throw an explicit error message saying there are duplicate columns (a sketch of such a check follows the stack trace below). See the current error message:
> {code}
> In [3]: df.withColumn('age', df.age)
> Out[3]: DataFrame[age: bigint, name: string, age: bigint]
>
> In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-4-eecb85256898> in <module>()
> ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
>
> /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
>     350         >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
>     351         """
> --> 352         self._jwrite.mode(mode).parquet(path)
>     353
>     354     @since(1.4)
>
> /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
>     535         answer = self.gateway_client.send_command(command)
>     536         return_value = get_return_value(answer, self.gateway_client,
> --> 537             self.target_id, self.name)
>     538
>     539         for temp_arg in temp_args:
>
> /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>     298                 raise Py4JJavaError(
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
>     302                 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o35.parquet.
> : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could be: age#0L, age#3L.;
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
>   at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:127)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8.applyOrElse(Analyzer.scala:341)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8.applyOrElse(Analyzer.scala:243)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:243)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:242)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:61)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>   at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:59)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:51)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:51)
>   at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:903)
>   at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:903)
>   at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:98)
>   at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
>   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:920)
>   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:920)
>   at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:338)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
>   at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
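A minimal sketch of the kind of pre-write check the issue asks for, under stated assumptions: `checkDuplicateColumns` is a hypothetical helper, not Spark's actual API, and it throws a plain IllegalArgumentException where Spark itself would raise an AnalysisException (whose constructor is not public outside org.apache.spark.sql).

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: fail fast, with an explicit message, when the schema
// of a DataFrame about to be written contains duplicate column names.
def checkDuplicateColumns(df: DataFrame): Unit = {
  // Spark resolves identifiers case-insensitively by default, so compare
  // lower-cased names.
  val names = df.schema.fieldNames.map(_.toLowerCase)
  val duplicates = names.groupBy(identity).collect {
    case (name, occurrences) if occurrences.length > 1 => name
  }
  if (duplicates.nonEmpty) {
    // Inside Spark's own code this would be an AnalysisException.
    throw new IllegalArgumentException(
      s"Duplicate column(s) found: ${duplicates.mkString(", ")}; cannot save to file.")
  }
}
{code}

With such a check in place, df.withColumn('age', df.age).write.parquet(...) would fail immediately with a message naming the duplicated column instead of the ambiguous-reference trace above.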
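And a rough illustration of the describe point in the comment above: building statistics through agg without distinct aliases yields identically named output columns, which a duplicate-column rule would reject; the "minor tweak" is to make each alias unique. This is not the actual code of DataFrame.describe, and it assumes a SparkSession named `spark` (the ticket's era used SQLContext, but the aliasing behavior is the same).

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, max}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("age", "name")

// Two aggregates aliased to the same bare column name: the result has two
// columns called "age", which a duplicate-column rule would reject.
val clashing = df.agg(count("age").as("age"), max("age").as("age"))

// The tweak: qualify each alias with the statistic name so every output
// column is distinct.
val distinctNames = df.agg(count("age").as("count(age)"), max("age").as("max(age)"))
{code}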