[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597271#comment-14597271 ]
Animesh Baranawal edited comment on SPARK-8072 at 6/23/15 8:24 AM:
-------------------------------------------------------------------
[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in
DataFrameSuite to fail. The tests are:
1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe
If we support only distinct column names in a DataFrame, the first test is no
longer needed and the second can easily be made to pass. However, in the third
test, the describe function uses the agg function, which produces a DataFrame
with multiple columns that have identical names. How should this be resolved?
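To make the describe conflict concrete, here is a minimal sketch. It is an illustration only, not the actual describe() code: it assumes a Spark 1.4-era spark-shell where {{sqlContext}} is already in scope and uses a made-up two-row DataFrame. It shows how a single agg call can legitimately yield a DataFrame whose columns share a name, which is exactly what a strict no-duplicate-columns rule would reject.
{code}
// Illustration only -- NOT the describe() implementation.
// Assumes a Spark 1.4-era spark-shell with `sqlContext` already in scope.
import org.apache.spark.sql.functions.count

val df = sqlContext.createDataFrame(Seq((1L, "a"), (2L, "b"))).toDF("age", "name")

// Aggregating the same expression twice yields two output columns with the
// same generated name; a blanket "no duplicate column names" rule would
// reject this plan even though it is otherwise well formed.
val aggregated = df.agg(count(df("age")), count(df("age")))
println(aggregated.columns.mkString(", "))  // the same column name appears twice
{code}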
was (Author: animeshbaranawal):
[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in
DataFrameSuite to fail. The tests are:
1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe
Can you suggest how to resolve this?
> Better AnalysisException for writing DataFrame with identically named columns
> -----------------------------------------------------------------------------
>
> Key: SPARK-8072
> URL: https://issues.apache.org/jira/browse/SPARK-8072
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Priority: Blocker
>
> We should check whether there are duplicate columns and, if so, throw an
> explicit error message saying that there are duplicate columns. See the
> current error message below.
> {code}
> In [3]: df.withColumn('age', df.age)
> Out[3]: DataFrame[age: bigint, name: string, age: bigint]
> In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
> ---------------------------------------------------------------------------
> Py4JJavaError Traceback (most recent call last)
> <ipython-input-4-eecb85256898> in <module>()
> ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
> /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
> 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
> 351 """
> --> 352 self._jwrite.mode(mode).parquet(path)
> 353
> 354 @since(1.4)
> /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
> 535 answer = self.gateway_client.send_command(command)
> 536 return_value = get_return_value(answer, self.gateway_client,
> --> 537 self.target_id, self.name)
> 538
> 539 for temp_arg in temp_args:
> /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o35.parquet.
> : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could be: age#0L, age#3L.;
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
> at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
> at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:127)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8.applyOrElse(Analyzer.scala:341)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8.applyOrElse(Analyzer.scala:243)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:243)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:242)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:61)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:59)
> at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:59)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:51)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:51)
> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:903)
> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:903)
> at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:98)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:920)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:920)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:338)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:744)
> {code}
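For illustration, here is a rough sketch of the kind of pre-write check the issue asks for. It is hypothetical, not the actual Spark change: the helper name assertNoDuplicateColumns is made up, and a plain exception stands in for AnalysisException, whose constructor is package-private to org.apache.spark.sql (where the real check would live).
{code}
// Hypothetical sketch, not the actual fix: list the duplicated output names up
// front instead of failing later with "Reference 'age' is ambiguous".
import org.apache.spark.sql.DataFrame

def assertNoDuplicateColumns(df: DataFrame): Unit = {
  val duplicated = df.schema.fieldNames
    .groupBy(_.toLowerCase)   // assumes the default case-insensitive resolution
    .collect { case (_, names) if names.length > 1 => names.head }
  if (duplicated.nonEmpty) {
    // Inside Spark proper this would be an AnalysisException raised at analysis time.
    throw new IllegalArgumentException(
      s"Found duplicate column(s) when writing: ${duplicated.mkString(", ")}")
  }
}

// For the example above, assertNoDuplicateColumns(df.withColumn("age", df("age")))
// would report 'age' explicitly rather than an ambiguous reference.
{code}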