[ https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282119#comment-14282119 ]
Bibudh Lahiri commented on SPARK-4689:
--------------------------------------

I have added the following override method to SchemaRDD.scala (lines 311-315 in https://github.com/bibudhlahiri/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ), following the pattern of the other overridden methods such as intersection():

    override def union(other: RDD[Row]): SchemaRDD = {
      logInfo("calling union in SchemaRDD")
      applySchema((super.union(other)).distinct())
    }

However, when I run the test code in the following class:

https://github.com/bibudhlahiri/spark/blob/master/dev/audit-release/sbt_app_schema_rdd/src/main/scala/SchemaRDDApp.scala

the union of people and people1 at line 52 does not eliminate the duplicates. From log messages that I put in the union(other: RDD[T]) method in RDD.scala, it is clear that the union() in RDD is being called (which is known to keep identical elements multiple times), and not the union() in SchemaRDD.

To investigate, I printed the classes of the objects people and people1 at line 43. They are MapPartitionsRDD (which is what the map() method of RDD returns) rather than SchemaRDD, although I was expecting SchemaRDD because of the implicit method createSchemaRDD.

To check whether the same situation arises for other methods that are overridden in SchemaRDD (like intersection()), I wrote some test code at lines 48-50. There, too, the version in RDD appears to be the one called, although in that case it does not produce a wrong result.

Can you please suggest what I can do here, without making any change to core classes like RDD?
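
One possible explanation for the behaviour described above, assuming people and people1 are declared as plain RDDs of a case class: the Scala compiler applies an implicit conversion to a receiver only when the method being called is missing on the receiver's static type, or its arguments do not type-check. Since union() already exists on RDD, a conversion such as createSchemaRDD would never be tried for that call. The sketch below reproduces the effect without Spark. PlainColl, DedupColl, toDedup and ImplicitDemo are hypothetical stand-ins invented for illustration, not Spark classes.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for RDD: union keeps duplicates.
class PlainColl[T](val items: Seq[T]) {
  def union(other: PlainColl[T]): PlainColl[T] =
    new PlainColl(items ++ other.items)
}

// Hypothetical stand-in for SchemaRDD: union is overridden to drop duplicates.
class DedupColl[T](xs: Seq[T]) extends PlainColl[T](xs) {
  override def union(other: PlainColl[T]): DedupColl[T] =
    new DedupColl((items ++ other.items).distinct)
}

object ImplicitDemo {
  // Plays the role of an implicit conversion like createSchemaRDD.
  implicit def toDedup[T](c: PlainColl[T]): DedupColl[T] =
    new DedupColl(c.items)

  val a = new PlainColl(Seq(1, 2))
  val b = new PlainColl(Seq(2, 3))

  // union is already a member of PlainColl, so the compiler resolves the
  // call there and never considers toDedup. Duplicates are kept.
  val viaStatic = a.union(b)

  // The type ascription forces the conversion first, so the overriding
  // union in DedupColl is the one that runs.
  val viaExplicit = (a: DedupColl[Int]).union(b)

  def main(args: Array[String]): Unit = {
    println(viaStatic.items)   // duplicates kept
    println(viaExplicit.items) // duplicates removed
  }
}
```

If this is indeed what happens in SchemaRDDApp.scala, converting the receiver explicitly before calling union() would be the analogous step there. That has not been tried against the Spark code linked above.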

> Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4689
>                 URL: https://issues.apache.org/jira/browse/SPARK-4689
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>   Affects Versions: 1.1.0
>            Reporter: Chris Fregly
>            Priority: Minor
>              Labels: starter
>
> Currently, you need to use unionAll() in Scala.
> Python does not expose this functionality at the moment.
> The current work around is to use the UNION ALL HiveQL functionality detailed
> here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)