[ 
https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282119#comment-14282119
 ] 

Bibudh Lahiri commented on SPARK-4689:
--------------------------------------

I have added the following override method to SchemaRDD.scala (lines 311-315 in 
https://github.com/bibudhlahiri/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala
 ):

override def union(other: RDD[Row]): SchemaRDD = {
  logInfo("calling union in SchemaRDD")
  applySchema(super.union(other).distinct())
}

This follows the pattern of the other overridden methods, such as 
intersection(). However, when I run the test code in the following class:

  
https://github.com/bibudhlahiri/spark/blob/master/dev/audit-release/sbt_app_schema_rdd/src/main/scala/SchemaRDDApp.scala

the union of people and people1 at line 52 does not eliminate the duplicates. 
Log messages that I added to the union(other: RDD[T]) method in RDD.scala show 
that the union() in RDD is being called (which is known to keep identical 
elements multiple times), not the union() in SchemaRDD.

To investigate, I printed the class of the objects people and people1 at 
line 43. They are MapPartitionsRDD (which is what the map() method of RDD 
returns) rather than SchemaRDD, although I expected SchemaRDD because of the 
implicit method createSchemaRDD.

To check whether a similar situation arises for other methods that have been 
overridden in SchemaRDD (like intersection()), I wrote some test code at 
lines 48-50. There, too, the method in RDD appears to be the one called, 
although that does not cause a problem in the result.
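
A possible explanation, offered as a sketch rather than a confirmed diagnosis: 
Scala applies an implicit conversion only when an expression's type does not 
fit the expected type, or when the selected member is missing or not 
applicable on the receiver. Since RDD already defines union(), a call on a 
plain RDD would never trigger createSchemaRDD. The classes PlainRdd and 
SchemaLikeRdd and the conversion toSchemaLike below are hypothetical stand-ins 
and are not Spark's RDD, SchemaRDD, or createSchemaRDD.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for RDD: union keeps duplicates.
class PlainRdd(val rows: Seq[String]) {
  def union(other: PlainRdd): PlainRdd = new PlainRdd(rows ++ other.rows)
}

// Hypothetical stand-in for SchemaRDD: union removes duplicates.
class SchemaLikeRdd(rs: Seq[String]) extends PlainRdd(rs) {
  override def union(other: PlainRdd): SchemaLikeRdd =
    new SchemaLikeRdd((rows ++ other.rows).distinct)
}

object ImplicitDemo {
  // Hypothetical analogue of an implicit conversion such as createSchemaRDD.
  implicit def toSchemaLike(rdd: PlainRdd): SchemaLikeRdd =
    new SchemaLikeRdd(rdd.rows)

  val people = new PlainRdd(Seq("alice", "bob"))
  val people1 = new PlainRdd(Seq("bob", "carol"))

  // PlainRdd already has union, so the compiler does not apply the
  // conversion here; the base-class method runs and duplicates remain.
  val viaBase: PlainRdd = people.union(people1)

  // A type ascription makes the expected type SchemaLikeRdd, so the
  // conversion is applied and the overriding union is used.
  val viaSchema: PlainRdd = (people: SchemaLikeRdd).union(people1)
}
```

In this sketch viaBase keeps both copies of "bob" while viaSchema does not, 
which matches the behaviour described above. If the same mechanism is at work 
in the test class, forcing the conversion with a type ascription (or calling 
the conversion explicitly) before union() would be one thing to try.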
    
Could you please suggest what I can do here without making any change to core 
classes like RDD?

> Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4689
>                 URL: https://issues.apache.org/jira/browse/SPARK-4689
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Chris Fregly
>            Priority: Minor
>              Labels: starter
>
> Currently, you need to use unionAll() in Scala.  
> Python does not expose this functionality at the moment.
> The current workaround is to use the UNION ALL HiveQL functionality detailed 
> here:  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union



