[
https://issues.apache.org/jira/browse/SPARK-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282119#comment-14282119
]
Bibudh Lahiri commented on SPARK-4689:
--------------------------------------
I have added the following override method to SchemaRDD.scala (lines 311-315 in
https://github.com/bibudhlahiri/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala
):
    override def union(other: RDD[Row]): SchemaRDD = {
      logInfo("calling union in SchemaRDD")
      applySchema(super.union(other).distinct())
    }
This follows the pattern of the other overridden methods, such as
intersection(). However, when I run the test code in the following class:
https://github.com/bibudhlahiri/spark/blob/master/dev/audit-release/sbt_app_schema_rdd/src/main/scala/SchemaRDDApp.scala
the union of people and people1 at line 52 does not eliminate the duplicates.
Log messages that I added to the union(other: RDD[T]) method in RDD.scala show
that the union() in RDD is being called (which is known to keep duplicate
elements), not the union() in SchemaRDD.
To investigate, I printed the classes of the objects people and people1 at
line 43. They are MapPartitionsRDD (which is what the map() method of RDD
returns) rather than SchemaRDD, although I expected SchemaRDD because of the
implicit method createSchemaRDD.
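One possible explanation, offered as a sketch rather than a diagnosis of the
actual test code: Scala applies an implicit conversion only when an expression
does not already type-check. Because union() exists on the base RDD type, a
value whose static type is RDD is never converted, so the override is skipped.
The classes below (SimpleRDD, SimpleSchemaRDD, toSchema) are hypothetical
stand-ins invented for this illustration and are not Spark's code.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for RDD: union keeps duplicates, map returns the base type.
class SimpleRDD[T](val items: Seq[T]) {
  def union(other: SimpleRDD[T]): SimpleRDD[T] = new SimpleRDD(items ++ other.items)
  def map[U](f: T => U): SimpleRDD[U] = new SimpleRDD(items.map(f))
}

// Hypothetical stand-in for SchemaRDD: union is overridden to drop duplicates.
class SimpleSchemaRDD(rows: Seq[String]) extends SimpleRDD[String](rows) {
  override def union(other: SimpleRDD[String]): SimpleSchemaRDD =
    new SimpleSchemaRDD((items ++ other.items).distinct)
}

object ImplicitDemo {
  // Plays the role that createSchemaRDD plays in the comment above (assumption).
  implicit def toSchema(rdd: SimpleRDD[String]): SimpleSchemaRDD =
    new SimpleSchemaRDD(rdd.items)

  // map() returns the base type, so both values are plain SimpleRDD[String].
  val people: SimpleRDD[String]  = new SimpleRDD(Seq("Alice", "Bob")).map(_.trim)
  val people1: SimpleRDD[String] = new SimpleRDD(Seq("Bob", "Carol")).map(_.trim)

  // union is already a member of SimpleRDD, so no conversion is applied
  // and the base implementation runs.
  val viaBase = people.union(people1)

  // A type ascription forces the conversion, so the override runs.
  val viaSchema = (people: SimpleSchemaRDD).union(people1)

  def main(args: Array[String]): Unit = {
    println(viaBase.items)   // List(Alice, Bob, Bob, Carol)
    println(viaSchema.items) // List(Alice, Bob, Carol)
  }
}
```

If the same rule is what is happening in SchemaRDDApp.scala, explicitly
converting people to a SchemaRDD before calling union() might reach the
override without touching RDD itself; this has not been checked against the
code linked above.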
To check whether the same thing happens with other methods that have been
overridden in SchemaRDD (like intersection()), I wrote some test code at
lines 48-50. There, too, the version in RDD is being called, although that
does not change the result.
Can you please suggest what I can do here, without making any change to
the core classes like RDD?
> Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java
> --------------------------------------------------------------------------
>
> Key: SPARK-4689
> URL: https://issues.apache.org/jira/browse/SPARK-4689
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Chris Fregly
> Priority: Minor
> Labels: starter
>
> Currently, you need to use unionAll() in Scala.
> Python does not expose this functionality at the moment.
> The current work around is to use the UNION ALL HiveQL functionality detailed
> here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union