[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

Imran Rashid (JIRA) Thu, 15 Jan 2015 16:45:06 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279602#comment-14279602
 ]


Imran Rashid commented on SPARK-3622:
-------------------------------------

In some ways this kinda reminds of the problem w/ accumulators and lazy 
transformations.  Accumulators are basically multiple output, but Spark itself 
provides no way to track when that output is ready.  Its up to the developer to 
figure it out.

If you do a transformation on {{rddA}} you've got to know to "wait" until 
you've also got a transformation on {{rddB}} ready as well.  Probably the 
simplest case for this is filtering records by some condition, but keeping both 
the good and bad records, ala scala collection's {{partition}} method.  I think 
this has come up on the user mailing list a few times.

What about having some new type {{MultiRDD}}, which only runs when you've 
queued up an action on *all* RDDs?  eg. something like:

{code}
val input: RDD[String] = ...
val goodAndBad: MultiRdd[String, String] = input.partition{ str => 
MyRecordParser.isOk(str)}
val bad: RDD[String] = goodAndBad.get(1)
bad.saveAsTextFile(...) // doesn't do anything yet
val parsed: RDD[MyCaseClass] = goodAndBad.get(0).map{str => 
MyRecordParser.parse(str)}
val tmp: RDD[MyCaseClass] = parsed.map{f1}.filter{f2}.mapPartitions{f3} //still 
don't do anything ...
val result = tmp.reduce{reduceFunc} // now everything gets run
{code}

> Provide a custom transformation that can output multiple RDDs
> -------------------------------------------------------------
>
>                 Key: SPARK-3622
>                 URL: https://issues.apache.org/jira/browse/SPARK-3622
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>
> All existing transformations return just one RDD at most, even for those 
> which takes user-supplied functions such as mapPartitions() . However, 
> sometimes a user provided function may need to output multiple RDDs. For 
> instance, a filter function that divides the input RDD into serveral RDDs. 
> While it's possible to get multiple RDDs by transforming the same RDD 
> multiple times, it may be more efficient to do this concurrently in one shot. 
> Especially user's existing function is already generating different data sets.
> This the case in Hive on Spark, where Hive's map function and reduce function 
> can output different data sets to be consumed by subsequent stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

Reply via email to