[
https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279602#comment-14279602
]
Imran Rashid commented on SPARK-3622:
-------------------------------------
In some ways this kinda reminds of the problem w/ accumulators and lazy
transformations. Accumulators are basically multiple output, but Spark itself
provides no way to track when that output is ready. Its up to the developer to
figure it out.
If you do a transformation on {{rddA}} you've got to know to "wait" until
you've also got a transformation on {{rddB}} ready as well. Probably the
simplest case for this is filtering records by some condition, but keeping both
the good and bad records, ala scala collection's {{partition}} method. I think
this has come up on the user mailing list a few times.
What about having some new type {{MultiRDD}}, which only runs when you've
queued up an action on *all* RDDs? eg. something like:
{code}
val input: RDD[String] = ...
val goodAndBad: MultiRdd[String, String] = input.partition{ str =>
MyRecordParser.isOk(str)}
val bad: RDD[String] = goodAndBad.get(1)
bad.saveAsTextFile(...) // doesn't do anything yet
val parsed: RDD[MyCaseClass] = goodAndBad.get(0).map{str =>
MyRecordParser.parse(str)}
val tmp: RDD[MyCaseClass] = parsed.map{f1}.filter{f2}.mapPartitions{f3} //still
don't do anything ...
val result = tmp.reduce{reduceFunc} // now everything gets run
{code}
> Provide a custom transformation that can output multiple RDDs
> -------------------------------------------------------------
>
> Key: SPARK-3622
> URL: https://issues.apache.org/jira/browse/SPARK-3622
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Xuefu Zhang
>
> All existing transformations return just one RDD at most, even for those
> which takes user-supplied functions such as mapPartitions() . However,
> sometimes a user provided function may need to output multiple RDDs. For
> instance, a filter function that divides the input RDD into serveral RDDs.
> While it's possible to get multiple RDDs by transforming the same RDD
> multiple times, it may be more efficient to do this concurrently in one shot.
> Especially user's existing function is already generating different data sets.
> This the case in Hive on Spark, where Hive's map function and reduce function
> can output different data sets to be consumed by subsequent stages.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]