[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142881#comment-14142881 ]

Xuefu Zhang edited comment on SPARK-3622 at 9/22/14 4:39 AM:
-------------------------------------------------------------

Thanks for your comments, [~pwendell]. I understand caching A would be helpful 
if I need to transform it to get B and C separately. My proposal is to get B 
and C in just one pass over A, so A doesn't even need to be cached.

Here is an example of how it might be used in Hive.
{code}
JavaPairRDD table = sparkContext.hadoopRDD(..);
Map<String, JavaPairRDD> mappedRDDs = table.mapPartitions(mapFunction);
JavaPairRDD rddA = mappedRDDs.get("A");
JavaPairRDD rddB = mappedRDDs.get("B");
JavaPairRDD sortedRddA = rddA.sortByKey();
JavaPairRDD groupedRddB = rddB.groupByKey();
// further processing of sortedRddA and groupedRddB.
...
{code}
In this case, mapFunction can return named iterators for A and B. B is 
automatically computed whenever A is computed, and vice versa. Since both are 
computed if either of them is computed, a subsequent reference to either one 
should not trigger recomputation.

The benefits: 1) no need to cache A; 2) only one pass over the input.

I'm not sure whether this is feasible in Spark, but Hive's map function does 
exactly this: its operator tree can branch off anywhere, resulting in multiple 
output datasets from a single input dataset.

Please let me know if there are more questions.




> Provide a custom transformation that can output multiple RDDs
> -------------------------------------------------------------
>
>                 Key: SPARK-3622
>                 URL: https://issues.apache.org/jira/browse/SPARK-3622
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>
> All existing transformations return just one RDD at most, even those that 
> take user-supplied functions, such as mapPartitions(). However, sometimes a 
> user-provided function may need to output multiple RDDs. For instance, a 
> filter function that divides the input RDD into several RDDs. While it's 
> possible to get multiple RDDs by transforming the same RDD multiple times, 
> it may be more efficient to do this concurrently in one shot, especially 
> when the user's existing function is already generating different data sets. 
> This is the case in Hive on Spark, where Hive's map function and reduce 
> function can output different data sets to be consumed by subsequent stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
