[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

Xuefu Zhang (JIRA) Sun, 25 Jan 2015 07:21:07 -0800

    [ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291134#comment-14291134 ]


Xuefu Zhang commented on SPARK-2688:
------------------------------------

I think SPARK-3622 is related to this JIRA but not exactly the same. This JIRA 
essentially asks for the ability to execute an arbitrary DAG built of RDDs, 
while SPARK-3622 requests a custom transformation that takes one input RDD 
and generates multiple output RDDs. The Hive on Spark project needs this 
because Hive's map-side or reduce-side processing (which is translated to 
Spark's map functions) generally generates multiple outputs. On these grounds, 
SPARK-3622 is more important than SPARK-2688.

It's worth mentioning that such a custom transformation can be used in 
building an arbitrary DAG.
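
To make the idea of a one-input, multi-output transformation concrete, here is 
a minimal sketch in plain Python. It is not Spark's API and not the design 
proposed in SPARK-3622; the names multi_output_map and route are hypothetical, 
chosen only for illustration. The point is that a single pass over the input 
can feed several named outputs at once.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def multi_output_map(
    records: Iterable[Any],
    fn: Callable[[Any], Iterable[Tuple[str, Any]]],
) -> Dict[str, List[Any]]:
    """Apply fn once per record; fn yields (output_name, value) pairs.

    Hypothetical helper: models one input feeding several outputs
    in a single pass over the data.
    """
    outputs: Dict[str, List[Any]] = defaultdict(list)
    for record in records:
        for name, value in fn(record):
            outputs[name].append(value)
    return dict(outputs)


def route(n: int) -> Iterable[Tuple[str, int]]:
    # Hypothetical map-side logic: each input row may feed several outputs.
    yield ("all_squares", n * n)
    if n % 2 == 0:
        yield ("evens", n)
    else:
        yield ("odds", n)


# The input is traversed exactly once, yet three outputs are produced.
result = multi_output_map(range(5), route)
```

Each named output could then serve as the root of its own downstream branch, 
which is how such a transformation would help in building an arbitrary DAG.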


> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
> execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> rdd2 
> -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 is 
> recomputed. This is very inefficient. Ideally, we should be able to trigger 
> the execution of the whole graph and reuse rdd2, but there doesn't seem to be 
> a way to do so. Tez has already recognized the importance of this (TEZ-391), 
> so I think Spark should provide this too.
> This is required for Hive to support multi-insert queries (see HIVE-7292).
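
The recomputation described in the issue can be illustrated with a toy model 
in plain Python. This is a sketch under stated assumptions, not Spark code: 
LazyDataset and run_branches are hypothetical names, and the class only 
imitates lazy evaluation with lineage. It counts how often the shared 
intermediate dataset is evaluated when each branch is triggered by its own 
action, with and without caching.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's RDD)."""

    def __init__(self, compute: Callable[[], List[int]], name: str) -> None:
        self._compute = compute
        self.name = name
        self.evaluations = 0  # how many times this dataset was computed
        self._cached: Optional[List[int]] = None
        self._use_cache = False

    def map(self, fn: Callable[[int], int], name: str) -> "LazyDataset":
        # The child keeps a reference to its parent (its lineage) and
        # pulls the parent's data only when it is itself evaluated.
        return LazyDataset(lambda: [fn(x) for x in self.collect()], name)

    def cache(self) -> "LazyDataset":
        self._use_cache = True
        return self

    def collect(self) -> List[int]:
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data


def run_branches(cache_rdd2: bool) -> int:
    """Build rdd1 -> rdd2 -> {rdd3..rdd6}; return how often rdd2 ran."""
    rdd1 = LazyDataset(lambda: [1, 2, 3], "rdd1")
    rdd2 = rdd1.map(lambda x: x * 10, "rdd2")
    if cache_rdd2:
        rdd2.cache()
    branches = [rdd2.map(lambda x, k=k: x + k, f"rdd{k}") for k in (3, 4, 5, 6)]
    for branch in branches:
        branch.collect()  # one separate "action" per branch
    return rdd2.evaluations


uncached_runs = run_branches(cache_rdd2=False)  # rdd2 evaluated once per branch
cached_runs = run_branches(cache_rdd2=True)     # rdd2 evaluated once in total
```

In this model rdd2 is evaluated four times without caching and once with it. 
Spark's RDDs offer cache() and persist() for the same purpose, but each branch 
is still triggered by its own action, so the branches run as separate jobs 
rather than as one execution of the whole graph.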



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
