[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291192#comment-14291192 ]

Sean Owen commented on SPARK-2688:
----------------------------------

[~sandyr] Yes I can appreciate the difference between "pulling" N different 
RDDs from a single, persisted RDD, and "pushing" N different RDDs as the result 
of one pass over an RDD. (It's kind of interesting to think about what that 
means in terms of implementation -- it seems like this only works if the N 
output RDDs are persisted, since you have to immediately save the results.) 
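To make the distinction concrete, here is a hedged Scala sketch (not from the thread) of what "push" looks like in practice today: a single pass over the shared parent that routes records to N sinks as a side effect. `writerFor` is a hypothetical helper, invented for illustration; Spark itself offers no multi-output transformation.

{code}
// Hedged sketch: one pass over rdd2 producing N outputs as side effects.
// Because the derived datasets never exist as lazy RDDs, their contents
// must be written out (materialized) immediately -- matching the point
// above that push-style outputs have to be persisted as they are produced.
rdd2.foreachPartition { iter =>
  // writerFor is hypothetical: it returns some sink for each derived dataset
  val writers = Seq("rdd3", "rdd4", "rdd5", "rdd6").map(writerFor)
  iter.foreach { rec =>
    // route each record to whichever outputs apply (here, all of them)
    writers.foreach(_.write(rec))
  }
  writers.foreach(_.close())
}
{code}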

This seems to be the topic of SPARK-3622, however. It does seem to me like a 
different and valid operation to consider implementing.

This JIRA seems to be only about avoiding recomputing an RDD that is the common 
parent of several RDDs. That's already possible, of course. It's possible 
without persisting to disk too. Excluding the use cases that seem described by 
SPARK-3622, then, I'm not sure what this covers, which is why I suggested 
closing this in favor of it.
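For reference, the already-possible "pull" pattern referred to above can be sketched like this in Scala (rdd1 and the functions f3..f6 are stand-ins for whatever the real pipeline does):

{code}
// Persist the shared parent so each child pipeline reuses it instead of
// recomputing rdd1 -> rdd2 four times.
val rdd2 = rdd1.map(transform).cache()   // or .persist(StorageLevel.DISK_ONLY)
val rdd3 = rdd2.map(f3)
val rdd4 = rdd2.map(f4)
val rdd5 = rdd2.map(f5)
val rdd6 = rdd2.map(f6)

// The first action computes rdd1 -> rdd2 once and caches rdd2; the
// remaining actions read rdd2 from the cache.
Seq(rdd3, rdd4, rdd5, rdd6).foreach(_.count())
{code}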

> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from the 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger 
> execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> 
> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will 
> be recomputed. This is very inefficient. Ideally, we should be able to 
> trigger the execution of the whole graph and reuse rdd2, but there doesn't 
> seem to be a way of doing so. Tez already realized the importance of this 
> (TEZ-391), so I think Spark should provide this too.
> This is required for Hive to support multi-insert queries (HIVE-7292).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
