[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

Sandy Ryza (JIRA) Fri, 23 Jan 2015 11:41:51 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289813#comment-14289813
 ]


Sandy Ryza commented on SPARK-2688:
-----------------------------------

I agree that this is worth keeping open.  Allowing forks in a job's execution 
DAG without fully caching intermediate steps seems worthwhile to me.  This is 
something that other similar executions engines like Tez and Dryad support, 
and, more importantly, I've heard of a few valid use cases for it.

That said, I think this is fundamentally the same issue as SPARK-3622, so 
probably only worth keeping one of them.


> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rrdd6 are all derived from an 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
> execution. However, rdd.foreach(fn) only trigger pipeline rdd1 -> rdd2 -> 
> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
> recomputed. This is very inefficient. Ideally, we should be able to trigger 
> the execution the whole graph and reuse rdd2, but there doesn't seem to be a 
> way doing so. Tez already realized the importance of this (TEZ-391), so I 
> think Spark should provide this too.
> This is required for Hive to support multi-insert queries. HIVE-7292.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

Reply via email to