[
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-2688.
------------------------------
Resolution: Won't Fix
> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
> Key: SPARK-2688
> URL: https://issues.apache.org/jira/browse/SPARK-2688
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.0.1
> Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing:
> {code}
> rdd1 -> rdd2 -> rdd3
>           |  -> rdd4
>           |  -> rdd5
>           \  -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the
> execution. However, rdd3.foreach(fn) only triggers the pipeline rdd1 -> rdd2 ->
> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be
> recomputed. This is very inefficient. Ideally, we should be able to trigger
> execution of the whole graph and reuse rdd2, but there doesn't seem to be a
> way of doing so. Tez has already realized the importance of this (TEZ-391), so I
> think Spark should provide this too.
> This is required for Hive to support multi-insert queries (HIVE-7292).
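A minimal sketch of the recomputation problem described above, in plain Python rather than the Spark API (so it runs without a Spark installation): "rdd2" stands in for the shared lazy intermediate, and each downstream "action" that doesn't reuse a materialized copy re-runs the upstream work. All names here are illustrative, not Spark APIs.

```python
# Counter tracks how many times the shared intermediate is (re)computed.
compute_count = 0

def compute_rdd2():
    """Stand-in for the upstream pipeline rdd1 -> rdd2."""
    global compute_count
    compute_count += 1
    return [x * 2 for x in range(5)]

# Four downstream branches (analogous to rdd3..rdd6 plus an action each).
branches = [sum, max, min, len]

# Naive: each branch triggers the whole upstream pipeline again.
naive_results = [f(compute_rdd2()) for f in branches]
assert compute_count == 4  # rdd2 recomputed once per branch

# Reuse: materialize the intermediate once, then run every branch on it.
compute_count = 0
cached_rdd2 = compute_rdd2()
reused_results = [f(cached_rdd2) for f in branches]
assert compute_count == 1  # rdd2 computed a single time
assert naive_results == reused_results
```

In Spark itself, calling rdd2.cache() (or persist()) before the first action similarly avoids recomputing rdd2 across later actions on rdd3..rdd6; what the issue asks for, and what this sketch does not address, is a single call that triggers the whole DAG of branches concurrently.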