[
https://issues.apache.org/jira/browse/PIG-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606743#comment-14606743
]
liyunzhang_intel commented on PIG-4594:
---------------------------------------
[~mohitsabharwal]:
{quote}
In case 3 above (multiple splitees), looks like we could use RDD.cache() to
cache the output of b in your example.
Because, otherwise, since each Store corresponds to a Spark action, the entire
RDD lineage will be computed twice, once for each Store.
{quote}
This seems to be covered in the [PigOnSpark MileStone doc|https://docs.google.com/document/d/1R7O8BctJTHdMPlSy8A2imThRmhDtC2UB0HfWEsX2NGM/edit#heading=h.desnzoc5g4cs],
under "Re-design Spark Plan":
{quote}
Currently, the SparkLauncher converts the SparkPlan to RDD pipeline and
immediately executes it. There is no intermediate step that allows optimization
of the RDD pipeline, if so deemed necessary, before execution. This will need
re-working of sparkPlanToRDD(), perhaps by introduction of a RDDPlan of
RDDOperators.
{quote}
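To make the "intermediate step" idea concrete, here is a rough sketch in plain Python. It is not Pig or Spark code: PlanNode, shared_nodes and execute are hypothetical names made up for illustration, and the real RDDPlan/RDDOperator design may look quite different. The only point is that once the plan exists as data, it can be inspected (here: which nodes have more than one consumer) before anything is executed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlanNode:
    # Hypothetical stand-in for an "RDDOperator": a name, the names of its
    # inputs, and a function applied to the inputs' results.
    name: str
    inputs: List[str]
    fn: Callable[..., list]


def shared_nodes(plan: List[PlanNode]) -> List[str]:
    # Inspection pass: names of nodes consumed by more than one downstream node.
    consumers: Dict[str, int] = {}
    for node in plan:
        for src in node.inputs:
            consumers[src] = consumers.get(src, 0) + 1
    return sorted(name for name, count in consumers.items() if count > 1)


def execute(plan: List[PlanNode]) -> Dict[str, list]:
    # Runs nodes in list order; assumes the list is already topologically sorted.
    results: Dict[str, list] = {}
    for node in plan:
        results[node.name] = node.fn(*(results[src] for src in node.inputs))
    return results


# A split with two consumers of b, loosely mirroring the multi-store case.
plan = [
    PlanNode("a", [], lambda: [1, 2, 3, 4]),
    PlanNode("b", ["a"], lambda rows: [r * 10 for r in rows]),
    PlanNode("store1", ["b"], lambda rows: [r for r in rows if r > 20]),
    PlanNode("store2", ["b"], lambda rows: [r for r in rows if r <= 20]),
]

to_cache = shared_nodes(plan)  # decided before anything runs
results = execute(plan)
```

In this toy plan, shared_nodes reports b as the only node with multiple consumers, which is the kind of information a later optimization step could use.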
I think that once the SparkPlan redesign is implemented, we can use RDD.cache()
to cache the output of b in case 3 as an optimization.
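To illustrate why caching helps here, below is a toy model in plain Python of a lazily evaluated dataset with two actions on the same parent. It is only a sketch and does not use Spark: LazyDataset and run are hypothetical names, and the counter just shows how often the shared lineage of b gets evaluated with and without a cache.

```python
class LazyDataset:
    # Toy model of a lazily evaluated dataset; this is not Spark's API.
    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False
        self.evaluations = 0  # how many times this dataset was recomputed

    def cache(self):
        # Mark the dataset so its rows are kept after the first evaluation.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    def filter(self, pred):
        # Lazy transformation: nothing runs until an action is called.
        return LazyDataset(lambda: [r for r in self._materialize() if pred(r)])

    def collect(self):
        # An "action": forces evaluation of the whole lineage.
        return list(self._materialize())


def run(cache_b):
    b = LazyDataset(lambda: [x * 10 for x in range(1, 5)])
    if cache_b:
        b.cache()
    out1 = b.filter(lambda r: r > 20).collect()   # first "store"
    out2 = b.filter(lambda r: r <= 20).collect()  # second "store"
    return b.evaluations, out1, out2
```

Calling run(False) evaluates b once per action, while run(True) evaluates it only once in total; both return the same two outputs.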
Besides this suggestion, do you have any other comments on this patch?
> Enable "TestMultiQuery" in spark mode
> -------------------------------------
>
> Key: PIG-4594
> URL: https://issues.apache.org/jira/browse/PIG-4594
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4594.patch, PIG-4594_1.patch
>
>
> In https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, the
> following unit tests fail:
> org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1068
> org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1157
> org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1252
> org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1438
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)