Micah Whitacre created CRUNCH-510:
-------------------------------------
Summary: PCollection.materialize with Spark should use collect()
Key: CRUNCH-510
URL: https://issues.apache.org/jira/browse/CRUNCH-510
Project: Crunch
Issue Type: Improvement
Components: Core
Reporter: Micah Whitacre
Assignee: Josh Wills
While troubleshooting some other code, I noticed that when using the SparkPipeline
and the code forces a materialize() to be called...
{code}
delta = Aggregate.max(scores.parallelDo(new MapFn<Pair<String, PageRankData>, Float>() {
  @Override
  public Float map(Pair<String, PageRankData> input) {
    PageRankData prd = input.second();
    return Math.abs(prd.score - prd.lastScore);
  }
}, ptf.floats())).getValue();
{code}
the underlying code actually writes the value out to HDFS:
{noformat}
15/04/08 13:59:33 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at
SparkRuntime.java:332, took 0.223622 s
{noformat}
Since Spark RDDs have a collect() method that accomplishes similar
functionality, I wonder if we could switch to it and cut down on the need to
persist the value to HDFS. I think this currently happens because of logic
shared between MRPipeline and SparkPipeline, and I have no context on how
easily that could be broken apart.
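A minimal sketch of the idea, assuming the Spark materialization path has
access to the JavaRDD backing the PCollection (the variable names here are
illustrative, not the actual Crunch internals):

{code}
// Hypothetical alternative inside the Spark materialization path:
// rather than rdd.saveAsNewAPIHadoopFile(...) followed by reading the
// output files back, collect the results directly into the driver.
List<Pair<String, Float>> results = rdd.collect();

// Caveat: collect() pulls the entire RDD into driver memory, so this
// is only appropriate when the materialized output is known to be
// small (e.g., the single aggregate value in the example above).
{code}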
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)