Micah Whitacre created CRUNCH-510:
-------------------------------------

             Summary: PCollection.materialize with Spark should use collect()
                 Key: CRUNCH-510
                 URL: https://issues.apache.org/jira/browse/CRUNCH-510
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Micah Whitacre
            Assignee: Josh Wills
While troubleshooting some other code, I noticed that when using the SparkPipeline, code that forces a materialize() to be called...

{code}
delta = Aggregate.max(scores.parallelDo(new MapFn<Pair<String, PageRankData>, Float>() {
  @Override
  public Float map(Pair<String, PageRankData> input) {
    PageRankData prd = input.second();
    return Math.abs(prd.score - prd.lastScore);
  }
}, ptf.floats())).getValue();
{code}

...actually results in the value being written out to HDFS:

{noformat}
15/04/08 13:59:33 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.223622 s
{noformat}

Since Spark's RDDs already have a collect() method that accomplishes similar functionality, I wonder if we could switch to using it and cut down on the need to persist to HDFS. I think this is currently happening because logic is shared between MRPipeline and SparkPipeline, and I don't have enough context to know how easily it could be broken apart.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
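For reference, a minimal sketch of what the Spark-specific materialize path could look like using collect(). This is a hypothetical illustration, not the actual SparkRuntime code; the variable names and the surrounding wiring are assumptions:

{code}
// Hypothetical sketch only -- not the actual SparkRuntime internals.
// Instead of rdd.saveAsNewAPIHadoopFile(...) and then reading the files
// back from HDFS, a Spark-specific materialize path could pull the
// results straight to the driver:
JavaRDD<Float> rdd = ...;            // the RDD backing the PCollection (elided)
List<Float> results = rdd.collect(); // returns all elements to the driver
// 'results' could then back the Iterable returned by materialize(),
// with no round trip through HDFS.
{code}

One caveat worth noting: collect() pulls the entire RDD into driver memory, so this only fits cases where the materialized collection is small, as with the single aggregate value in the example above.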