[ https://issues.apache.org/jira/browse/CRUNCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485924#comment-14485924 ]
Josh Wills commented on CRUNCH-510:
-----------------------------------

Should have said: the relevant bit of code in that Spark Dataflow file is around line 162.

> PCollection.materialize with Spark should use collect()
> --------------------------------------------------------
>
>                 Key: CRUNCH-510
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-510
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>
> While troubleshooting some other code, I noticed that when using the SparkPipeline, code that forces a materialize() to be called...
> {code}
> delta = Aggregate.max(scores.parallelDo(new MapFn<Pair<String, PageRankData>, Float>() {
>   @Override
>   public Float map(Pair<String, PageRankData> input) {
>     PageRankData prd = input.second();
>     return Math.abs(prd.score - prd.lastScore);
>   }
> }, ptf.floats())).getValue();
> {code}
> ...actually results in the value being written out to HDFS:
> {noformat}
> 15/04/08 13:59:33 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.223622 s
> {noformat}
> Since Spark's RDDs already have a collect() method that accomplishes much the same thing, I wonder if we could switch to using it and cut down on the need to persist to HDFS. I think this is currently happening because of shared logic between MRPipeline and SparkPipeline, and I have no context on how we could break that apart easily.
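For illustration, here is a minimal, self-contained sketch of the proposed approach at the Spark level: pulling materialized values back to the driver with JavaRDD.collect() rather than round-tripping them through saveAsNewAPIHadoopFile and an HDFS read. This is a standalone toy example, not Crunch's actual SparkRuntime code; the class name and the deltas/maxDelta variables are made up for the sketch.

{code}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CollectInsteadOfHdfs {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("collect-sketch").setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Stand-in for the per-element score deltas computed by the parallelDo above.
    JavaRDD<Float> deltas = jsc.parallelize(Arrays.asList(0.5f, 0.1f, 0.9f));

    // Instead of deltas.saveAsNewAPIHadoopFile(...) followed by reading the
    // output back from HDFS, collect() ships the elements directly to the driver.
    List<Float> materialized = deltas.collect();

    float maxDelta = Float.NEGATIVE_INFINITY;
    for (Float d : materialized) {
      maxDelta = Math.max(maxDelta, d);
    }
    System.out.println("max delta = " + maxDelta);

    jsc.stop();
  }
}
{code}

The usual caveat applies: collect() pulls the entire RDD into driver memory, so it only suits materializations known to be small, like the single aggregated max value in this issue.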