[
https://issues.apache.org/jira/browse/CRUNCH-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864474#comment-13864474
]
Jason Gauci edited comment on CRUNCH-320 at 1/7/14 6:10 PM:
------------------------------------------------------------
It does not work this way. Here is an example: ( http://pastebin.com/MibssGcC )
public static void main(String[] args) {
Pipeline pipeline = new MRPipeline(GetSessionLength.class);
PCollection<String> rawInput = pipeline
.readTextFile("/user/jgauci/CommonCrawl/textData-00000");
PObject<Long> length = rawInput.length();
PObject<Long> sampledLength = Sample.sample(rawInput,
0.5).length();
System.out.println(System.currentTimeMillis() + " Running
pipeline");
pipeline.run();
System.out.println(System.currentTimeMillis() + " Finished
running pipeline. Evaluating deferred pobjects");
System.out.println(System.currentTimeMillis() + "Lengths: " +
length.getValue() + " " + sampledLength.getValue());
pipeline.done();
}
The pipeline.run() is a no-op, and the work happens in the getValue() functions
below. Also, each getValue() blocks so they cannot run in parallel.
was (Author: jgmath2000):
It does not work this way. Here is an example:
public static void main(String[] args) {
Pipeline pipeline = new MRPipeline(GetSessionLength.class);
PCollection<String> rawInput = pipeline
.readTextFile("/user/jgauci/CommonCrawl/textData-00000");
PObject<Long> length = rawInput.length();
PObject<Long> sampledLength = Sample.sample(rawInput,
0.5).length();
System.out.println(System.currentTimeMillis() + " Running
pipeline");
pipeline.run();
System.out.println(System.currentTimeMillis() + " Finished
running pipeline. Evaluating deferred pobjects");
System.out.println(System.currentTimeMillis() + "Lengths: " +
length.getValue() + " " + sampledLength.getValue());
pipeline.done();
}
The pipeline.run() is a no-op, and the work happens in the getValue() functions
below. Also, each getValue() blocks so they cannot run in parallel.
> Materialize several PObject & PCollection objects in parallel (deferred
> materialization)
> ----------------------------------------------------------------------------------------
>
> Key: CRUNCH-320
> URL: https://issues.apache.org/jira/browse/CRUNCH-320
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Jason Gauci
> Assignee: Josh Wills
>
> Currently, Crunch blocks and materializes PCollections (through
> foo.materialize()) and PObjects (through foo.getValue()) on demand, but it
> would be a significant performance improvement if we could mark several of
> these objects as to be materialized, and then materialize all of them in
> parallel as part of a pipeline.run() call.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)