[
https://issues.apache.org/jira/browse/CRUNCH-320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Wills updated CRUNCH-320:
------------------------------
Attachment: CRUNCH-320.patch
Here's a patch for this. Thanks for digging this up, and sorry for the trouble.
As a workaround for your example, you can call materialize() on rawInput and
on Sample.sample(rawInput, 0.5) directly, and then call the PObject methods to
get their lengths. We'll only materialize the collection once, and that should
signal the outputs to the planner. (If you're using Crunch 0.9.0 or 0.8.2, we
added a cache() method to PCollection that makes this process more readable,
so that you could do:
rawInput.cache().length();
Sample.sample(rawInput, 0.5).cache().length();
to make the workaround a little bit cleaner.)
> Materialize several PObject & PCollection objects in parallel (deferred
> materialization)
> ----------------------------------------------------------------------------------------
>
> Key: CRUNCH-320
> URL: https://issues.apache.org/jira/browse/CRUNCH-320
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Jason Gauci
> Assignee: Josh Wills
> Attachments: CRUNCH-320.patch
>
>
> Currently, Crunch blocks and materializes PCollections (through
> foo.materialize()) and PObjects (through foo.getValue()) on demand, but it
> would be a significant performance improvement if we could mark several of
> these objects as to be materialized, and then materialize all of them in
> parallel as part of a pipeline.run() call.
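
To make the request above concrete, here is a minimal Java sketch of the
deferred pattern being asked for, using only the standard library: values are
registered up front, and a single run() call computes all of them in parallel.
DeferredPipeline, Deferred, and defer() are hypothetical names chosen for
illustration. None of this is Crunch API, and it is not a description of how
the attached patch works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical stand-in for a pipeline; NOT part of the Crunch API.
class DeferredPipeline {

    // Handle for a value that is only computed once run() is called.
    static final class Deferred<T> {
        private final Supplier<T> computation;
        private CompletableFuture<T> future; // null until run() starts it

        private Deferred(Supplier<T> computation) {
            this.computation = computation;
        }

        private void start(ExecutorService pool) {
            future = CompletableFuture.supplyAsync(computation, pool);
        }

        // Blocks until the value is ready; fails fast if run() was never called.
        T getValue() {
            if (future == null) {
                throw new IllegalStateException("call run() before getValue()");
            }
            return future.join();
        }
    }

    private final List<Deferred<?>> pending = new ArrayList<>();

    // Marks a computation as "to be materialized" without running it.
    <T> Deferred<T> defer(Supplier<T> computation) {
        Deferred<T> d = new Deferred<>(computation);
        pending.add(d);
        return d;
    }

    // Starts every pending computation, then waits for all of them.
    void run() {
        if (pending.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(pending.size());
        try {
            for (Deferred<?> d : pending) {
                d.start(pool);   // all computations are submitted first
            }
            for (Deferred<?> d : pending) {
                d.getValue();    // then each one is awaited
            }
        } finally {
            pool.shutdown();
            pending.clear();
        }
    }
}
```

In this sketch, a caller would register two values with defer(), call run()
once, and then read both with getValue() without triggering a separate
blocking computation for each.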
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)