[ https://issues.apache.org/jira/browse/CRUNCH-88?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Wills updated CRUNCH-88:
-----------------------------
Attachment: CRUNCH-88.patch
So it turned out to be an execution problem, not a planning problem. If a
groupByKey has multiple children, the first child can consume all of the
Iterable<V> values before the other children get a chance to process them,
since the values can only be iterated over once. The solution I implemented
detects when we're in this situation and caches the Iterable<V> in memory so
that each child can process it in turn. I imagine we'll need to make it more
clever over time (to support, e.g., spilling to disk), but this fixes the
immediate problem.
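To illustrate the failure mode and the caching idea, here is a minimal sketch in plain Java. It does not use the Crunch API and is not the code in the attached patch; the class SinglePassDemo and the helpers singlePass, fanOut, withoutCache and withCache are hypothetical names made up for this example. It assumes the grouped values behave like an Iterable backed by a single underlying iterator, so a second traversal sees nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only; not part of Crunch.
class SinglePassDemo {

    // Wraps one iterator so that every call to iterator() returns the same,
    // possibly already exhausted, instance (an assumed model of grouped values).
    static <V> Iterable<V> singlePass(Iterator<V> source) {
        return () -> source;
    }

    // Hands the values to each child. With more than one child, the values
    // are first copied into an in-memory list that can be traversed repeatedly.
    static <V> void fanOut(Iterable<V> values, List<Consumer<Iterable<V>>> children) {
        Iterable<V> input = values;
        if (children.size() > 1) {
            List<V> cached = new ArrayList<>();
            for (V v : values) {
                cached.add(v);
            }
            input = cached;
        }
        for (Consumer<Iterable<V>> child : children) {
            child.accept(input);
        }
    }

    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static int count(Iterable<Integer> values) {
        int n = 0;
        for (Integer ignored : values) {
            n++;
        }
        return n;
    }

    // No caching: the first child drains the iterator, the second gets no data.
    static int[] withoutCache() {
        Iterable<Integer> raw = singlePass(Arrays.asList(1, 2, 3).iterator());
        return new int[] { sum(raw), count(raw) };
    }

    // With caching: both children see all three values.
    static int[] withCache() {
        Iterable<Integer> raw = singlePass(Arrays.asList(1, 2, 3).iterator());
        int[] out = new int[2];
        List<Consumer<Iterable<Integer>>> children = new ArrayList<>();
        children.add(vs -> out[0] = sum(vs));
        children.add(vs -> out[1] = count(vs));
        fanOut(raw, children);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(withoutCache())); // [6, 0]
        System.out.println(Arrays.toString(withCache()));    // [6, 3]
    }
}
```

The sketch holds every value for a key in memory at once, which is the trade-off behind the spilling-to-disk remark above.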
> Multiple parallelDos on a PGroupedTableImpl does not work
> ---------------------------------------------------------
>
> Key: CRUNCH-88
> URL: https://issues.apache.org/jira/browse/CRUNCH-88
> Project: Crunch
> Issue Type: Bug
> Affects Versions: 0.3.0
> Reporter: Gabriel Reid
> Assignee: Gabriel Reid
> Attachments: CRUNCH-88.patch, CRUNCH-88.patch
>
>
> Creating multiple distinct PCollections based on a single PGroupedTableImpl
> does not work correctly - the content of the PGroupedTableImpl will only be
> sent to a single outgoing PCollection, and all other PCollections that stem
> from the grouped table will not receive any data.