[ 
https://issues.apache.org/jira/browse/CRUNCH-88?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Wills updated CRUNCH-88:
-----------------------------

    Attachment: CRUNCH-88.patch

So it turned out to be an execution problem, not a planning problem. If a 
groupByKey has multiple children, the first child can consume all of the 
values in the Iterable<V> before the other children get a chance to process 
them. The solution I implemented detects when we're in this situation and 
caches the Iterable<V> in memory so it can be processed by each child in turn. 
I imagine we'll need to make it more clever over time (to support, e.g., 
spilling to disk), but this fixes the immediate problem.
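
To illustrate the failure mode described above, here is a minimal standalone 
sketch in plain Java. It is not the code from the attached patch and does not 
use the Crunch API: the class SinglePassDemo and the helpers singlePass, cache, 
and countValues are hypothetical names made up for this example. It assumes 
only that the grouped values behave like an Iterable<V> that can be traversed 
once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class; not part of Crunch or of CRUNCH-88.patch.
class SinglePassDemo {

    /** Stand-in for grouped values: every call to iterator() returns the
     *  same underlying Iterator, so the data can be traversed only once. */
    static <V> Iterable<V> singlePass(final Iterator<V> source) {
        return () -> source;
    }

    /** Copies all values into memory so each child gets its own full pass. */
    static <V> List<V> cache(Iterable<V> values) {
        List<V> copy = new ArrayList<>();
        for (V v : values) {
            copy.add(v);
        }
        return copy;
    }

    /** Stand-in for a child operation: counts the values it receives. */
    static <V> int countValues(Iterable<V> values) {
        int n = 0;
        for (V ignored : values) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Without caching: the first child drains the iterator.
        Iterable<Integer> raw = singlePass(Arrays.asList(1, 2, 3).iterator());
        System.out.println(countValues(raw)); // prints 3
        System.out.println(countValues(raw)); // prints 0

        // With caching: every child sees the full set of values.
        List<Integer> cached =
            cache(singlePass(Arrays.asList(1, 2, 3).iterator()));
        System.out.println(countValues(cached)); // prints 3
        System.out.println(countValues(cached)); // prints 3
    }
}
```

The in-memory copy trades memory for correctness, which is why a group with 
very many values would eventually call for something like spilling to disk.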
                
> Multiple parallelDos on a PGroupedTableImpl does not work
> ---------------------------------------------------------
>
>                 Key: CRUNCH-88
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-88
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-88.patch, CRUNCH-88.patch
>
>
> Creating multiple distinct PCollections based on a single PGroupedTableImpl 
> does not work correctly - the content of the PGroupedTableImpl will only be 
> sent to a single outgoing PCollection, and all other PCollections that stem 
> from the grouped table will not receive any data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
