[ 
https://issues.apache.org/jira/browse/CRUNCH-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470904#comment-13470904
 ] 

Gabriel Reid commented on CRUNCH-88:
------------------------------------

[~jwills] Yep, I think we came to the same conclusion but different solutions. 
I think there are a few issues with the patch you posted though.

The memory issue is definitely a worry, as loading all values under a single 
key into memory will be a problem for some pipelines at my work. We're not 
sending grouped tables to multiple output anywhere for the moment, but if we 
were to try to do that (as I was when I ran into this), the memory overload 
would be a showstopper.

The even bigger issue is object reuse. In most cases (i.e. any case where the 
values in the iterable have the same type as their serialization type, which is 
pretty much everything apart from primitive types and strings), the Iterable 
just returns the same single object instance with updated state on each 
iteration. The result is that the cached Iterable ends up being a list of 
references to that one object, whose state is that of the last-read value in 
the input Iterable. 
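
To make the aliasing concrete, here is a minimal plain-Java sketch. It does 
not use Crunch or Hadoop classes: Holder, reusingIterable and cacheNaively 
are hypothetical names that only imitate an Iterable which reuses one mutable 
value object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ObjectReuseDemo {

    /** Mutable value holder (hypothetical stand-in for a reused serialized value). */
    static final class Holder {
        int value;

        @Override
        public String toString() {
            return Integer.toString(value);
        }
    }

    /** Returns an Iterable that hands back the SAME Holder instance on every next(). */
    static Iterable<Holder> reusingIterable(int... values) {
        Holder shared = new Holder();
        return () -> new Iterator<Holder>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < values.length;
            }

            @Override
            public Holder next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = values[index++]; // overwrite state, reuse the object
                return shared;
            }
        };
    }

    /** Naive caching: stores references only, so every entry aliases one object. */
    static List<Holder> cacheNaively(Iterable<Holder> input) {
        List<Holder> cached = new ArrayList<>();
        for (Holder h : input) {
            cached.add(h);
        }
        return cached;
    }

    public static void main(String[] args) {
        List<Holder> cached = cacheNaively(reusingIterable(1, 2, 3));
        System.out.println(cached); // prints [3, 3, 3]
    }
}
```

All three list entries are the same object, so only the last-read state 
survives.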

We could get around this object reuse issue by using PType#getDetachedValue 
to create a deep copy of every value in the Iterable before sending it through 
to child RTNodes, but that would mean that we'd need to have access to the 
PType in RTNode. This would also double the memory usage of caching all values 
per key.
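
A rough sketch of that deep-copy approach, again in plain Java rather than 
against the real Crunch classes. The detach argument is an assumed stand-in 
for PType#getDetachedValue, and Holder, reusing and cacheDetached are 
hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.UnaryOperator;

public class DetachedCopyDemo {

    /** Mutable value holder (hypothetical stand-in for a reused value object). */
    static final class Holder {
        int value;

        Holder(int value) {
            this.value = value;
        }

        @Override
        public String toString() {
            return Integer.toString(value);
        }
    }

    /** Iterable that returns one shared Holder, overwritten on each next(). */
    static Iterable<Holder> reusing(int... values) {
        Holder shared = new Holder(0);
        return () -> new Iterator<Holder>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < values.length;
            }

            @Override
            public Holder next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = values[index++];
                return shared;
            }
        };
    }

    /**
     * Caches an independent copy of each element.
     * `detach` is an assumption standing in for PType#getDetachedValue.
     */
    static <T> List<T> cacheDetached(Iterable<T> input, UnaryOperator<T> detach) {
        List<T> cached = new ArrayList<>();
        for (T value : input) {
            cached.add(detach.apply(value)); // copy before the source overwrites it
        }
        return cached;
    }

    public static void main(String[] args) {
        List<Holder> cached = cacheDetached(reusing(1, 2, 3), h -> new Holder(h.value));
        System.out.println(cached); // prints [1, 2, 3]
    }
}
```

The copies fix the aliasing, but every value under a key is still held in 
memory at once, so the memory concern remains.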

The patch that I posted instead runs two parallel jobs, which is obviously 
less efficient but avoids both of these problems. I was thinking that this 
could be done more efficiently in the future by tagging each record, before 
the groupByKey, with the output path it needs to go to (in line with the whole 
MSCR fusion approach), but I didn't see that as feasible (at least not for me) 
in the short term.
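
One possible reading of the tagging idea, sketched in plain Java. This is an 
assumption about how it might look, not something from either patch: each 
record is emitted once per output with a tag in its key, so a single grouping 
step yields a separate group per output. The TreeMap only simulates the 
shuffle, and tagAndGroup and the "tag:key" format are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaggedShuffleDemo {

    /**
     * Emits every {key, value} record once per output, keyed by "tag:key",
     * then groups by that composite key (a stand-in for the shuffle).
     */
    static Map<String, List<String>> tagAndGroup(List<String[]> records, int numOutputs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] record : records) {
            for (int tag = 0; tag < numOutputs; tag++) {
                String compositeKey = tag + ":" + record[0];
                grouped.computeIfAbsent(compositeKey, k -> new ArrayList<>()).add(record[1]);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"a", "2"},
                new String[] {"b", "3"});
        System.out.println(tagAndGroup(records, 2));
        // prints {0:a=[1, 2], 0:b=[3], 1:a=[1, 2], 1:b=[3]}
    }
}
```

Each tagged group would be consumed by exactly one downstream output, so no 
group has to be cached or iterated twice, at the cost of shuffling each record 
once per output.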
                
> Multiple parallelDos on a PGroupedTableImpl does not work
> ---------------------------------------------------------
>
>                 Key: CRUNCH-88
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-88
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-88.patch, CRUNCH-88.patch
>
>
> Creating multiple distinct PCollections based on a single PGroupedTableImpl 
> does not work correctly - the content of the PGroupedTableImpl will only be 
> sent to a single outgoing PCollection, and all other PCollections that stem 
> from the grouped table will not receive any data.
