[
https://issues.apache.org/jira/browse/CRUNCH-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470904#comment-13470904
]
Gabriel Reid commented on CRUNCH-88:
------------------------------------
[~jwills] Yep, I think we came to the same conclusion but different solutions.
I think there are a few issues with the patch you posted though.
The memory issue is definitely a worry, as loading all values under a single
key into memory will be a problem for some pipelines at my work. We're not
sending grouped tables to multiple outputs anywhere for the moment, but if we
were to try to do that (as I was when I ran into this), the memory overload
would be a showstopper.
The even bigger issue is object reuse. In most cases (i.e. any case where the
values in the iterable have the same type as their serialization type, which is
pretty much everything apart from primitive types and strings), the Iterable
just returns the same single object with updated state on each iteration. The
result is that the cached Iterable ends up being a list of references to that
one object, whose state is that of the last value read from the input
Iterable.
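To make the object reuse problem concrete, here is a minimal plain-Java sketch
that does not use any Crunch classes. Holder, reusingIterable and cacheNaively
are hypothetical names made up for this illustration; the Iterable only
imitates the reuse behaviour described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class ObjectReuseDemo {

    /** Hypothetical mutable value, standing in for a reused record object. */
    static final class Holder {
        int value;
    }

    /** Returns an Iterable that hands out ONE shared Holder, mutated on each next(). */
    static Iterable<Holder> reusingIterable(int... values) {
        return () -> new Iterator<Holder>() {
            private final Holder shared = new Holder();
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < values.length;
            }

            @Override
            public Holder next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = values[index++]; // overwrite state, same instance
                return shared;
            }
        };
    }

    /** Naive caching: stores references to whatever the iterator returns. */
    static List<Holder> cacheNaively(Iterable<Holder> input) {
        List<Holder> cached = new ArrayList<>();
        for (Holder h : input) {
            cached.add(h);
        }
        return cached;
    }

    public static void main(String[] args) {
        List<Holder> cached = cacheNaively(reusingIterable(1, 2, 3));
        for (Holder h : cached) {
            System.out.println(h.value); // every entry shows the last-read value
        }
    }
}
```

All three entries in the cached list point at the same instance, so only the
state of the last value read is still visible.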
We could get around this object reuse issue by using PType#getDetachedValue to
create a deep copy of every value in the Iterable before sending it through
to child RTNodes, but that would mean that we'd need to have access to the
PType in RTNode. This would also double the memory usage of caching all values
per key.
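As a rough sketch of the deep-copy idea, again in plain Java without Crunch:
the detach function below plays the role that PType#getDetachedValue would
play, but it is just a hypothetical UnaryOperator passed in by the caller, and
Holder, reusingIterable and cacheDetached are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.UnaryOperator;

final class DetachedCopyDemo {

    /** Hypothetical mutable value, standing in for a reused record object. */
    static final class Holder {
        int value;

        Holder(int value) {
            this.value = value;
        }
    }

    /** Imitates an input Iterable that reuses a single Holder instance. */
    static Iterable<Holder> reusingIterable(int... values) {
        Holder shared = new Holder(0);
        return () -> new Iterator<Holder>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < values.length;
            }

            @Override
            public Holder next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = values[index++];
                return shared;
            }
        };
    }

    /** Caches the input, copying each value with the supplied detach function. */
    static <T> List<T> cacheDetached(Iterable<T> input, UnaryOperator<T> detach) {
        List<T> cached = new ArrayList<>();
        for (T value : input) {
            cached.add(detach.apply(value)); // store an independent copy
        }
        return cached;
    }

    public static void main(String[] args) {
        List<Holder> cached =
            cacheDetached(reusingIterable(1, 2, 3), h -> new Holder(h.value));
        for (Holder h : cached) {
            System.out.println(h.value);
        }
    }
}
```

Each cached entry now keeps its own state, but the list still holds every value
for the key in memory, so the memory concern is not addressed by this approach.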
The patch that I posted results in two parallel jobs being run to get around
this issue, which is obviously less efficient, but doesn't have these
problems. I was thinking that this could be done more efficiently in the
future by tagging records with the output path they are destined for before
the groupByKey (in line with the whole MSCR fusion approach), but I didn't see
that as feasible (at least not for me) in the short term.
> Multiple parallelDos on a PGroupedTableImpl does not work
> ---------------------------------------------------------
>
> Key: CRUNCH-88
> URL: https://issues.apache.org/jira/browse/CRUNCH-88
> Project: Crunch
> Issue Type: Bug
> Affects Versions: 0.3.0
> Reporter: Gabriel Reid
> Assignee: Gabriel Reid
> Attachments: CRUNCH-88.patch, CRUNCH-88.patch
>
>
> Creating multiple distinct PCollections based on a single PGroupedTableImpl
> does not work correctly - the content of the PGroupedTableImpl will only be
> sent to a single outgoing PCollection, and all other PCollections that stem
> from the grouped table will not receive any data.