lukecwik commented on a change in pull request #12088:
URL: https://github.com/apache/beam/pull/12088#discussion_r447128500



##########
File path: sdks/python/apache_beam/transforms/util.py
##########
@@ -741,6 +741,7 @@ def WithKeys(pcoll, k):
 
 @experimental()
 @typehints.with_input_types(Tuple[K, V])
+@typehints.with_output_types(Tuple[K, List[V]])

Review comment:
   > The Java implementation splits (key, value) pairs.
   >
   > https://github.com/apache/beam/blob/eaa41cc4cbcc4f94d0ec1a36ff2b0f3fcee962f9/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L173-L174
   >
   > I don't see that in Python - is the runner supposed to do that?
   
   This is a space-saving optimization: within one batch the key is always the 
same. Instead of keeping a bag of (<K, V1>, <K, V2>, <K, V3>, ...), we use two 
state cells, one storing K and the other storing (V1, V2, V3, ...), and combine 
them to recreate the output. It would be worthwhile to replicate this 
optimization in Python as well, but it isn't necessary from a correctness point 
of view.
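
To make the trade-off concrete, here is a minimal sketch in plain Python. It does not use Beam's state API; `PairBagBuffer` and `SplitKeyValueBuffer` are hypothetical names invented for this illustration, and `stored_objects` is only a rough stand-in for state size. The first class buffers whole (key, value) pairs, the second keeps the key once next to a bag of values. Both flush to the same (key, [values]) output.

```python
from typing import Any, Hashable, List, Optional, Tuple

Batch = Tuple[Hashable, List[Any]]


class PairBagBuffer:
    """Hypothetical buffer: one bag holding whole (key, value) pairs."""

    def __init__(self) -> None:
        self._pairs: List[Tuple[Hashable, Any]] = []

    def add(self, key: Hashable, value: Any) -> None:
        self._pairs.append((key, value))

    def stored_objects(self) -> int:
        # Every buffered element carries its own copy of the key.
        return 2 * len(self._pairs)

    def flush(self) -> Optional[Batch]:
        if not self._pairs:
            return None
        key = self._pairs[0][0]
        values = [value for _, value in self._pairs]
        self._pairs = []
        return key, values


class SplitKeyValueBuffer:
    """Hypothetical buffer: one cell for the key, one bag for the values."""

    def __init__(self) -> None:
        self._key: Optional[Hashable] = None
        self._has_key = False
        self._values: List[Any] = []

    def add(self, key: Hashable, value: Any) -> None:
        if not self._has_key:
            # The key is written once; later elements share it.
            self._key = key
            self._has_key = True
        self._values.append(value)

    def stored_objects(self) -> int:
        return (1 if self._has_key else 0) + len(self._values)

    def flush(self) -> Optional[Batch]:
        if not self._has_key:
            return None
        batch = (self._key, self._values)
        # Clear both cells so the next batch starts empty.
        self._key, self._has_key, self._values = None, False, []
        return batch
```

For n buffered elements the pair bag holds 2n objects while the split layout holds n + 1, and the emitted batch is identical, which is why the split is an optimization rather than a correctness requirement.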



