lukecwik commented on a change in pull request #12088: URL: https://github.com/apache/beam/pull/12088#discussion_r447128500
##########
File path: sdks/python/apache_beam/transforms/util.py
##########
@@ -741,6 +741,7 @@ def WithKeys(pcoll, k):
 @experimental()
 @typehints.with_input_types(Tuple[K, V])
+@typehints.with_output_types(Tuple[K, List[V]])

Review comment:
   > The Java implementation splits (key, value) pairs.
   > https://github.com/apache/beam/blob/eaa41cc4cbcc4f94d0ec1a36ff2b0f3fcee962f9/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L173-L174
   >
   > I don't see that in Python - is the runner supposed to do that?

   This is a space-saving optimization, since the key is always going to be the same. Instead of having a bag of (<K, V1>, <K, V2>, <K, V3>, ...), we use two state cells, one storing K and the other storing (V1, V2, V3, ...), and combine them to recreate the output.

   It would be worthwhile to replicate this optimization in Python as well, but it isn't necessary from a correctness point of view.
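To make the two layouts concrete, here is a minimal plain-Python sketch. It does not use Beam's state API: `PairBagBuffer` and `SplitBuffer` are hypothetical names, and ordinary Python attributes stand in for per-key state cells. It only shows that both layouts emit the same `(K, List[V])` batch while the split layout stores the key once.

```python
from typing import Generic, List, Optional, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class PairBagBuffer(Generic[K, V]):
    """Naive layout: a single bag holding full (key, value) pairs."""

    def __init__(self) -> None:
        self.pairs: List[Tuple[K, V]] = []

    def add(self, key: K, value: V) -> None:
        self.pairs.append((key, value))

    def flush(self) -> Optional[Tuple[K, List[V]]]:
        if not self.pairs:
            return None
        # Every pair carries the same key, so take it from the first one.
        key = self.pairs[0][0]
        values = [value for _, value in self.pairs]
        self.pairs = []
        return key, values


class SplitBuffer(Generic[K, V]):
    """Split layout: one cell for the key, one bag for the values."""

    def __init__(self) -> None:
        self.key: Optional[K] = None  # stands in for a single-value cell
        self.values: List[V] = []     # stands in for a bag cell

    def add(self, key: K, value: V) -> None:
        # State is scoped per key, so overwriting with the same key is harmless.
        self.key = key
        self.values.append(value)

    def flush(self) -> Optional[Tuple[K, List[V]]]:
        if not self.values:
            return None
        # Recombine the two cells to recreate the output element.
        batch = (self.key, self.values)
        self.key, self.values = None, []
        return batch


naive: PairBagBuffer[str, int] = PairBagBuffer()
split: SplitBuffer[str, int] = SplitBuffer()
for v in (1, 2, 3):
    naive.add("k", v)
    split.add("k", v)

naive_batch = naive.flush()
split_batch = split.flush()
```

Both `naive_batch` and `split_batch` hold the same batch for key `"k"`; the difference is only in what is buffered between flushes, which is why the split is a storage optimization and not a correctness requirement.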
