nehsyc commented on a change in pull request #13292:
URL: https://github.com/apache/beam/pull/13292#discussion_r526309131
##########
File path: sdks/python/apache_beam/transforms/util.py
##########
@@ -780,6 +783,48 @@ def expand(self, pcoll):
self.max_buffering_duration_secs,
self.clock))
+ @typehints.with_input_types(Tuple[K, V])
+ @typehints.with_output_types(Tuple[K, Iterable[V]])
+ class WithShardedKey(PTransform):
+ """A GroupIntoBatches transform that outputs batched elements associated
+ with sharded input keys.
+
+ The sharding is determined by the runner to balance the load during the
Review comment:
In Dataflow, the load balancing is done inside Windmill during the
shuffle before the `GroupIntoBatchesDoFn`. Basically Windmill can re-distribute
the input element to different workers and assign them different shard ids
accordingly.
Different runner may choose different strategies by overriding the default
implementation here. We will add the transform to the runner api (in follow-up
PRs) so runners are able to recognize the transform and do something special if
they want.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]