jrmccluskey commented on code in PR #32082:
URL: https://github.com/apache/beam/pull/32082#discussion_r1725444041


##########
sdks/python/apache_beam/transforms/util.py:
##########
@@ -802,6 +802,13 @@ class BatchElements(PTransform):
   corresponding to its contents. Each batch is emitted with a timestamp at
   the end of their window.
 
+  When the max_batch_duration_secs arg is provided, a stateful implementation
+  of BatchElements is used to batch elements across bundles. This is most
+  impactful in streaming applications where many bundles only contain one
+  element. Larger max_batch_duration_secs values will reduce the throughput of

Review Comment:
   Potentially, yes. The slowdown is most pronounced in the incomplete-bundle 
case. Without the stateful batching, elements are emitted downstream as they 
arrive at BatchElements, since they come in as single-element bundles. If nine 
elements arrive within a span of five seconds, you've emitted those nine 
elements within that span as well (abstracting away any overhead from the code 
emitting the bundles of one). Meanwhile, if we're batching statefully **and** 
the target batch size is greater than nine **and** our maximum buffer time is 
greater than five seconds, we'd be emitting those nine elements at a later 
time, but together. We're just artificially increasing the denominator of the 
throughput fraction. The hope is that this tradeoff benefits the downstream 
operation enough to be worth the potential bottleneck, but the documentation 
around that tradeoff was lacking prior to this change.
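
   For context, here is roughly what opting into the cross-bundle buffering 
looks like in a pipeline. This is a minimal sketch assuming the 
`max_batch_duration_secs` parameter shown in the diff above; the batch sizes 
and the five-second buffer are illustrative values, not recommendations.

   ```python
   # Minimal sketch of BatchElements with a maximum buffering duration.
   # The numeric arguments are illustrative only.
   import apache_beam as beam
   from apache_beam.transforms.util import BatchElements

   with beam.Pipeline() as p:
     _ = (
         p
         | beam.Create(range(100))
         # Buffer elements statefully across bundles for up to 5 seconds,
         # emitting a possibly undersized batch once that deadline passes.
         | BatchElements(
             min_batch_size=10,
             max_batch_size=64,
             max_batch_duration_secs=5)
         | beam.Map(len)  # each downstream element is now a batch (a list)
         | beam.Map(print))
   ```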



