tvalentyn commented on code in PR #32082:
URL: https://github.com/apache/beam/pull/32082#discussion_r1718966770


##########
sdks/python/apache_beam/transforms/util.py:
##########
@@ -802,6 +802,13 @@ class BatchElements(PTransform):
   corresponding to its contents. Each batch is emitted with a timestamp at
   the end of their window.
 
+  When the max_batch_duration_secs arg is provided, a stateful implementation
+  of BatchElements is used to batch elements across bundles. This is most
+  impactful in streaming applications where many bundles only contain one
+  element. Larger max_batch_duration_secs values will reduce the throughput of

Review Comment:
   > Larger max_batch_duration_secs values will reduce the throughput
   
   is xput the right term here? I feel like a longer duration should _increase_ 
xput, because we reduce the per-element overhead. At least, if we measure xput 
over a sufficiently long duration, say elements per hour.
   
   However, the added latency might result in increased data freshness readings 
for downstream stages: 
https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf#data_freshness_streaming.
 
   
   WDYT about the following: 
   
   Larger max_batch_duration_secs values might increase the overall throughput 
of the transform, but might negatively impact the data freshness on downstream 
transforms due to added latency. Smaller values will have less impact on data 
freshness, but might make batches smaller than the target batch size.
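
   To illustrate the tradeoff being discussed, here is a hypothetical,
simplified sketch (not Beam's actual stateful implementation) of a batcher
that flushes either when the target batch size is reached or when a maximum
duration has elapsed since the first buffered element. The function name and
event model are invented for this example:

   ```python
   def batch(events, target_size, max_duration):
       """Group (arrival_time, element) pairs into batches.

       Flushes a batch when it reaches target_size, or when max_duration
       has elapsed since the batch's first element arrived.
       Returns a list of (flush_time, batch) pairs.
       """
       batches = []
       buf, start = [], None
       for t, e in events:
           # Flush on timeout before accepting the new element.
           if buf and t - start >= max_duration:
               batches.append((start + max_duration, buf))
               buf, start = [], None
           if not buf:
               start = t
           buf.append(e)
           # Flush as soon as the target batch size is reached.
           if len(buf) >= target_size:
               batches.append((t, buf))
               buf, start = [], None
       if buf:
           batches.append((start + max_duration, buf))
       return batches

   # One element arrives per second; the target batch size is 5.
   events = [(t, t) for t in range(10)]

   # A large duration waits long enough to fill batches of 5 (better
   # per-element overhead, but the first flush happens later)...
   large = batch(events, target_size=5, max_duration=60)

   # ...while a small duration flushes earlier (fresher data downstream),
   # but yields batches smaller than the target size.
   small = batch(events, target_size=5, max_duration=2)
   ```

   With these inputs, the large-duration run emits two full batches of 5,
while the small-duration run emits five batches of 2, with its first flush
happening earlier, which mirrors the throughput-vs-freshness tradeoff in the
suggested wording.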


