naddeoa commented on issue #24360:
URL: https://github.com/apache/beam/issues/24360#issuecomment-1335928272

   @tvalentyn thanks.
   
   Some questions/assumptions:
   
   1. At a high level, what would an ideal value for this look like? I assume it has to be some percentage of the memory on the machine type you're using for workers. Do you know how much memory it's safe to consume purely in data? (Rough back-of-envelope sketch below.)
   2. Does the batch size lead to any disk usage as well (not counting whatever someone's custom DoFn might be doing), or does it just imply more memory usage?
   3. The best user experience would be to have the batch size determined dynamically so that no one has to think about it. Would that happen through size estimates? That might be getting ahead of myself, since Beam would also have to know a lot about the memory available on the workers. (See the dynamic-batching sketch after this list.)
   4. Why was 4096 chosen? At a glance, if we're talking about running Dataflow on your average BigQuery data with a schema of a dozen text/numeric columns, then 4096 rows might not even be a MB. Wouldn't a more aggressive default still have been a safe estimate?

