naddeoa commented on issue #24360: URL: https://github.com/apache/beam/issues/24360#issuecomment-1335928272
@tvalentyn thanks. Some questions/assumptions:

1. At a high level, what would an ideal value for this look like? I assume it has to be some percentage of the memory on the machine type you're using for workers. Do you know how much memory it's safe to consume purely in data?
2. Does the batch size lead to any disk usage as well (not counting whatever someone's custom DoFn might be doing), or does it just imply more memory usage?
3. The best user experience would be to have the batch size determined dynamically so no one has to think about it. Would that happen through size estimates? That might be getting ahead of myself, since Beam would also have to know a lot about the memory available on the workers.
4. Why was 4096 chosen? At a glance, if we're talking about running Dataflow on your average BigQuery data with a schema of a dozen text/number fields, then 4096 elements might not even add up to a MB. Wouldn't a more aggressive default still have been a safe estimate? (Rough sizing sketch below.)
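To make question 4 concrete, here is a minimal back-of-envelope sketch; the per-row size is an assumption for illustration, not a measurement of any real BigQuery table:

```python
# Hedged sizing sketch: assume a row with ~12 small text/numeric fields
# averages ~20 bytes per field (an assumption, not a measured value).
assumed_bytes_per_row = 12 * 20      # ~240 bytes per row
default_batch_size = 4096

batch_bytes = default_batch_size * assumed_bytes_per_row
print(f"~{batch_bytes / 1024 / 1024:.2f} MiB per batch")  # roughly 0.94 MiB

# Even if rows were 10x larger, a 4096-element batch would only be ~10 MiB,
# which is why a more aggressive default seems like it would still be safe
# for rows in this size range.
print(f"~{batch_bytes * 10 / 1024 / 1024:.1f} MiB at 10x the row size")
```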
