scwhittle commented on a change in pull request #16901:
URL: https://github.com/apache/beam/pull/16901#discussion_r812672510
##########
File path:
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java
##########
@@ -195,6 +195,8 @@
// retrieving extra work from Windmill without working on it, leading to better
// prioritization / utilization.
static final int MAX_WORK_UNITS_QUEUED = 100;
+ // Maximum bytes of WorkItems being processed in the work queue at a time.
+ static final int MAX_WORK_UNITS_BYTES = 500 << 20; // 500MB
Review comment:
The argument that pipelines would always OOM if they exceeded 500MB of
active work breaks down for pipelines with limited parallelism. Streaming
Dataflow does not return additional work for keys that are already active on
the worker, so a pipeline's active bytes are bounded by its per-worker key
parallelism, not by this limit alone.
For example, if a pipeline writes to files with fixed sharding and there are
only 8 shards per worker, the maximum queued/active work is 800MB (due to the
100MB work bundle limit). Since the Java harness has ~4GB of memory by
default, and configurably much more, such pipelines would not be OOMing.
So I'm still concerned that setting the default to 500MB could limit
pipelines that previously worked. Now that it is an option it is at least
something we can tune, but it still seems like it could cause issues and
should perhaps be scaled based on machine memory.
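As a rough sketch of the kind of memory-based scaling I have in mind (the
helper name and the 1/8 heap fraction are just illustrative, not an actual
API in this PR):
```java
// Illustrative only: derive the queued/active byte cap from the JVM max heap
// rather than a fixed 500MB, keeping 500MB as a floor so small workers see
// today's behavior while large-memory workers are not needlessly throttled.
static int computeMaxWorkUnitsBytes() {
  long maxHeapBytes = Runtime.getRuntime().maxMemory();
  long scaled = maxHeapBytes / 8; // e.g. allow up to 1/8 of the heap
  return (int) Math.min(Math.max(500L << 20, scaled), Integer.MAX_VALUE);
}
```
Something like that would keep the 8-shard file-writing example above from
being throttled on workers configured with larger heaps.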
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]