scwhittle commented on a change in pull request #16901:
URL: https://github.com/apache/beam/pull/16901#discussion_r812672510



##########
File path: runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java
##########
@@ -195,6 +195,8 @@
   // retrieving extra work from Windmill without working on it, leading to better
   // prioritization / utilization.
   static final int MAX_WORK_UNITS_QUEUED = 100;
+  // Maximum bytes of WorkItems being processed in the work queue at a time.
+  static final int MAX_WORK_UNITS_BYTES = 500 << 20; // 500MB

Review comment:
       The argument that pipelines would always OOM if they exceeded 500MB of active work breaks down for pipelines with limited parallelism. Streaming Dataflow does not return additional work for keys that are already active on the worker.
   
   For example, if you have a pipeline writing to files with fixed shards and there are only 8 shards per worker, the current queued/active bytes can reach 800MB (due to the 100MB work bundle limit). Since the Java harness has ~4GB of memory by default, and configurably a lot more, such pipelines would not be OOMing.
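   
   A quick back-of-envelope sketch of that scenario, using only the numbers quoted above (the shard count, bundle limit, and proposed cap are illustrative constants here, not values read from the worker):

```java
public class ActiveBytesExample {
  public static void main(String[] args) {
    long bundleLimitBytes = 100L << 20; // 100MB per-bundle limit
    long shardsPerWorker = 8;           // fixed-shard file sink, limited parallelism
    long activeBytes = shardsPerWorker * bundleLimitBytes; // 800MB queued/active
    long proposedCap = 500L << 20;      // proposed MAX_WORK_UNITS_BYTES default
    // 800MB > 500MB: the pipeline hits the cap even though the ~4GB heap has room.
    System.out.println(activeBytes > proposedCap); // prints: true
  }
}
```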
   
   So I'm still concerned that setting the default to 500MB could limit pipelines that previously worked. Now that it is an option it is at least something we can tune, but it still seems like it could cause issues and should perhaps be scaled based upon machine memory.
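   
   As a rough illustration of the scaling idea, the cap could be derived from the harness heap along these lines; the method name and the 1/4-of-heap fraction are assumptions for illustration, not part of this PR:

```java
// Hypothetical sketch of scaling the cap with machine memory; the 1/4 heap
// fraction and the method name are illustrative assumptions, not this PR's code.
static long defaultMaxWorkUnitsBytes() {
  long maxHeapBytes = Runtime.getRuntime().maxMemory(); // ~4GB default harness heap
  // Keep the current 500MB value as a floor so small workers behave as before.
  return Math.max(500L << 20, maxHeapBytes / 4);
}
```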



