Thanks for the feedback, guys. To be honest, I'm no longer convinced this is the right thing to do. It does decrease CPU consumption significantly, but at least in our case it is not enough. It turns out that even when the pipeline is completely empty, the driver goes THROTTLE, THROTTLE, CONTINUE, THROTTLE, THROTTLE, CONTINUE, and so on. So effectively the active loop becomes a loop with a 15 ms sleep (the average of 10 and 20 ms). Because the code executed in the active phase is itself non-trivial, this still puts an easily measurable load on the CPU.

I was able to achieve some further minor improvements with low-level changes to how the driver works with collections, but it became obvious that (at least in my quite specific use case) this leads nowhere. I was able to come up with an alternative, application-level solution that simply blocks the DirectRunner threads when the pipeline is empty and resumes the DirectRunner loop only when new data enters the pipeline. I'll keep thinking about this for a while and then probably close this PR unless I figure out how to make it really useful...

[ Full content available at: https://github.com/apache/beam/pull/6303 ]
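
For anyone curious what "block until new data enters" can look like in general terms, here is a minimal sketch. It is not the code from this PR and it does not touch any Beam or DirectRunner API: `BlockingWorkGate`, `offer`, and `take` are hypothetical names made up for illustration, and the assumption is simply that the application controls the point where data is handed to the pipeline. The consumer parks on a condition variable instead of waking up every 10 to 20 ms, so an idle consumer does no periodic work.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical hand-off point between a producer and a worker thread.
 * Illustration only; not part of Beam.
 */
final class BlockingWorkGate<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition workAvailable = lock.newCondition();
    private final Queue<T> pending = new ArrayDeque<>();

    /** Called by the producer when new data arrives; wakes one waiting worker. */
    void offer(T item) {
        lock.lock();
        try {
            pending.add(item);
            workAvailable.signal();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Called by the worker. Parks the thread while nothing is pending,
     * instead of sleeping for a fixed interval and polling again.
     */
    T take() {
        lock.lock();
        try {
            // Re-check in a loop to guard against spurious wakeups.
            while (pending.isEmpty()) {
                workAvailable.awaitUninterruptibly();
            }
            return pending.remove();
        } finally {
            lock.unlock();
        }
    }
}
```

A worker would call `take()` in its loop and the input side would call `offer(...)`. In practice `java.util.concurrent.LinkedBlockingQueue` gives the same behavior out of the box; the explicit lock and condition are spelled out here only to make the blocking step visible.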
