Thanks for the feedback, guys. To be honest, I'm no longer convinced this is the 
right thing to do. It does decrease CPU consumption significantly, but at least 
in our case that is not enough. It turns out that even when the pipeline is 
completely empty, the driver goes

THROTTLE, THROTTLE, CONTINUE, THROTTLE, THROTTLE, CONTINUE, ... and so on ...

So effectively the active loop becomes a loop with a 15 ms sleep (the average 
of 10 and 20 ms). Because the code executed in the active phase is itself 
non-trivial, this still puts an easily measurable load on the CPU. I was able 
to achieve some further minor improvements through low-level changes in how the 
driver works with collections, but it became obvious that (at least in my quite 
specific use case) this leads nowhere.

I was able to come up with an alternative, application-level solution that 
simply blocks the DirectRunner threads when the pipeline is empty and only 
resumes the DirectRunner loop when new data enters the pipeline.
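
To illustrate the general idea of blocking instead of polling, here is a minimal 
sketch in Java. It is only an assumption about how such a gate could look, not 
the code actually used: `IdleGate`, `elementAdded`, `elementDone` and 
`awaitWork` are hypothetical names and are not part of Beam or the DirectRunner. 
A worker thread parks on a condition variable while nothing is pending and is 
woken only when a producer reports a new element.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper (not a Beam API): parks workers while no element is pending. */
final class IdleGate {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition workAvailable = lock.newCondition();
  private long pending = 0;

  /** Producer side: call when a new element enters the pipeline. */
  void elementAdded() {
    lock.lock();
    try {
      pending++;
      workAvailable.signalAll(); // wake every parked worker
    } finally {
      lock.unlock();
    }
  }

  /** Worker side: call once an element has been fully processed. */
  void elementDone() {
    lock.lock();
    try {
      if (pending > 0) {
        pending--;
      }
    } finally {
      lock.unlock();
    }
  }

  /**
   * Blocks while nothing is pending, without burning CPU.
   * Returns true if work is pending, false on timeout or interruption.
   */
  boolean awaitWork(long timeoutMillis) {
    long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    lock.lock();
    try {
      while (pending == 0) {
        if (nanos <= 0L) {
          return false; // timed out with an empty pipeline
        }
        nanos = workAvailable.awaitNanos(nanos); // releases the lock while parked
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    } finally {
      lock.unlock();
    }
  }
}
```

In such a setup the worker would call `awaitWork` before each pass of its loop, 
so an empty pipeline costs no CPU between wake-ups. The timeout is only a safety 
net against a missed signal, not a polling interval.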

I'll keep on thinking about this for a while yet and then probably close this 
PR unless I figure out how to make it really useful...




[ Full content available at: https://github.com/apache/beam/pull/6303 ]