Hey all,

I'm working on extending my pipeline runner so that multiple batches of
inputs are streamed into a single bundle (to amortize bundle setup time)
rather than just sending a single batch of inputs into the data channel and
then closing the data channel.

I'm using the ProcessBundleProgressRequest[1] to poll my workers to see if
they're idle and ready for more inputs using the `consuming_received_data`
field, but I noticed that `consuming_received_data` doesn't actually imply
whether or not the worker is busy or ready for work. Specifically, while
the worker is setting up (e.g. running `setup_bundle`),
`consuming_received_data` is set to False even though the worker is not
actually ready for work.

This makes it a bit awkward for runners to know whether or not it's safe to
send work to a worker. If `setup_bundle` takes a very long time, then a
runner might repeatedly send more and more work if it uses
`consuming_received_data` as an indicator of idleness. Am I misusing
`consuming_received_data`? Is there another more reliable way to detect
worker idleness?

cheers,
Joey

https://github.com/apache/beam/blob/357b7206f82f788b55732247c9096527f92adf4f/model/fn-execution/src/main/proto/org/apache/beam/model/fn_execution/v1/beam_fn_api.proto#L476

Reply via email to