damccorm opened a new issue, #20496: URL: https://github.com/apache/beam/issues/20496
We have implemented some custom IO classes based on UnboundedReader/UnboundedSource. These work as expected, but while doing this I noticed a few things that do not seem to be well documented, and I am not sure whether they behave as intended.

With the direct runner, when advance() returns false repeatedly, the runner appears to apply an increasing backoff to subsequent calls to advance() until it returns true, at which point the backoff is reset. This is what I would expect.

However, when the same code is run on Dataflow, advance() is called continuously, multiple times a second, for a given UnboundedSource instance, with no backoff. With more than one instance/worker this can start to produce additional CPU load.

I'm unclear what the right way to handle this is. For example, should you sleep in advance()? I assume not, but it would be great if there were documentation for this interface, especially covering how behavior differs between runners and how to implement advance() so that resources are used efficiently when no events are available from the underlying source.

Imported from Jira [BEAM-10503](https://issues.apache.org/jira/browse/BEAM-10503). Original Jira may contain additional context. Reported by: ameihm.
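
To make the backoff pattern described above concrete, here is a minimal, self-contained sketch of "increase the delay while advance() returns false, reset when it returns true". It is an illustration only: it does not use any Beam classes, it is not the direct runner's actual implementation, and the class name `BackoffPoller`, the doubling policy, and the delay values are all assumptions made for the example.

```java
/**
 * Hypothetical illustration of an increasing backoff between polls.
 * This is NOT Beam code and NOT how any runner is known to implement it;
 * the doubling policy and the delay values are assumptions.
 */
public class BackoffPoller {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private long currentDelayMs;

    public BackoffPoller(long initialDelayMs, long maxDelayMs) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.currentDelayMs = initialDelayMs;
    }

    /**
     * Records the result of one advance()-style call and returns how long
     * the caller should wait before the next call (0 means call again now).
     */
    public long nextDelayMs(boolean advanced) {
        if (advanced) {
            // A record was available: reset the backoff.
            currentDelayMs = initialDelayMs;
            return 0;
        }
        // No record: wait for the current delay, then double it up to the cap.
        long delay = currentDelayMs;
        currentDelayMs = Math.min(currentDelayMs * 2, maxDelayMs);
        return delay;
    }

    public static void main(String[] args) {
        BackoffPoller poller = new BackoffPoller(10, 80);
        // Simulated sequence of advance() results from an idle, then active, source.
        boolean[] results = {false, false, false, false, false, true, false};
        for (boolean advanced : results) {
            System.out.println("advance() returned " + advanced
                + " -> wait " + poller.nextDelayMs(advanced) + " ms");
        }
    }
}
```

Whether this kind of delay belongs in the runner or in the reader itself is exactly the open question in this issue; the sketch only shows the behavior, not a recommendation.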
