scwhittle commented on issue #28776: URL: https://github.com/apache/beam/issues/28776#issuecomment-2051652566
I believe this is a long-standing bug within the Python SDK. Side inputs in the global window are cached in [PerWindowInvoker](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L948) without respecting the side input cache token. The invoker is part of the bundle processor, which is reused across bundles. Side input values are otherwise refreshed either by the [reset](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/bundle_processor.py#L507) here or by the runner modifying the side input cache token. Bundle processors stay cached as long as there is a steady rate of input, that is, as long as the time since last access stays under 60 seconds (see [here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py#L612)), so this can lead to extended periods where the captured global side input value is used without being refreshed.

I think we should remove the caching at the invoker level, as it does not respect the cache token and the StateBackedSideInput supports caching itself. This may be a performance regression, though, as the state cache is currently disabled by default.
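To make the failure mode concrete, here is a minimal sketch of the difference between a cache that ignores the cache token and one that honors it. This is not the Beam code: `TokenBlindCache`, `TokenAwareCache`, `fetch`, and `demo` are hypothetical names invented for illustration, and the token handling is simplified to a plain equality check.

```python
from typing import Any, Callable, Optional


class TokenBlindCache:
    """Hypothetical stand-in for invoker-level caching: loads once, never rechecks."""

    def __init__(self, fetch: Callable[[], Any]) -> None:
        self._fetch = fetch
        self._loaded = False
        self._value: Any = None

    def get(self, cache_token: Optional[bytes]) -> Any:
        # The token is ignored, so a runner-side token change has no effect.
        if not self._loaded:
            self._value = self._fetch()
            self._loaded = True
        return self._value


class TokenAwareCache:
    """Hypothetical cache that refetches whenever the token changes or is absent."""

    def __init__(self, fetch: Callable[[], Any]) -> None:
        self._fetch = fetch
        self._token: Optional[bytes] = None
        self._value: Any = None

    def get(self, cache_token: Optional[bytes]) -> Any:
        # No token means caching is not permitted; a new token invalidates the entry.
        if cache_token is None or cache_token != self._token:
            self._value = self._fetch()
            self._token = cache_token
        return self._value


def demo():
    store = {"value": "v1"}  # stands in for the side input held by the runner
    fetch = lambda: store["value"]
    blind, aware = TokenBlindCache(fetch), TokenAwareCache(fetch)

    first = (blind.get(b"t1"), aware.get(b"t1"))
    store["value"] = "v2"  # side input is updated and the runner issues a new token
    second = (blind.get(b"t2"), aware.get(b"t2"))
    return first, second
```

In this sketch, the token-blind cache keeps returning the value captured in the first bundle for as long as the object holding it stays alive, which mirrors what happens while the bundle processor remains cached. The token-aware cache picks up the new value as soon as the token changes.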
