kkdoon commented on PR #28567: URL: https://github.com/apache/beam/pull/28567#issuecomment-1730436738
> I don't think we can just skip the check. The fact that the check fails implies there are still buffered data that have not been processed yet. We need to either: a) correctly flush and process the buffered data (which will clear the hold), or b) ensure that checkpoint is called before the final watermark
>
> I have a suspicion that correct is only b), because otherwise, in case of failure and restore from checkpoint after we flushed the data, we might break the stable input contract.

I agree that option b) is the cleanest and makes logical sense. In implementation terms, it would work if [flushData](https://github.com/apache/beam/blob/79d0a8d11095522a036a6b3007f5fed4f6f46b3b/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java#L608) were somehow invoked after the last checkpoint finishes (or via some other approach). Option a) violates the `requiresStableInput` contract and hence seems incorrect (I did experiment with this approach as well, and it gave the expected output when the checkpoint completed successfully).

Having said that, the current approach also works correctly when `requiresStableInput` is set, since all the buffered data is emitted after the checkpoint completes and the output watermark finally becomes equal to MAX_WATERMARK (I have run multiple jobs to verify this behavior). I can defer the `currentOutputWatermark < Long.MAX_VALUE` check for the `requiresStableInput` scenario and instead perform it inside [notifyCheckpointComplete](https://github.com/apache/beam/blob/79d0a8d11095522a036a6b3007f5fed4f6f46b3b/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java#L1045), at the time of job completion, to verify that the output watermark is correct.
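A minimal, standalone sketch of the deferred check described above. This is a toy model, not the actual `DoFnOperator` code: the class `BufferingOperatorSketch` and all of its methods and fields are hypothetical names chosen for illustration, and the real operator's buffering, watermark holds, and checkpoint lifecycle are considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, NOT Beam's DoFnOperator) of deferring the final watermark check. */
class BufferingOperatorSketch {
  static final long MAX_WATERMARK = Long.MAX_VALUE;

  private final boolean requiresStableInput;
  private final List<String> buffer = new ArrayList<>();
  private final List<String> emitted = new ArrayList<>();
  private long currentOutputWatermark = Long.MIN_VALUE;
  private boolean finalWatermarkSeen = false;

  BufferingOperatorSketch(boolean requiresStableInput) {
    this.requiresStableInput = requiresStableInput;
  }

  /** With stable input, elements are buffered until a checkpoint completes. */
  void processElement(String element) {
    if (requiresStableInput) {
      buffer.add(element);
    } else {
      emitted.add(element);
    }
  }

  /** Called when the input watermark reaches MAX_WATERMARK (end of input). */
  void processFinalWatermark() {
    finalWatermarkSeen = true;
    advanceOutputWatermark();
    // Without stable input nothing is buffered, so the check can run right away.
    // With stable input the check is deferred to notifyCheckpointComplete().
    if (!requiresStableInput) {
      verifyOutputWatermark();
    }
  }

  /** Buffered elements are only emitted once the checkpoint has completed. */
  void notifyCheckpointComplete() {
    emitted.addAll(buffer);
    buffer.clear();
    if (finalWatermarkSeen) {
      advanceOutputWatermark();
      verifyOutputWatermark(); // the deferred check
    }
  }

  /** Buffered elements hold the output watermark back. */
  private void advanceOutputWatermark() {
    if (buffer.isEmpty()) {
      currentOutputWatermark = MAX_WATERMARK;
    }
  }

  private void verifyOutputWatermark() {
    if (currentOutputWatermark < MAX_WATERMARK) {
      throw new IllegalStateException("Output watermark did not reach MAX_WATERMARK");
    }
  }

  long currentOutputWatermark() {
    return currentOutputWatermark;
  }

  List<String> emitted() {
    return emitted;
  }
}
```

In this model, the output watermark stays held after the final watermark while elements are buffered, and only reaches `MAX_WATERMARK` (and gets verified) once the checkpoint-complete notification flushes the buffer. Whether a final checkpoint is guaranteed to arrive at job completion is exactly the open question in this thread; the sketch simply assumes it does.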
