curcur edited a comment on pull request #13648: URL: https://github.com/apache/flink/pull/13648#issuecomment-716973831
Hey @rkhachatryan and @pnowojski , thanks for the response. > We realized that if we check view reference in subpartition (or null out view.parent) then the downstream which view was overwritten by an older thread, will sooner or later be fenced by the upstream (or checkpoint will time out). > FLINK-19774 then becomes only an optimization. 1. Do you have preferences on `check view reference` vs `null out view.parent`? I have evaluated these two methods before, and I feel set view.parent -> null may be a cleaner way (that's why I proposed null out parent). Conceptually, it breaks the connection between the old view and its parent; Implementation wise, it can limit the (parent == null) handling mostly within `PipelinedApproximateSubpartition` or probably a little bit in `PipelinedSubpartition`, while in the reference check way, we have to change the interface and touch all the subpartitions that implements `PipelinedSubpartition`. Notice that not just `pollNext()` needs the check, everything that touches parent needs a check. 2. How this can solve the problem of a new view is replaced by an old one? Let's say downstream reconnects, asking for a new view; the new view is created; replaced by an old view that is triggered by an old handler event. The new view's parent is null-out. Then what will happen? I do not think "fenced by the upstream (or checkpoint will time out)" can **solve** this problem, it just ends with more failures caused by this problem. > So we can prevent most of the issues without touching deployment descriptors. > Therefore, I think it makes sense to implement it in this PR. Sorry for changing the decision. I am overall fine to put the check within this PR. However, may I ask how it is different from we have this PR as it is, and I do a follow-up one to null-out the parent? In this PR, my main purpose is to introduce a different result partition type, and scheduler changes are based upon this new type. That's the main reason I prefer to do it in a follow-up PR otherwise the scheduler part is blocked. And also, it is easier to review, and for me to focus on tests after null-out parent as well. > Another concern is a potential resource leak if downstream continuously fail without notifying the upstream (instances of CreditBasedSequenceNumberingViewReader will accumulate). Can you create a follow-up ticket for that? > This can be addressed by firing user event (see PartitionRequestQueue.userEventTriggered). Yes, that's a good point. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org