[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery

GitBox Mon, 26 Oct 2020 21:51:02 -0700


curcur edited a comment on pull request #13648:
URL: https://github.com/apache/flink/pull/13648#issuecomment-716973831



   Hey @rkhachatryan and @pnowojski , thanks for the response.
   
   > We realized that if we check view reference in subpartition (or null out 
view.parent) then the downstream which view was overwritten by an older thread, 
will sooner or later be fenced by the upstream (or checkpoint will time out).
   > FLINK-19774 then becomes only an optimization.
   
   1. Do you have preferences on `check view reference` vs `null out 
view.parent`? 
   I have evaluated these two methods before, and I feel set view.parent -> 
null may be a cleaner way (that's why I proposed null out parent). 
Conceptually, it breaks the connection between the old view and its parent; 
Implementation wise, it can limit the (parent == null) handling mostly within 
`PipelinedApproximateSubpartition` or probably a little bit in 
`PipelinedSubpartition`, while in the reference check way, we have to change 
the interface and touch all the subpartitions that implements 
`PipelinedSubpartition`. Notice that not just `pollNext()` needs the check, 
everything that touches parent needs a check.
   
   2. How this can solve the problem of a new view is replaced by an old one? 
Let's say downstream reconnects, asking for a new view; the new view is 
created; replaced by an old view that is triggered by an old handler event. The 
new view's parent is null-out. Then what will happen? I do not think "fenced by 
the upstream (or checkpoint will time out)" can **solve** this problem, it just 
ends with more failures caused by this problem.
   
   > So we can prevent most of the issues without touching deployment 
descriptors.
   > Therefore, I think it makes sense to implement it in this PR. Sorry for 
changing the decision.
   
   I am overall fine to put the check within this PR. However, may I ask how it 
is different from we have this PR as it is, and I do a follow-up one to 
null-out the parent? 
   In this PR, my main purpose is to introduce a different result partition 
type, and scheduler changes are based upon this new type. That's the main 
reason I prefer to do it in a follow-up PR otherwise the scheduler part is 
blocked.
   
   And also, it is easier to review, and for me to focus on tests after 
null-out parent as well.
   
   > Another concern is a potential resource leak if downstream continuously 
fail without notifying the upstream (instances of 
CreditBasedSequenceNumberingViewReader will accumulate). Can you create a 
follow-up ticket for that?
   > This can be addressed by firing user event (see 
PartitionRequestQueue.userEventTriggered).
   
   Yes, that's a good point.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery

Reply via email to