curcur edited a comment on pull request #13648:
URL: https://github.com/apache/flink/pull/13648#issuecomment-714599452
Thanks, @rkhachatryan for clarifying the problem!
I agree there is a problem if a downstream task continuously fails multiple
times, or an orphan task execution may exist for a short period of time after
new execution is running (as described in the FLIP).
Here is an idea of how to cleanly and thoroughly solve this kind of problem:
1. We go with the simplified release view version: only release view before
a new creation (in thread2). That says we won't clean up view when downstream
task disconnects (`releaseView` would not be called from the reference copy of
view) (in thread1 or 2).
- This would greatly simplify the threading model
- This won't cause any resource leak, since view release is only to
notify the upstream result partition to releaseOnConsumption when all
subpartitions are consumed in PipelinedSubPartitionView. In our case, we do not
release result partition on consumption any way (the result partition is put in
track in JobMaster, similar to the ResultParition.blocking Type).
2. Each view is associated with a downstream task execution version
- This is making sense because we actually have different versions of
view now, corresponding to the vertex.version of the downstream task.
- createView is performed only if the new version to create is greater
than the existing one
- If we decide to create a new view, the old view's parent (subpartition)
is set --> invalid
I think this way, we can completely disconnect the old view with the
subpartition. Besides that, the working handler in use would always hold the
freshest view reference.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]