Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/1668
Thanks for the reminder, I went over the code today. The code looks mostly
good, but here are some thoughts:
- The head task supports only one concurrent checkpoint. In general, the
tasks need to support multiple checkpoints being in progress at the same time.
It frequently happens when people trigger savepoints concurrent to a running
checkpoint. I think that is important to support.
- There tail task offers the elements to the blocking queue. That means
records are simply dropped if the capacity bound queue (one element) is not
polled by the head task in time.
- With the capacity bound in the feedback queue, it is pretty easy to
build a full deadlock. Just use a loop function that explodes data into the
feedback channel.
- Recent code also introduced the ability to change parallelism. What are
the semantics here when the parallelism of the loop is changed?
Since loops did not support any fault tolerance guarantees, I guess this
does improve recovery behavior. But as long as the loops can either deadlock or
drop data, the hard guarantees are in the end still a bit weak. So that leaves
me a bit ambivalent what to do with this pull request.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---