Yeah, we saw this as well this morning, in a job that triggers checkpoints super fast (50msecs).
I think we have a good fix figured out, let's solve this for 1.0... On Tue, Jan 19, 2016 at 3:25 PM, Gyula Fóra <gyula.f...@gmail.com> wrote: > I just got back to this issue. The problem wasn't with the locking but that > the StreamTask wasn't in running state before the first checkpoint trigger > message. > I actually just saw your JIRA as well, funny... :) > > Regards, > Gyula > > Stephan Ewen <se...@apache.org> ezt írta (időpont: 2016. jan. 8., P, > 15:36): > > > Hmm, strange issue indeed. > > > > So, checkpoints are definitely triggered (log message by coordinator to > > trigger checkpoint) but are not completing? > > Can you check which is the first checkpoint to complete? Is it Checkpoint > > 1, or a later one (indicating that checkpoint 1 was somehow subsumed). > > > > Can you check in the stacktrace on which lock the checkpoint runables are > > waiting, and who is holding that lock? > > > > Two thoughts: > > > > 1) What I mistakenly did once in one of my tests is to have the sleep() > in > > a downstream task. That would simply prevent the fast generated data > > elements (and the inline checkpoint barriers) from passing though and > > completing the checkpoint. > > > > 2) Is this another issue with the non-fair lock? Does the checkpoint > > runnable simply not get the lock before the checkpoint. Not sure why it > > would suddenly work after the failure. We could try and swap the lock > > Object by a "ReentrantLock(true)" and see what would happen. > > > > > > Stephan > > > > > > On Fri, Jan 8, 2016 at 11:49 AM, Gyula Fóra <gyf...@apache.org> wrote: > > > > > Hey, > > > > > > I have encountered a weird issue in a checkpointing test I am trying to > > > write. The logic is the same as with the previous checkpointing tests, > > > there is a OnceFailingReducer. > > > > > > My problem is that before the reducer fails, my job cannot take any > > > snapshots. The Runnables executing the checkpointing logic in the > sources > > > keep waiting on some lock. > > > > > > After the failure and the restart, everything is fine and the > > checkpointing > > > can succeed properly. > > > > > > Also if I remove the failure from the reducer, the job doesnt take any > > > snapshots (waiting on lock) and the job will finish. > > > > > > Here is the code: > > > > > > > > > https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83 > > > > > > I assume there is no problem with the source as the Thread.sleep(..) is > > > outside of the synchronized block. (and as I said after the failure it > > > works fine). > > > > > > Any ideas? > > > > > > Thanks, > > > Gyula > > > > > >