Yeah, we saw this as well this morning, in a job that triggers checkpoints
super fast (50msecs).

I think we have a good fix figured out, let's solve this for 1.0...

On Tue, Jan 19, 2016 at 3:25 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> I just got back to this issue. The problem wasn't with the locking but that
> the StreamTask wasn't in running state before the first checkpoint trigger
> message.
> I actually just saw your JIRA as well, funny... :)
>
> Regards,
> Gyula
>
> Stephan Ewen <se...@apache.org> ezt írta (időpont: 2016. jan. 8., P,
> 15:36):
>
> > Hmm, strange issue indeed.
> >
> > So, checkpoints are definitely triggered (log message by coordinator to
> > trigger checkpoint) but are not completing?
> > Can you check which is the first checkpoint to complete? Is it Checkpoint
> > 1, or a later one (indicating that checkpoint 1 was somehow subsumed).
> >
> > Can you check in the stacktrace on which lock the checkpoint runables are
> > waiting, and who is holding that lock?
> >
> > Two thoughts:
> >
> > 1) What I mistakenly did once in one of my tests is to have the sleep()
> in
> > a downstream task. That would simply prevent the fast generated data
> > elements (and the inline checkpoint barriers) from passing though and
> > completing the checkpoint.
> >
> > 2) Is this another issue with the non-fair lock? Does the checkpoint
> > runnable simply not get the lock before the checkpoint. Not sure why it
> > would suddenly work after the failure. We could try and swap the lock
> > Object by a "ReentrantLock(true)" and see what would happen.
> >
> >
> > Stephan
> >
> >
> > On Fri, Jan 8, 2016 at 11:49 AM, Gyula Fóra <gyf...@apache.org> wrote:
> >
> > > Hey,
> > >
> > > I have encountered a weird issue in a checkpointing test I am trying to
> > > write. The logic is the same as with the previous checkpointing tests,
> > > there is a OnceFailingReducer.
> > >
> > > My problem is that before the reducer fails, my job cannot take any
> > > snapshots. The Runnables executing the checkpointing logic in the
> sources
> > > keep waiting on some lock.
> > >
> > > After the failure and the restart, everything is fine and the
> > checkpointing
> > > can succeed properly.
> > >
> > > Also if I remove the failure from the reducer, the job doesnt take any
> > > snapshots (waiting on lock) and the job will finish.
> > >
> > > Here is the code:
> > >
> > >
> >
> https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83
> > >
> > > I assume there is no problem with the source as the Thread.sleep(..) is
> > > outside of the synchronized block. (and as I said after the failure it
> > > works fine).
> > >
> > > Any ideas?
> > >
> > > Thanks,
> > > Gyula
> > >
> >
>

Reply via email to