Hi Igniters, hi Alexey. I want to discuss this issue: https://issues.apache.org/jira/browse/IGNITE-15099. I have caught it too.
I was able to determine where the race is. The heartbeat update happens asynchronously, in the listener code. But the checkpoint thread always waits for all pending async tasks, which is reasonable:

    for (CheckpointListener lsnr : dbLsnrs)
        lsnr.beforeCheckpointBegin(ctx0);

    ctx0.awaitPendingTasksFinished();

The race was caused by the wrong order of future registration. In CheckpointContextImpl.executor() (called during listener execution):

    GridFutureAdapter<?> res = new GridFutureAdapter<>();

    res.listen(fut -> heartbeatUpdater.updateHeartbeat());

    asyncRunner.execute(U.wrapIgniteFuture(cmd, res));

    pendingTaskFuture.add(res);

Here we create a task, submit it to the executor, and only after that register it. So the checkpointer thread could move on past ctx0.awaitPendingTasksFinished() while the not-yet-registered asyncRunner task was still running in parallel.

But anyway, I propose to remove the heartbeat update from other threads altogether and wrap the call to the listeners in a blockingSection. As I understand it, the heartbeat was designed only to let a worker indicate its own progress. If a worker cannot indicate its own progress in some piece of code, that code should be wrapped in a blockingSection. While the listeners run, the worker cannot indicate its own progress, so let's wrap them in a blockingSection.

Guys, what do you think about this?

-------------
Ilya Kazakov
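P.S. To illustrate the ordering problem described above, here is a minimal sketch outside Ignite. It is not Ignite code: PendingTaskTracker, RegistrationOrderDemo and their methods are hypothetical names, and java.util.concurrent.CompletableFuture stands in for GridFutureAdapter. An inline executor (Runnable::run) is used on purpose to model the worst case, where the task runs before the submitting thread reaches the registration line.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executor;

/** Hypothetical demo: compares the two orders of "submit" and "register". */
class RegistrationOrderDemo {
    /** Returns how many futures were registered at the moment the task ran. */
    static int registeredWhenTaskRan(boolean registerFirst) {
        PendingTaskTracker tracker = new PendingTaskTracker();
        int[] seen = new int[1];

        // Runs the task immediately on the caller thread: the worst case for the racy order.
        Executor inline = Runnable::run;
        Runnable cmd = () -> seen[0] = tracker.registeredCount();

        if (registerFirst)
            tracker.registerThenSubmit(inline, cmd);
        else
            tracker.submitThenRegister(inline, cmd);

        tracker.awaitAll();
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("submit-then-register: " + registeredWhenTaskRan(false)); // prints 0
        System.out.println("register-then-submit: " + registeredWhenTaskRan(true));  // prints 1
    }
}

/** Hypothetical stand-in for the checkpoint context's pending-task bookkeeping. */
class PendingTaskTracker {
    private final List<CompletableFuture<Void>> pending = new CopyOnWriteArrayList<>();

    /** Racy order: the task may start (or even finish) before it is registered. */
    void submitThenRegister(Executor exec, Runnable cmd) {
        CompletableFuture<Void> res = new CompletableFuture<>();
        exec.execute(wrap(cmd, res));
        pending.add(res);
    }

    /** Safe order: the future is visible to awaitAll() before the task can run. */
    void registerThenSubmit(Executor exec, Runnable cmd) {
        CompletableFuture<Void> res = new CompletableFuture<>();
        pending.add(res);
        exec.execute(wrap(cmd, res));
    }

    int registeredCount() {
        return pending.size();
    }

    /** Waits only for futures registered so far, like a waiter that snapshots the pending set. */
    void awaitAll() {
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }

    /** Completes the future once the command has run, successfully or not. */
    private static Runnable wrap(Runnable cmd, CompletableFuture<Void> res) {
        return () -> {
            try {
                cmd.run();
                res.complete(null);
            }
            catch (Throwable t) {
                res.completeExceptionally(t);
            }
        };
    }
}
```

With the submit-then-register order the task sees zero registered futures, so a waiter checking the pending set at that moment has nothing to wait for. Swapping the two lines closes that window, although it does not by itself address the separate question of whether other threads should update the heartbeat at all.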
