Re: [DISCUSS] FLIP-147: Support Checkpoints After Tasks Finished

Aljoscha Krettek Wed, 06 Jan 2021 03:17:43 -0800

On 2021/01/06 11:30, Arvid Heise wrote:

I'm assuming that this is the normal case. In a A->B graph, as soon as A
finishes, B still has a couple of input buffers to process. If you add
backpressure or longer pipelines into the mix, it's quite likely that a
checkpoint may occur with B being the head.

Ahh, I think I know what you mean. This can happen when the checkpoint coordinator issues concurrent checkpoints without waiting for older ones to finish. My head is mostly operating under the premise that there is at most one concurrent checkpoint.

In the current code base, the race conditions that Yun and I are talking about cannot occur. Checkpoints can only be triggered at sources, and they will then travel through the graph. Intermediate operators are never directly triggered from the JobManager/CheckpointCoordinator.

When sources start to shut down, the JM has to directly inject/trigger checkpoints at the now new "sources" of the graph, which have previously been intermediate operators.

I want to repeat that I have a suspicion that maybe this is a degenerate case and we never want to allow operators to be doing checkpoints when they are not connected to at least one running source. That means we have to find a solution for declined checkpoints and missing sources.

I'll first show an example where I think we will never have intermediate operators running without their sources also running:


Source -> Map -> Sink

Here, when the Source does its final checkpoint and then shuts down, that same final checkpoint would travel downstream ahead of the EOF, which would in turn cause Map and Sink to also shut down. *We can't have the case that Map is still running when we want to take a checkpoint and Source is not running*.


A similar case is this one:

Source1 --+
          |->Map -> Sink
Source2 --+

Here, if Source1 is finished but Source2 is not, Map is still connected to at least one upstream source that is still running. Again, Map would never be running and doing checkpoints if neither Source1 nor Source2 is online.
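
To make the "connected to at least one running source" condition concrete, here is a minimal sketch. It is not Flink code: the function name, the dictionary-based graph, and the `running`/`sources` sets are all made up for illustration. It simply walks the graph upstream from an operator and reports whether any source it reaches is still running.

```python
from collections import deque


def has_running_upstream_source(graph, running, sources, node):
    """Return True if `node` is (transitively) fed by a source that still runs.

    graph   -- hypothetical job graph: operator name -> list of downstream names
    running -- set of operator names that have not finished yet
    sources -- set of operator names that are sources
    """
    # Invert the edges so we can walk from an operator towards its inputs.
    upstream = {}
    for src, dsts in graph.items():
        for dst in dsts:
            upstream.setdefault(dst, []).append(src)

    seen = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if current in sources and current in running:
            return True
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


# The two-source example from this mail.
graph = {"Source1": ["Map"], "Source2": ["Map"], "Map": ["Sink"], "Sink": []}
sources = {"Source1", "Source2"}

# Source1 finished, Source2 still running: Map is still connected.
map_connected = has_running_upstream_source(
    graph, {"Source2", "Map", "Sink"}, sources, "Map")

# Both sources finished: Map has no running source upstream.
map_orphaned = has_running_upstream_source(
    graph, {"Map", "Sink"}, sources, "Map")
```

Under the suspicion stated above, an operator for which this check returns False would be the degenerate case that should not be taking checkpoints.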

The cases I see where intermediate operators would keep running despite not being connected to any upstream operators are when we purposefully keep an operator online despite all inputs having seen EOF. One example is async I/O, another is what Yun mentioned where a sink might want to wait for another checkpoint to confirm some data. Example:


Source -> Async I/O -> Sink

Here, Async I/O will stay online as long as there are some scheduled requests outstanding, even when the Source has shut down. In those cases, the checkpoint coordinator would have to trigger new checkpoints at Async I/O and not Source, because it has become the new "head" of the graph.
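
As a rough illustration of what "the new head of the graph" means for triggering, the sketch below picks every running task that has no running direct predecessor. This is only my assumption of how such a selection could look; the function name and the edge-list representation are hypothetical and not taken from the CheckpointCoordinator.

```python
def checkpoint_trigger_targets(edges, running):
    """Return the running tasks that have no running direct predecessor.

    edges   -- hypothetical list of (upstream, downstream) task-name pairs
    running -- set of task names that have not finished yet
    """
    predecessors = {}
    for up, down in edges:
        predecessors.setdefault(down, set()).add(up)

    # A task is a "head" if none of its direct inputs is still running.
    return sorted(
        task for task in running
        if not (predecessors.get(task, set()) & running)
    )


edges = [("Source", "AsyncIO"), ("AsyncIO", "Sink")]

# Everything running: only the real source is triggered.
heads_before = checkpoint_trigger_targets(edges, {"Source", "AsyncIO", "Sink"})

# Source has shut down, Async I/O still has requests in flight.
heads_after = checkpoint_trigger_targets(edges, {"AsyncIO", "Sink"})
```

With all tasks running, the only target is Source, which matches today's behaviour; once Source is gone, Async I/O becomes the target.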

For Async I/O at least, we could say that the operator will wait for all outstanding requests to finish before it allows the final checkpoint and passes the barrier forward.
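
A toy model of that idea, purely as a sketch: the class and method names below are invented and do not correspond to the actual async I/O operator. The operator holds back the final barrier while requests are in flight and forwards it only once the last one completes.

```python
class AsyncIoOperatorSketch:
    """Hypothetical operator that delays the final barrier until it is drained."""

    def __init__(self):
        self.outstanding = set()           # ids of requests still in flight
        self.pending_final_barrier = None  # checkpoint id waiting to be forwarded
        self.forwarded = []                # checkpoint ids passed downstream

    def submit(self, request_id):
        self.outstanding.add(request_id)

    def complete(self, request_id):
        self.outstanding.discard(request_id)
        self._maybe_forward()

    def on_final_barrier(self, checkpoint_id):
        # Do not checkpoint yet; remember the barrier and wait for the drain.
        self.pending_final_barrier = checkpoint_id
        self._maybe_forward()

    def _maybe_forward(self):
        if self.pending_final_barrier is not None and not self.outstanding:
            self.forwarded.append(self.pending_final_barrier)
            self.pending_final_barrier = None


op = AsyncIoOperatorSketch()
op.submit("r1")
op.submit("r2")
op.on_final_barrier(7)   # held back: two requests outstanding
op.complete("r1")        # still held back
op.complete("r2")        # drained, barrier 7 is forwarded now
```

With this behaviour the operator never outlives the final checkpoint, so it would not need to become a new "head" at all.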


Best,
Aljoscha
