The precise interactions with the DataSourceV2 API haven't been hammered
out in the design yet. But much of this comes down to the core of
Structured Streaming rather than to the API details.

The execution engine handles checkpointing and recovery. It asks the
streaming data source for its current offsets, determines that batch N
contains the data between offset A and offset B, and persists that range
to the write-ahead offset log in the checkpoint directory before running
the batch. On recovery, if batch N needs to be re-run, the execution
engine just asks the source for the same offset range again. Sources also
get a handle to their own subfolder of the checkpoint, which they can use
as scratch space if they need it. For example, Spark's FileStreamSource
keeps a log of all the files it has seen, so its offsets can simply be
indices into that log rather than huge strings containing all the paths.
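
To make that division of labor concrete, here is a minimal sketch of the
contract in Scala. The names (StreamSource, LongOffset, latestOffset,
getBatch, runBatch) are hypothetical, chosen for illustration; this is not
the actual DataSourceV2 interface, which is still in flux.

    // Hypothetical sketch of the engine/source contract -- illustrative
    // names only, not the real DataSourceV2 API.
    object CheckpointSketch {
      case class LongOffset(value: Long)

      trait StreamSource {
        /** The engine asks the source how far the stream has advanced. */
        def latestOffset(): LongOffset

        /** The engine hands back a closed range; the source must return
          * exactly the data in that range, on first run and on replay. */
        def getBatch(start: LongOffset, end: LongOffset): Seq[String]
      }

      /** Roughly what the engine does for each batch N. */
      def runBatch(source: StreamSource, start: LongOffset): LongOffset = {
        val end = source.latestOffset()
        // Persist (start, end) for batch N to the write-ahead offset log
        // in the checkpoint *before* processing, so that after a crash the
        // engine can ask the source to replay the identical range.
        val data = source.getBatch(start, end)
        data.foreach(println) // stand-in for running the actual query
        end
      }
    }

The key property is that getBatch(A, B) must be deterministic: re-running
batch N after a failure has to yield the same data, which is what lets the
engine own checkpointing while the source stays mostly stateless.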

SPARK-23323 is orthogonal. The commit coordinator it involves is
responsible for ensuring that, within a single Spark job, two different
task attempts can't commit output for the same partition.
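
As a toy illustration of the rule that coordinator enforces (this is not
Spark's actual OutputCommitCoordinator API; the class and method names are
made up), a first-asker-wins map keyed by partition is enough:

    import scala.collection.concurrent.TrieMap

    // Toy sketch: at most one task attempt may commit each partition.
    class CommitCoordinatorSketch {
      // partition index -> task attempt authorized to commit it
      private val committers = TrieMap.empty[Int, Long]

      /** The first attempt to ask for a partition wins; that same attempt
        * may ask again, but any other attempt is refused. */
      def canCommit(partition: Int, taskAttemptId: Long): Boolean =
        committers.putIfAbsent(partition, taskAttemptId)
          .forall(_ == taskAttemptId)
    }

This guards against, say, a speculative duplicate of a task committing the
same partition twice within one job. It says nothing about offsets or
cross-batch recovery, which is why it's orthogonal to the checkpointing
question above.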

On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> Wondering if this issue is related to SPARK-23323?
>
> Any pointers will be greatly appreciated...
>
> Thanks,
> Jayesh
>
> From: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
> Date: Monday, April 23, 2018 at 9:49 PM
> To: "dev@spark.apache.org" <dev@spark.apache.org>
> Subject: Datasource API V2 and checkpointing
>
> I was wondering: when checkpointing is enabled, who does the actual
> work? The streaming datasource or the execution engine/driver?
>
> I have written a small/trivial datasource that just generates strings.
> After enabling checkpointing, I do see a folder being created under the
> checkpoint folder, but there's nothing else in there.
>
> Same question for write-ahead logging and recovery.
>
> And on a restart from a failed streaming session, who should set the
> offsets? The driver/Spark or the datasource?
>
> Any pointers to design docs would also be greatly appreciated.
>
> Thanks,
> Jayesh
>
