The precise interactions with the DataSourceV2 API haven't been hammered out in the design yet. But much of this comes down to the core of Structured Streaming rather than to the API details.
The execution engine handles checkpointing and recovery. It asks the streaming data source for offsets, and then determines that batch N contains the data between offset A and offset B. On recovery, if batch N needs to be re-run, the execution engine just asks the source for the same offset range again. (Toy sketch 1 at the end of this mail illustrates that contract.)

Sources also get a handle to their own subfolder of the checkpoint, which they can use as scratch space if they need it. For example, Spark's FileStreamReader keeps a log of all the files it's seen, so its offsets can simply be indices into that log rather than huge strings containing all the paths. (Sketch 2 below.)

SPARK-23323 is orthogonal. That commit coordinator is responsible for ensuring that, within a single Spark job, two different tasks can't commit the same partition. (Sketch 3 below.)

On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:

> Wondering if this issue is related to SPARK-23323?
>
> Any pointers will be greatly appreciated….
>
> Thanks,
> Jayesh
>
> From: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
> Date: Monday, April 23, 2018 at 9:49 PM
> To: "dev@spark.apache.org" <dev@spark.apache.org>
> Subject: Datasource API V2 and checkpointing
>
> I was wondering, when checkpointing is enabled, who does the actual work?
> The streaming datasource or the execution engine/driver?
>
> I have written a small/trivial datasource that just generates strings.
> After enabling checkpointing, I do see a folder being created under the
> checkpoint folder, but there's nothing else in there.
>
> Same question for write-ahead and recovery?
> And on a restart from a failed streaming session - who should set the
> offsets? The driver/Spark or the datasource?
>
> Any pointers to design docs would also be greatly appreciated.
>
> Thanks,
> Jayesh
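
Sketch 1 is a minimal toy model of the engine/source offset contract described above. The names here (Offset, StreamingSource, latestOffset, getBatch) are invented for illustration; they are not the real DataSourceV2 interfaces, which are still in flux. The point it shows: the engine picks and logs the batch boundaries, and the source only has to replay a given range deterministically.

  // Sketch 1: toy model of the engine/source offset contract.
  // All names are invented -- NOT the real DataSourceV2 API.
  object OffsetContractSketch {

    // An offset is an opaque, serializable position in the stream.
    final case class Offset(value: Long)

    trait StreamingSource {
      // "How far could I read right now?" -- the engine calls this
      // when it plans a new batch.
      def latestOffset(): Offset

      // "Give me exactly the data in [start, end)." Must be repeatable:
      // on recovery the engine re-asks for the same range.
      def getBatch(start: Offset, end: Offset): Seq[String]
    }

    // A trivial source that just generates strings.
    class StringsSource extends StreamingSource {
      private val data = scala.collection.mutable.ArrayBuffer.empty[String]
      def append(s: String): Unit = data += s
      def latestOffset(): Offset = Offset(data.size)
      def getBatch(start: Offset, end: Offset): Seq[String] =
        data.slice(start.value.toInt, end.value.toInt).toList
    }

    def main(args: Array[String]): Unit = {
      val source = new StringsSource
      Seq("a", "b", "c").foreach(source.append)

      // The ENGINE picks the range and writes "batch 0 = [0, 3)" to its
      // own checkpoint log before running the batch...
      val start = Offset(0)
      val end   = source.latestOffset()
      println(source.getBatch(start, end)) // List(a, b, c)

      // ...so if batch 0 is re-run after a crash, the engine replays the
      // identical range and the source returns the identical data.
      println(source.getBatch(start, end)) // List(a, b, c) again
    }
  }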
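
Sketch 2 is a rough, hypothetical version of the file-log trick: persist the list of seen files in the source's checkpoint subfolder so that offsets stay small integers. FileListLog and its methods are made-up names, not Spark code, and a real implementation would also need compaction and proper handle cleanup.

  // Sketch 2: offsets as indices into a file log kept in the source's
  // checkpoint subfolder. Invented names, for illustration only.
  import java.io.{File, FileWriter}
  import scala.io.Source

  class FileListLog(checkpointSubdir: File) {
    private val logFile = new File(checkpointSubdir, "file-list.log")
    checkpointSubdir.mkdirs()

    // Append newly discovered paths, one per line. The offset after this
    // call is just the new line count -- a small number, not a giant
    // string containing every path.
    def record(paths: Seq[String]): Long = {
      val w = new FileWriter(logFile, true) // append mode
      try paths.foreach(p => w.write(p + "\n")) finally w.close()
      currentOffset()
    }

    def currentOffset(): Long =
      if (logFile.exists()) Source.fromFile(logFile).getLines().length else 0L

    // Recover the exact file list for a batch from two numeric offsets.
    def pathsBetween(start: Long, end: Long): Seq[String] =
      Source.fromFile(logFile).getLines().slice(start.toInt, end.toInt).toList
  }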
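
Sketch 3 boils the commit-coordinator guarantee down to a few lines. This is a toy, not Spark's actual OutputCommitCoordinator; it only demonstrates the invariant that, within one job, at most one task attempt can win the right to commit each partition.

  // Sketch 3: the invariant SPARK-23323's coordinator enforces, reduced
  // to a toy. Not Spark's real implementation.
  import scala.collection.concurrent.TrieMap

  class CommitCoordinatorSketch {
    // partition -> the attempt that won the right to commit it
    private val winners = TrieMap.empty[Int, Long]

    // The first attempt to ask for a partition wins; a retry or
    // speculative duplicate of the same partition is denied.
    def canCommit(partition: Int, attemptId: Long): Boolean =
      winners.putIfAbsent(partition, attemptId).forall(_ == attemptId)
    // e.g. canCommit(0, 1) == true, then canCommit(0, 2) == false
  }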