Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20710
> Data source writers need to be able to reason about what progress they've made, which is impossible in the streaming case if each epoch is its own disconnected query.
I don't think the writers necessarily need to reason about progress. Are you saying that there are guarantees the writers need to make, like the order in which data appears?
I'm thinking of an implementation where each task commit creates a file and the driver's commit operation makes those files available. That doesn't require any progress tracking in tasks.
As for a writer knowing that different epochs are part of the same query: why does it need to? Is there something the writer needs to do with that information? If so, I think that is more of an argument for a separate streaming interface, since batch implementations that ignore the epoch might otherwise do the wrong thing.