Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20710
> Data source writers need to be able to reason about what progress they've made, which is impossible in the streaming case if each epoch is its own disconnected query.
I don't think the writers necessarily need to reason about progress. Are you saying that there are guarantees the writers need to make, like the order in which data appears?
I'm thinking of an implementation where each task commit creates a file and the driver's commit operation makes those files available. That doesn't require any progress tracking in tasks.
As for a writer knowing that different epochs are part of the same query: why does it need to? Is there something the writer needs to do with that information? If so, I think that is more of an argument for a separate streaming interface, since batch implementations that ignore the epoch might otherwise do the wrong thing.