[ 
https://issues.apache.org/jira/browse/SPARK-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903510#comment-16903510
 ] 

Jungtaek Lim commented on SPARK-28650:
--------------------------------------

It sounds like either correctness or data-loss, and most of cases end users 
should change their implementation of open method to always return true for 
safety.

Are you planning to work on this? If you don't plan to address this sooner, I'd 
like to take this up.

Given we've changed guarantee, do we want to keep the signature of "open" as it 
is? By leaving it as it is, we still give a chance to skip writing but 
according to the guarantee it only makes sense when skipping whole batch.

> Fix the guarantee of ForeachWriter
> ----------------------------------
>
>                 Key: SPARK-28650
>                 URL: https://issues.apache.org/jira/browse/SPARK-28650
>             Project: Spark
>          Issue Type: Documentation
>          Components: Structured Streaming
>    Affects Versions: 2.4.3
>            Reporter: Shixiong Zhu
>            Priority: Major
>
> Right now ForeachWriter has the following guarantee:
> {code}
> If the streaming query is being executed in the micro-batch mode, then every 
> partition
> represented by a unique tuple (partitionId, epochId) is guaranteed to have 
> the same data.
> Hence, (partitionId, epochId) can be used to deduplicate and/or 
> transactionally commit data
> and achieve exactly-once guarantees.
> {code}
>  
> But we can break this easily actually when restarting a query but a batch is 
> re-run (e.g., upgrade Spark)
>  * Source returns a different DataFrame that has a different partition number 
> (e.g., we start to not create empty partitions in Kafka Source V2).
>  * A new added optimization rule may change the number of partitions in the 
> new run.
>  * Change the file split size in the new run.
> Since we cannot guarantee that the same (partitionId, epochId) has the same 
> data. We should update the document for "ForeachWriter".



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to