Shixiong Zhu created SPARK-28650:
------------------------------------

             Summary: Fix the guarantee of ForeachWriter
                 Key: SPARK-28650
                 URL: https://issues.apache.org/jira/browse/SPARK-28650
             Project: Spark
          Issue Type: Documentation
          Components: Structured Streaming
    Affects Versions: 2.4.3
            Reporter: Shixiong Zhu


Right now ForeachWriter has the following guarantee:

{code}

If the streaming query is being executed in the micro-batch mode, then every 
partition
represented by a unique tuple (partitionId, epochId) is guaranteed to have the 
same data.
Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally 
commit data
and achieve exactly-once guarantees.

{code}

 

But we can break this easily actually when restarting a query but a batch is 
re-run (e.g., upgrade Spark)
 * Source returns a different DataFrame that has a different partition number 
(e.g., we start to not create empty partitions in Kafka Source V2).
 * A new added optimization rule may change the number of partitions in the new 
run.
 * Change the file split size in the new run.

Since we cannot guarantee that the same (partitionId, epochId) has the same 
data. We should update the document for "ForeachWriter".



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to