Shixiong Zhu created SPARK-28650:
------------------------------------
Summary: Fix the guarantee of ForeachWriter
Key: SPARK-28650
URL: https://issues.apache.org/jira/browse/SPARK-28650
Project: Spark
Issue Type: Documentation
Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu
Right now ForeachWriter has the following guarantee:
{code}
If the streaming query is being executed in the micro-batch mode, then every partition represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data. Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data and achieve exactly-once guarantees.
{code}
But this guarantee can actually be broken easily when a query is restarted and a batch is re-run (e.g., after upgrading Spark):
* The source may return a DataFrame with a different number of partitions (e.g., we start to not create empty partitions in Kafka Source V2).
* A newly added optimization rule may change the number of partitions in the new run.
* The file split size may change in the new run.
Since we cannot guarantee that the same (partitionId, epochId) always carries the same data, we should update the documentation for "ForeachWriter".
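To see why deduplicating on (partitionId, epochId) is fragile, here is a small self-contained Python sketch (not Spark code; the sink, rows, and partition layouts are all hypothetical) of a sink that commits each (partitionId, epochId) key exactly once. If a restart changes how the source partitions the data, the "skip already-committed key" logic silently drops records:

```python
# Illustrative sketch only: simulates a sink that deduplicates on
# (partition_id, epoch_id), as the ForeachWriter doc suggests.
# The sink, rows, and partition layout are all hypothetical.

committed = {}  # (partition_id, epoch_id) -> rows durably written

def write_partition(partition_id, epoch_id, rows):
    """Write a partition's rows exactly once, keyed by (partition_id, epoch_id)."""
    key = (partition_id, epoch_id)
    if key in committed:
        return False  # key seen before: assume identical data, skip the write
    committed[key] = list(rows)
    return True

# Run 1 of epoch 0: the source yields two partitions; the query fails after
# partition 0 commits but before partition 1 does.
write_partition(0, 0, ["a", "b"])

# The query is restarted (e.g., after a Spark upgrade) and epoch 0 is re-run,
# but the source now packs all the data into a single partition.
write_partition(0, 0, ["a", "b", "c", "d"])  # skipped: (0, 0) already committed

written = sorted(r for rows in committed.values() for r in rows)
print(written)  # ['a', 'b'] -- rows "c" and "d" are silently lost
```

The dedup scheme is only safe if (partitionId, epochId) always maps to the same rows across runs, which is exactly the guarantee this issue says Spark cannot provide.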
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]