dongjoon-hyun commented on a change in pull request #25407:
[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter
URL: https://github.com/apache/spark/pull/25407#discussion_r312746952
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala
##########
@@ -50,14 +50,13 @@ import org.apache.spark.annotation.Evolving
*
* Important points to note:
* <ul>
- * <li>The `partitionId` and `epochId` can be used to deduplicate generated data when failures
- * cause reprocessing of some input data. This depends on the execution mode of the query. If
- * the streaming query is being executed in the micro-batch mode, then every partition
- * represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data.
- * Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data
- * and achieve exactly-once guarantees. However, if the streaming query is being executed in the
- * continuous mode, then this guarantee does not hold and therefore should not be used for
- * deduplication.
+ * <li>Spark doesn't guarantee same output for (partitionId, epochId) on failure, so deduplication
+ * cannot be achieved with (partitionId, epochId). Refer SPARK-28650 for more details.
Review comment:
SPARK-28650 contains only the following content. Could we embed this information in the doc comment and remove `Refer SPARK-28650 for more details`?

> But we can break this easily actually when restarting a query but a batch is re-run (e.g., upgrade Spark)
> * Source returns a different DataFrame that has a different partition number (e.g., we start to not create empty partitions in Kafka Source V2).
> * A new added optimization rule may change the number of partitions in the new run.
> * Change the file split size in the new run.
>
> Since we cannot guarantee that the same (partitionId, epochId) has the same data. We should update the document for "ForeachWriter".
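
To make the quoted scenario concrete, here is a minimal sketch in plain Scala. It does not use any Spark API: the `split` helper, the record values, and the "committed keys" set are all hypothetical stand-ins for how a source might partition the same batch differently on a re-run. It only illustrates why skipping already-committed (partitionId, epochId) keys is unsafe once the partition count changes.

```scala
// Hypothetical illustration only: plain Scala, no Spark APIs involved.
object PartitionIdDedupSketch {
  // Hypothetical helper: split one epoch's records into contiguous chunks
  // and key each chunk by (partitionId, epochId).
  def split(
      records: Seq[String],
      numPartitions: Int,
      epochId: Long): Map[(Int, Long), Seq[String]] = {
    val chunkSize =
      math.max(1, math.ceil(records.size.toDouble / numPartitions).toInt)
    records.grouped(chunkSize).zipWithIndex.map { case (chunk, partitionId) =>
      (partitionId, epochId) -> chunk
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val epochId = 7L
    val records = Seq("a", "b", "c", "d", "e", "f")

    // Same batch, but the re-run ends up with a different partition count.
    val firstRun = split(records, numPartitions = 2, epochId = epochId)
    val rerun = split(records, numPartitions = 3, epochId = epochId)

    val key = (0, epochId)
    println(s"first run $key -> ${firstRun(key)}")
    println(s"re-run    $key -> ${rerun(key)}")

    // The same key maps to different data, although the batch is unchanged.
    assert(firstRun(key) != rerun(key))
    assert(firstRun.values.flatten.toSet == rerun.values.flatten.toSet)

    // Assume the first run committed only partition 0 before failing, and
    // the re-run skips every key that is already marked as committed.
    val committed = Set(key)
    val rewritten = rerun.toSeq
      .sortBy { case ((partitionId, _), _) => partitionId }
      .collect { case (k, chunk) if !committed(k) => chunk }
      .flatten
    val sink = firstRun(key) ++ rewritten

    // Key-based deduplication wrote record "c" twice.
    assert(sink.count(_ == "c") == 2)
    println(s"sink contents: $sink")
  }
}
```

In this sketch the duplicate appears because partition 0 covered three records in the first run but only two in the re-run. With other splits the same mismatch could drop records instead, which is why the key alone is not a safe deduplication handle.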