HeartSaVioR commented on a change in pull request #25407: [SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter
URL: https://github.com/apache/spark/pull/25407#discussion_r315454295
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -2251,13 +2251,10 @@ When the streaming query is started, Spark calls the function or the object’s
 - The close() method (if it exists) is called if an open() method exists and returns successfully (irrespective of the return value), except if the JVM or Python process crashes in the middle.
-- **Note:** The partitionId and epochId in the open() method can be used to deduplicate generated data
- when failures cause reprocessing of some input data. This depends on the execution mode of the query.
- If the streaming query is being executed in the micro-batch mode, then every partition represented
- by a unique tuple (partition_id, epoch_id) is guaranteed to have the same data.
- Hence, (partition_id, epoch_id) can be used to deduplicate and/or transactionally commit
- data and achieve exactly-once guarantees. However, if the streaming query is being executed
- in the continuous mode, then this guarantee does not hold and therefore should not be used for deduplication.
+- **Note:** Spark does not guarantee same output for (partitionId, epochId) on failure, so deduplication

Review comment:
Great point! Updated.

Btw, while I'm also changing the fault-tolerance entry for Foreach Sink from `Depends on the implementation` to `Yes (at-least-once)`, your screenshot seems to point at File Sink. Doesn't File Sink guarantee exactly-once for the corresponding Spark query via its own metadata? I guess it guarantees a unique write per batch.

If that's not the case and you've found another broken fault-tolerance guarantee for File Sink, I feel it would be nice to have another JIRA (or at least another PR) to track it separately, with a description of the new finding.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
