HeartSaVioR commented on issue #23840: [WIP][DISCUSSION_NEEDED][SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata URL: https://github.com/apache/spark/pull/23840#issuecomment-467245737

I'm now seeing that the metadata path (within the checkpoint root) is injected only into Sources, so if we really want to incorporate Sink metadata into the query checkpoint, it requires a DSv2 change on the Sink side. I guess that change won't happen unless we have a concrete and compelling use case, and even if it does, it could only land in Spark 3.0.0 and later.

Let's see how another sink (KafkaSink) is implemented: https://github.com/apache/spark/blob/4baa2d4449e103b15370d284b0ffdf09b4a9c1b7/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSink.scala#L26-L43

It only keeps latestBatchId in memory, which means the rows could be requested to be written again when the query restarts - that's OK because the Kafka sink only guarantees at-least-once. I guess we couldn't take the same approach to achieve exactly-once in File Stream Sink. At best it would achieve a weak form of exactly-once, i.e. at-least-once plus idempotence, assuming that rewriting a batch is idempotent - but I'm not 100% sure about that. A minimal sketch of that pattern follows. Might it be better to initiate a discussion on the dev. mailing list?
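For reference, the gist of the linked KafkaSink code is roughly the pattern below. This is a simplified sketch rather than the exact source: `InMemoryDedupSink` and `writeBatch` are illustrative names, and the actual Kafka write path is abstracted into a function parameter.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Sketch of the KafkaSink-style dedup: track only the latest batch id in memory.
// After a query restart the field resets to -1, so the last committed batch may
// be handed to the sink again - hence at-least-once, not exactly-once.
class InMemoryDedupSink(writeBatch: DataFrame => Unit) extends Sink {
  @volatile private var latestBatchId = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId <= latestBatchId) {
      // Batch was already handled within this run; skip the duplicate write.
    } else {
      writeBatch(data)
      latestBatchId = batchId
    }
  }
}
```

With this pattern, duplicates are only suppressed within a single run; upgrading it to (even weak) exactly-once for File Stream Sink would hinge on rewriting a batch's files being idempotent, which is the open question above.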
