HeartSaVioR edited a comment on issue #25618: [SPARK-28908][SS]Implement Kafka EOS sink for Structured Streaming URL: https://github.com/apache/spark/pull/25618#issuecomment-526436430

Spark doesn't have 2PC semantics natively, as you've seen in the DSv2 API - Spark's HDFS sink doesn't leverage 2PC. If I understand correctly, it previously used a temporary directory: all tasks wrote into that directory, and the driver moved the directory to the final destination only once all tasks had succeeded in writing. This leveraged the fact that "rename" is atomic, so it didn't provide "exactly-once" if the underlying filesystem doesn't support atomic renaming.

Now it leverages metadata instead: all tasks write their files and pass the list of written file paths back to the driver. Once the driver has received the lists of written files from all tasks, it writes the overall file list to metadata. So exactly-once for HDFS is only guaranteed when the output is read by "Spark", which is aware of the metadata information.
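The metadata-based commit described above can be sketched as follows. This is a minimal Python simulation, not Spark's actual file-sink classes: the function names (`task_write`, `driver_commit`, `reader_list_files`) and the `_metadata-<batchId>` naming are illustrative assumptions, chosen only to show why a metadata-aware reader sees a batch all-or-nothing while stray files from failed tasks stay invisible.

```python
# Illustrative sketch (NOT Spark's real implementation) of metadata-based
# exactly-once output: tasks write files independently, report paths to the
# driver, and the driver publishes one manifest only after all tasks succeed.
import json
import os
import tempfile


def task_write(out_dir: str, task_id: int, rows: list) -> str:
    """Each 'task' writes its own file and returns the path it wrote."""
    path = os.path.join(out_dir, f"part-{task_id:05d}.json")
    with open(path, "w") as f:
        json.dump(rows, f)
    return path


def driver_commit(out_dir: str, batch_id: int, written_paths: list) -> str:
    """The 'driver' collects all task paths, then publishes the manifest.

    Writing to a temp file and renaming keeps the manifest write itself
    atomic, so the whole batch becomes visible all at once."""
    manifest = os.path.join(out_dir, f"_metadata-{batch_id}")
    tmp = manifest + ".tmp"
    with open(tmp, "w") as f:
        json.dump(written_paths, f)
    os.replace(tmp, manifest)  # atomic rename on POSIX filesystems
    return manifest


def reader_list_files(out_dir: str, batch_id: int) -> list:
    """A metadata-aware reader trusts only files listed in the manifest,
    ignoring orphan files left behind by failed or speculative tasks."""
    with open(os.path.join(out_dir, f"_metadata-{batch_id}")) as f:
        return json.load(f)


out_dir = tempfile.mkdtemp()
paths = [task_write(out_dir, i, [{"task": i}]) for i in range(3)]
# A leftover file from a failed attempt sits in the same directory...
orphan = task_write(out_dir, 99, [{"task": "orphan"}])
driver_commit(out_dir, batch_id=0, written_paths=paths)
# ...but a metadata-aware reader never sees it:
committed = reader_list_files(out_dir, 0)
assert sorted(committed) == sorted(paths)
assert orphan not in committed
```

A reader that merely lists the directory (i.e. any non-Spark consumer unaware of the metadata) would still see the orphan file, which is exactly why the exactly-once guarantee holds only for metadata-aware readers.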
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org