HeartSaVioR edited a comment on issue #25618: [SPARK-28908][SS]Implement Kafka EOS sink for Structured Streaming URL: https://github.com/apache/spark/pull/25618#issuecomment-526436430

Spark doesn't have 2PC semantics natively, as you've seen in the DSv2 API - Spark's HDFS sink doesn't leverage 2PC. If I understand correctly, it previously used a temporary directory: all tasks wrote into that directory, and the driver moved the directory to the final destination only once all tasks had succeeded in writing. This leveraged the fact that "rename" is atomic, so it didn't provide "exactly-once" if the underlying filesystem doesn't support atomic renaming.

Now it leverages metadata instead: all tasks write their files and pass the list of written file paths back to the driver. Once the driver has received the lists of written files from all tasks, it writes the overall file list to metadata. So exactly-once for HDFS is only guaranteed when the output is read by "Spark", which is aware of the metadata information.
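The metadata-based commit described above can be sketched as follows. This is a minimal Python simulation, not Spark's actual file-sink classes: the function names (`task_write`, `driver_commit`, `reader_list_files`) and the `_metadata-<batchId>` naming are illustrative assumptions, chosen only to show why a metadata-aware reader sees a batch all-or-nothing while stray files from failed tasks stay invisible.

```python
# Illustrative sketch (NOT Spark's real implementation) of metadata-based
# exactly-once output: tasks write files independently, report paths to the
# driver, and the driver publishes one manifest only after all tasks succeed.
import json
import os
import tempfile


def task_write(out_dir: str, task_id: int, rows: list) -> str:
    """Each 'task' writes its own file and returns the path it wrote."""
    path = os.path.join(out_dir, f"part-{task_id:05d}.json")
    with open(path, "w") as f:
        json.dump(rows, f)
    return path


def driver_commit(out_dir: str, batch_id: int, written_paths: list) -> str:
    """The 'driver' collects all task paths, then publishes the manifest.

    Writing to a temp file and renaming keeps the manifest write itself
    atomic, so the whole batch becomes visible all at once."""
    manifest = os.path.join(out_dir, f"_metadata-{batch_id}")
    tmp = manifest + ".tmp"
    with open(tmp, "w") as f:
        json.dump(written_paths, f)
    os.replace(tmp, manifest)  # atomic rename on POSIX filesystems
    return manifest


def reader_list_files(out_dir: str, batch_id: int) -> list:
    """A metadata-aware reader trusts only files listed in the manifest,
    ignoring orphan files left behind by failed or speculative tasks."""
    with open(os.path.join(out_dir, f"_metadata-{batch_id}")) as f:
        return json.load(f)


out_dir = tempfile.mkdtemp()
paths = [task_write(out_dir, i, [{"task": i}]) for i in range(3)]
# A leftover file from a failed attempt sits in the same directory...
orphan = task_write(out_dir, 99, [{"task": "orphan"}])
driver_commit(out_dir, batch_id=0, written_paths=paths)
# ...but a metadata-aware reader never sees it:
committed = reader_list_files(out_dir, 0)
assert sorted(committed) == sorted(paths)
assert orphan not in committed
```

A reader that merely lists the directory (i.e. any non-Spark consumer unaware of the metadata) would still see the orphan file, which is exactly why the exactly-once guarantee holds only for metadata-aware readers.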
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org