HeartSaVioR commented on issue #23840: 
[WIP][DISCUSSION_NEEDED][SPARK-24295][SS] Add option to retain only last batch 
in file stream sink metadata
URL: https://github.com/apache/spark/pull/23840#issuecomment-467245737
 
 
   I'm now seeing that the metadata path (within the checkpoint root) is injected only into Sources, which would require a DSv2 change on the Sink side if we really want to incorporate Sink metadata into the query checkpoint. I guess this wouldn't happen unless we have a concrete and compelling use case, and even if it does, the change could only land in Spark 3.0.0 and later.
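   For reference, the asymmetry is visible in the Spark 2.x provider traits (a paraphrased sketch of `org.apache.spark.sql.sources`, not verbatim; the `sourceSchema` method is omitted):
   
   ```scala
   import org.apache.spark.sql.SQLContext
   import org.apache.spark.sql.execution.streaming.{Sink, Source}
   import org.apache.spark.sql.streaming.OutputMode
   import org.apache.spark.sql.types.StructType

   trait StreamSourceProvider {
     // The engine hands the source a metadataPath under the checkpoint root...
     def createSource(
         sqlContext: SQLContext,
         metadataPath: String,
         schema: Option[StructType],
         providerName: String,
         parameters: Map[String, String]): Source
   }

   trait StreamSinkProvider {
     // ...but there's no metadataPath parameter on the sink side, so wiring
     // sink metadata into the checkpoint would need an API (DSv2) change.
     def createSink(
         sqlContext: SQLContext,
         parameters: Map[String, String],
         partitionColumns: Seq[String],
         outputMode: OutputMode): Sink
   }
   ```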
   
   Let's see how another sink (KafkaSink) is implemented:
   
   
https://github.com/apache/spark/blob/4baa2d4449e103b15370d284b0ffdf09b4a9c1b7/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSink.scala#L26-L43
   
   It only keeps `latestBatchId` in memory, which means the rows of the last batch could be asked to be written again when the query restarts. That's acceptable because the Kafka sink only guarantees at-least-once.
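   Roughly, the dedup pattern at that link boils down to this (a simplified, self-contained sketch, not the verbatim KafkaSink code; `InMemoryDedupSink` and `writeRows` are illustrative names):
   
   ```scala
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.execution.streaming.Sink

   // Tracks the last committed batch id only in memory. After a query restart
   // it resets, so the engine may hand us the last batch again => at-least-once.
   class InMemoryDedupSink(writeRows: DataFrame => Unit) extends Sink {
     @volatile private var latestBatchId = -1L

     override def addBatch(batchId: Long, data: DataFrame): Unit = {
       if (batchId <= latestBatchId) {
         // Already written during this run of the query; skip the replay.
       } else {
         writeRows(data) // may execute again across restarts
         latestBatchId = batchId
       }
     }
   }
   ```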
   
   I guess we couldn't take the same approach to achieve exactly-once in the File Stream Sink. At best it would achieve a weak form of exactly-once via at-least-once plus idempotent writes, assuming rewriting a batch is idempotent - but I'm not 100% sure about that.
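   To make the idea concrete, a hypothetical idempotent-rewrite scheme (not FileStreamSink's actual code) could derive each output file name deterministically from (batchId, partitionId), so a replayed batch overwrites the same files rather than duplicating rows - which only works out to effectively-once if the batch content itself is deterministic across retries:
   
   ```scala
   import java.nio.file.{Files, Paths, StandardOpenOption}

   // Hypothetical sketch: a deterministic file name per (batchId, partitionId)
   // means a replay truncates and rewrites the same file instead of adding a
   // duplicate, turning at-least-once replay into effectively-once output.
   def writePartitionIdempotently(
       outputDir: String,
       batchId: Long,
       partitionId: Int,
       rows: Iterator[String]): Unit = {
     val path = Paths.get(outputDir, s"batch-$batchId-part-$partitionId.txt")
     Files.write(path, rows.mkString("\n").getBytes("UTF-8"),
       StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
   }
   ```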
   
   Might be better to initiate a discussion on the dev mailing list?
