HeartSaVioR commented on issue #25618: [SPARK-28908][SS] Implement Kafka EOS sink for Structured Streaming
URL: https://github.com/apache/spark/pull/25618#issuecomment-526394910
 
 
Just skimmed the design doc (I still need to take a deeper look at the fault-tolerance part), and it's basically the known approach Flink uses (2PC). Please mention what inspired the design and, for the same reason, give credit.
   
I was planning to propose something similar before; more precisely, I asked for 2PC support at the DSv2 API level, since Spark doesn't support 2PC natively. The feedback wasn't positive, though, as it would be a very invasive change to the Spark codebase. There have been more cases asking for exactly-once writes, and I guess the common answer has been to leverage an intermediate output. While some storages can use that approach (e.g. an RDBMS: writers write to a temp table, and the driver copies the rows the writers reported into the output table; see the sketch below), it doesn't make sense for Kafka, at least for performance reasons, since there's no way to have Kafka copy records from topic A to topic B server-side (right?). So I gave up.
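To make the intermediate-output idea concrete, here's a minimal sketch of the RDBMS variant. Everything in it is an illustrative assumption, not an existing Spark API: the table names, the JDBC URL, and the (batch_id, task_id) tagging scheme are all hypothetical. Tasks write into a staging table, and the driver promotes only the rows of tasks that reported success, inside one database transaction:

```scala
import java.sql.DriverManager

// Hypothetical driver-side "phase 2": promote rows from the staging table
// into the output table for every task that reported success. The copy and
// the cleanup happen inside one database transaction, so the batch becomes
// visible exactly once. Kafka has no server-side equivalent of this
// INSERT ... SELECT, which is why the approach doesn't transfer.
def commitBatchOnDriver(jdbcUrl: String, batchId: Long, committedTaskIds: Seq[Int]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)
    val stmt = conn.createStatement()
    committedTaskIds.foreach { taskId =>
      stmt.executeUpdate(
        s"INSERT INTO output SELECT * FROM staging " +
          s"WHERE batch_id = $batchId AND task_id = $taskId")
    }
    stmt.executeUpdate(s"DELETE FROM staging WHERE batch_id = $batchId")
    conn.commit() // single atomic point: either the whole batch shows up or none of it
  } finally {
    conn.close()
  }
}
```

The copy step is cheap because it stays inside the database; replicating it for Kafka would mean re-reading and re-producing every record across topics.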
   
If the code change implements 2PC correctly, I guess it would work in many cases, though, as the doc explains, the transaction timeout can lead to data loss. I pointed out the transaction-timeout issue when I designed this, and it was one of my major concerns as well: whatever the producer writes must be committed within the timeout, under any kind of failure, otherwise data loss happens. Even if we decide to invalidate that batch and rerun it, we're then at "at-least-once". (I'm wondering whether Flink's Kafka producer with 2PC has a similar issue, or whether it has some safeguard.) The sketch below illustrates the timeout window.
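For reference, here's a minimal sketch of that window using a plain Kafka transactional producer. This is illustrative, not the PR's implementation; the broker address, transactional id, topic, and 60s timeout are placeholder assumptions:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")     // placeholder address
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "query-1-epoch-42") // stable id, needed to resume/fence on recovery
props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "60000")         // 60s: the hard deadline in question
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
producer.beginTransaction()
producer.send(new ProducerRecord("output-topic", "key", "value"))
// If the job stalls or crashes here and recovery takes longer than
// transaction.timeout.ms, the broker aborts the open transaction on its own,
// and the records above are lost. Rerunning the whole batch instead only
// gets us back to at-least-once, since any sibling transaction that did
// commit in time would now have its records written twice.
producer.commitTransaction()
producer.close()
```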
