jerrypeng commented on PR #56314: URL: https://github.com/apache/spark/pull/56314#issuecomment-4633774793
@viirya thank you for the review. Addressing your comments inline.

> Comparison table: Continuous Processing should be "At-least-once", not "Exactly-once"

I want to distinguish between exactly-once processing guarantees and at-least-once delivery in sinks. An exactly-once processing guarantee means that changes to engine-managed state resulting from processing rows are applied **effectively** once. At-least-once delivery means output will be written to the external system at least once, i.e. duplicates are possible. I think it is an important distinction to make.

Real-time Mode offers exactly-once processing semantics, just like the existing engine. The difference is in the sinks it supports. The only sink that supports exactly-once delivery is the Delta sink (through idempotent writes). The Kafka sink supports at-least-once delivery semantics regardless of whether Real-time Mode is used. I want to call this out in the documentation. In theory you can write an exactly-once sink for RTM; there is just no implementation of it yet.

With regard to continuous mode, it does not support state, so the argument is moot there.

Let's do this:

1. Clearly define the terms.
2. Clarify what is supported in continuous mode.

> The page should state that Real-time Mode is experimental

I think this is a mistake. The Real-time Mode APIs are stable. Let me create a PR to remove the experimental annotation.

I will address the minor nits as well.
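
To make the distinction between exactly-once processing and at-least-once delivery concrete, here is a minimal toy simulation. It is plain Python, not Spark code, and it does not describe how Real-time Mode, the Kafka sink, or the Delta sink is implemented. All names (`AppendSink`, `IdempotentSink`, `run`, `fail_after_write_on`) are hypothetical. The sketch assumes a batch is replayed from the last checkpoint when a failure happens after the sink write but before the state commit.

```python
class AppendSink:
    """At-least-once delivery: every write attempt is appended."""

    def __init__(self):
        self.rows = []

    def write(self, batch_id, rows):
        self.rows.extend(rows)


class IdempotentSink:
    """Effectively-once delivery: a replayed batch_id overwrites itself."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

    @property
    def rows(self):
        return [r for b in sorted(self.batches) for r in self.batches[b]]


def run(batches, sink, fail_after_write_on=None):
    """Run a word count; optionally crash once after writing one batch."""
    committed_state = {}  # engine state as of the last checkpoint
    for batch_id, words in enumerate(batches):
        attempts = 2 if batch_id == fail_after_write_on else 1
        for attempt in range(attempts):
            # Every attempt starts from the last committed state, so a
            # replay cannot double-apply its updates.
            state = dict(committed_state)
            for w in words:
                state[w] = state.get(w, 0) + 1
            out = [(w, state[w]) for w in sorted(set(words))]
            sink.write(batch_id, out)
            crashed = attempt < attempts - 1  # simulated failure before commit
            if not crashed:
                committed_state = state  # checkpoint the new state
    return committed_state


batches = [["a", "b"], ["a"]]

append_sink = AppendSink()
state_append = run(batches, append_sink, fail_after_write_on=1)

idempotent_sink = IdempotentSink()
state_idem = run(batches, idempotent_sink, fail_after_write_on=1)

print(state_append, append_sink.rows)
print(state_idem, idempotent_sink.rows)
```

In both runs the final state is the same, because the replayed batch starts from the checkpoint (state is applied effectively once). Only the append-style sink ends up with a duplicate row for the replayed batch, which is the at-least-once delivery case.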
