gaborgsomogyi commented on issue #25618: [SPARK-28908][SS] Implement Kafka EOS sink for Structured Streaming URL: https://github.com/apache/spark/pull/25618#issuecomment-529482380

So we've sat down together with the Flink guys and had a deeper look at this area. They basically have a similar solution: data is sent to Kafka with `flush`, then the producer ID and the transaction ID are stored in the checkpoint (the producer ID is needed to have exactly one producer, the transaction ID is needed to know what to commit/roll back). If any problem arises, Flink can recover from there and retry committing the transaction. The whole flow assumes that once the data could be generated and sent to Kafka, committing it is only a matter of time. To make that hold, the transaction timeout is increased significantly, enough to cover a possible broker restart or the like.

I have mainly 2 concerns with this PR:
* As @HeartSaVioR mentioned, the actual PR is a lightweight version of 2PC (please see the details above).
* Even if we establish the contract between Spark and Kafka that `it's only a matter of time to commit the data`, data may still be lost when the transaction times out (Flink suffers from this as well). From a user perspective I would not be happy if Spark said: `We've lost some data, but it's not Spark's fault; Kafka didn't do the commit in time.`

I think it would be a good feature for both Spark and Flink if Kafka could somehow turn off the transaction timeout.
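For reference, here is a rough, minimal sketch (in Scala, against the plain Kafka producer API) of the flow described above: open a transaction, `flush` the data, persist the transactional id alongside the checkpoint, and only commit afterwards. The `queryId`/`partitionId`/`epochId` names and the `transactional.id` scheme are purely illustrative assumptions, not what this PR or Flink actually does; `transaction.timeout.ms` is the producer-side timeout I'm referring to, and the broker caps it with `transaction.max.timeout.ms`.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaEosSketch {
  // Hypothetical identifiers, used only to derive a stable transactional.id that can be
  // written into the checkpoint and mapped back to exactly one open transaction.
  def transactionalProducer(queryId: String, partitionId: Int, epochId: Long): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, s"$queryId-$partitionId-$epochId")
    // The transaction has to stay open until the checkpoint commits, so the timeout must
    // cover recovery time; the broker rejects values above transaction.max.timeout.ms.
    props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, (15 * 60 * 1000).toString)
    new KafkaProducer[String, String](props)
  }

  def main(args: Array[String]): Unit = {
    val producer = transactionalProducer("query-1", 0, 42L)

    // Phase 1: write the epoch's data and flush it to the broker, but do not commit yet.
    producer.initTransactions()
    producer.beginTransaction()
    producer.send(new ProducerRecord("topic", "key", "value"))
    producer.flush()
    // ... at this point the transactional id (and producer metadata) would be persisted
    // in the checkpoint, so a restarted job knows what it still has to commit ...

    // Phase 2: commit once the checkpoint is durable. If the job dies in between, the
    // recovered job would have to re-attach to the same transaction and retry the commit,
    // which the public KafkaProducer API does not expose -- that is the 2PC gap above.
    producer.commitTransaction()
    producer.close()
  }
}
```

If the broker-side cap cannot be raised far enough, this is exactly where the timeout-based data loss shows up: the broker aborts the transaction before the recovered job gets a chance to commit it.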
