gaborgsomogyi commented on issue #25618: [SPARK-28908][SS] Implement Kafka EOS sink for Structured Streaming URL: https://github.com/apache/spark/pull/25618#issuecomment-529482380

So we've sat down together with the Flink guys and had a deeper look at this area. They basically have a similar solution: data is sent to Kafka with `flush`, then the producer ID and the transaction ID are stored in the checkpoint (the producer ID is needed to have exactly one producer, the transaction ID is needed to know what to commit/roll back). If any problem arises, Flink can recover from there and retry committing the transaction. The whole flow assumes that once the data could be generated and sent to Kafka, committing it is only a matter of time. To make that hold, the transaction timeout is increased significantly, enough to cover a possible broker restart or the like.

I have mainly 2 concerns with this PR:
* As @HeartSaVioR mentioned, the actual PR is a lightweight version of 2PC (please see the details above).
* Even if we establish the contract between Spark and Kafka that `it's only a matter of time to commit the data`, data may still be lost when the transaction times out (Flink suffers from this as well). From a user perspective I would not be happy if Spark said: `We've lost some data, but it's not Spark's fault; Kafka didn't do the commit in time.`

I think it would be a good feature for both Spark and Flink if Kafka could somehow turn off the transaction timeout.
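For reference, here is a rough, minimal sketch (in Scala, against the plain Kafka producer API) of the flow described above: open a transaction, `flush` the data, persist the transactional id alongside the checkpoint, and only commit afterwards. The `queryId`/`partitionId`/`epochId` names and the `transactional.id` scheme are purely illustrative assumptions, not what this PR or Flink actually does; `transaction.timeout.ms` is the producer-side timeout I'm referring to, and the broker caps it with `transaction.max.timeout.ms`.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaEosSketch {
  // Hypothetical identifiers, used only to derive a stable transactional.id that can be
  // written into the checkpoint and mapped back to exactly one open transaction.
  def transactionalProducer(queryId: String, partitionId: Int, epochId: Long): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, s"$queryId-$partitionId-$epochId")
    // The transaction has to stay open until the checkpoint commits, so the timeout must
    // cover recovery time; the broker rejects values above transaction.max.timeout.ms.
    props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, (15 * 60 * 1000).toString)
    new KafkaProducer[String, String](props)
  }

  def main(args: Array[String]): Unit = {
    val producer = transactionalProducer("query-1", 0, 42L)

    // Phase 1: write the epoch's data and flush it to the broker, but do not commit yet.
    producer.initTransactions()
    producer.beginTransaction()
    producer.send(new ProducerRecord("topic", "key", "value"))
    producer.flush()
    // ... at this point the transactional id (and producer metadata) would be persisted
    // in the checkpoint, so a restarted job knows what it still has to commit ...

    // Phase 2: commit once the checkpoint is durable. If the job dies in between, the
    // recovered job would have to re-attach to the same transaction and retry the commit,
    // which the public KafkaProducer API does not expose -- that is the 2PC gap above.
    producer.commitTransaction()
    producer.close()
  }
}
```

If the broker-side cap cannot be raised far enough, this is exactly where the timeout-based data loss shows up: the broker aborts the transaction before the recovered job gets a chance to commit it.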
