Github user rangadi commented on the issue: https://github.com/apache/flink/pull/4239 May be an extra shuffle to make small batches could help. Another option is to buffer all the records in state and write them all inside commit(). But not sure how costly it is to save all the records in checkpointed state. Another issue I see with using random txn id : if a worker looks unresponsive and work is moved to another worker, it is possible that the old worker still lingers around with open transaction. That would imply it the exactly-once consumers can not read past that transaction as long as it is open. I didn't know it was possible to resume a transaction since it was not part of producer API. This PR uses an undocumented way to do it.. do you know if Kafka Streams also does something like that? May be the producer will support `resumeTransaction()` properly in future.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---