[ https://issues.apache.org/jira/browse/KAFKA-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572763#comment-16572763 ]
Jason Gustafson commented on KAFKA-7190:
----------------------------------------

This is a tough one. To guarantee transaction semantics, we need to retain producer state in the log. Without that state, our only options are to raise an error or weaken semantics. The problem with deleting beyond the LSO is that we may lose the producer state of an active transaction. As I understand it, the proposal here is to retain the state in memory even though we have lost it in the log, but in the worst case we would still end up raising the UNKNOWN_PRODUCER error. The log is ultimately the source of truth for producer state.

Doesn't it seem odd that a call to DeleteRecords can effectively kill a producer with an active transaction? What I'm wondering is whether deletion can be "soft" in the case that the requested offset is higher than the LSO. We could advance the log start offset to the new offset but retain the data in the log until the LSO has reached the new log start offset. Then we could guarantee that the producer state of an active transaction is never lost.

This is useful because if a transactional produce request arrives and we have no producer state, then we know it is either the start of a new transaction, which is safe to allow, or a stale write from a fenced producer. The holy grail is being able to distinguish between these two cases. One option I was thinking about is letting each transaction start at sequence number 0. This would allow us to distinguish the two cases for all but the first record in a transaction (see the sketch after the quoted issue below). Leaving that one loose end is not satisfying, but technically it was already loose before: it is possible today for a producer to start a transaction and then become a zombie. If its transaction gets aborted by the coordinator and the state is lost due to a call to DeleteRecords, then the zombie can still wake up and write to the partition. I'm not sure yet how we'll fix this, but the point is that we have to fix it anyway.

> Under low traffic conditions purging repartition topics cause WARN statements about UNKNOWN_PRODUCER_ID
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7190
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7190
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core, streams
>    Affects Versions: 1.1.0, 1.1.1
>            Reporter: Bill Bejeck
>            Assignee: lambdaliu
>            Priority: Major
>
> When a streams application has little traffic, it is possible that consumer purging deletes even the last message sent by a producer (i.e., all the messages sent by that producer have been consumed and committed), and as a result the broker deletes that producer's ID. The next time this producer tries to send, it will get the UNKNOWN_PRODUCER_ID error code, but in this case the error is retriable: the producer just gets a new producer ID and retries, and this time it will succeed.
>
> Possible fixes could be on the broker side, i.e., delaying the deletion of the producer IDs for a longer period, or on the streams side, developing a more conservative approach to deleting offsets from repartition topics.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
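
To make the per-transaction sequence numbering idea above concrete, here is a minimal Java sketch of the broker-side decision it would enable. This is illustrative only, not Kafka's actual broker code: the names (TxnSequenceCheck, onTransactionalAppend) and the in-memory map standing in for producer state are hypothetical, and it assumes the proposed scheme in which every transaction's sequence numbers restart at 0.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed check, not Kafka's actual broker
// code. Assumes every transaction's sequence numbers restart at 0.
public class TxnSequenceCheck {

    enum Decision { ACCEPT, REJECT_STALE, AMBIGUOUS }

    // Last sequence number seen per producer ID. A missing entry models
    // producer state that was lost, e.g. because a DeleteRecords call
    // removed the log segments that held it.
    private final Map<Long, Integer> lastSequence = new HashMap<>();

    Decision onTransactionalAppend(long producerId, int firstSequence) {
        Integer last = lastSequence.get(producerId);
        if (last == null) {
            if (firstSequence > 0) {
                // A mid-transaction record with no producer state: since the
                // "soft delete" proposal guarantees that an active
                // transaction's state is never lost, this must be a stale
                // write from a fenced producer and can be rejected.
                return Decision.REJECT_STALE;
            }
            // firstSequence == 0: either a genuinely new transaction or the
            // first record from a zombie producer (the one remaining
            // ambiguity noted in the comment above).
            return Decision.AMBIGUOUS;
        }
        if (firstSequence == last + 1 || firstSequence == 0) {
            // In-order continuation, or the start of the next transaction.
            lastSequence.put(producerId, firstSequence);
            return Decision.ACCEPT;
        }
        return Decision.REJECT_STALE; // out of order: treat as invalid
    }
}
{code}

A real broker would also check producer epochs and distinguish out-of-order sequences from fencing; the no-state branch is the only part the proposal changes.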