[ 
https://issues.apache.org/jira/browse/KAFKA-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781148#comment-17781148
 ] 

Justine Olshan commented on KAFKA-14402:
----------------------------------------

Sorry the wording is unclear. I will fix this. For older clients, the server 
will verify the partitions are in a given transaction. Older clients require no 
changes. If we changed them, they would be considered new (version) clients.

The idea of KIP-890 part 1 would be that there were no client changes required. 
The flow is the same from the client side. Just on the server side we verify 
the transaction is ongoing. Hopefully this makes sense.

> Transactions Server Side Defense
> --------------------------------
>
>                 Key: KAFKA-14402
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14402
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 3.5.0
>            Reporter: Justine Olshan
>            Assignee: Justine Olshan
>            Priority: Major
>
> We have seen hanging transactions in Kafka where the last stable offset (LSO) 
> does not update, we can’t clean the log (if the topic is compacted), and 
> read_committed consumers get stuck.
> This can happen when a message gets stuck or delayed due to networking issues 
> or a network partition, the transaction aborts, and then the delayed message 
> finally comes in. The delayed message case can also violate EOS if the 
> delayed message comes in after the next addPartitionsToTxn request comes in. 
> Effectively we may see a message from a previous (aborted) transaction become 
> part of the next transaction.
> Another way hanging transactions can occur is that a client is buggy and may 
> somehow try to write to a partition before it adds the partition to the 
> transaction. In both of these cases, we want the server to have some control 
> to prevent these incorrect records from being written and either causing 
> hanging transactions or violating Exactly once semantics (EOS) by including 
> records in the wrong transaction.
> The best way to avoid this issue is to:
>  # *Uniquely identify transactions by bumping the producer epoch after every 
> commit/abort marker. That way, each transaction can be identified by 
> (producer id, epoch).* 
>  # {*}Remove the addPartitionsToTxn call and implicitly just add partitions 
> to the transaction on the first produce request during a transaction{*}.
> We avoid the late arrival case because the transaction is uniquely identified 
> and fenced AND we avoid the buggy client case because we remove the need for 
> the client to explicitly add partitions to begin the transaction.
> Of course, 1 and 2 require client-side changes, so for older clients, those 
> approaches won’t apply.
> 3. *To cover older clients, we will ensure a transaction is ongoing before we 
> write to a transaction. We can do this by querying the transaction 
> coordinator and caching the result.*
>  
> See KIP-890 for more information: ** 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to