[ 
https://issues.apache.org/jira/browse/KAFKA-20407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079109#comment-18079109
 ] 

sanghyeok An commented on KAFKA-20407:
--------------------------------------

[~jolshan] Thanks for your comments!
I apologize if my previous explanation was insufficient. 

Yes, the main idea is to make it easier to identify whether the transaction 
state log write path is contributing to transaction latency.

I would not say that this metric alone directly points to a specific 
remediation. The underlying cause could still be local storage, replication/ISR 
delays, leader movement, or general broker load. Rather, the metric would help 
narrow down the investigation by separating the transaction state write path 
from other parts of transaction handling, such as request processing or marker 
propagation.

Today, if transaction operations become slow, request-level metrics do not 
clearly show whether appending transaction state transitions to 
__transaction_state is involved. A dedicated metric around this path would 
provide more transaction-specific attribution. Operators could then correlate 
it with existing broker, disk, ISR, and leadership metrics to understand the 
likely cause.

So I see this primarily as a diagnostic/triage metric, not a metric that by 
itself prescribes a remediation action.

What do you think?

> Consider adding transaction state log append latency metrics
> ------------------------------------------------------------
>
>                 Key: KAFKA-20407
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20407
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: sanghyeok An
>            Assignee: sanghyeok An
>            Priority: Minor
>              Labels: needs-kip, transaction
>
> Slow appends to __transaction_state can affect transaction operations such as 
> {*}InitProducerId{*}, {*}AddPartitionsToTxn{*}, and {*}EndTxn{*}.
>  
> When transaction latency increases, it is difficult to distinguish whether 
> the slowdown comes from the request/network path or from appending 
> transaction state transitions to {*}__transaction_state{*}. Existing metrics 
> do not isolate the transaction state log append path, which makes diagnosis 
> harder.
>  
> A dedicated metric for transaction state log append latency would improve 
> operability by making it easier to:
>  * identify when transaction latency is driven by the transaction state topic 
> write path
>  * correlate transaction slowdowns with storage, ISR, or leader movement 
> issues affecting __transaction_state
>  * separate transaction state write-path issues from higher-level request 
> latency
>  * reduce time to diagnosis when transaction-related latency regresses
>
> There is also a similar precedent in the {*}Share Coordinator{*}, which 
> already exposes *write-latency* metrics for state writes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
