[
https://issues.apache.org/jira/browse/KAFKA-20418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079525#comment-18079525
]
sanghyeok An commented on KAFKA-20418:
--------------------------------------
Hi, [~jolshan],
Sorry for the sudden mention 🙇‍♂️
As I mentioned in KAFKA-20407, I am preparing a KIP for adding transaction
state log append latency metrics.
While preparing the KIP, I am also considering adding KAFKA-20418 to its
scope, since it seems to address the same transaction coordinator
observability problem.
If that sounds reasonable, I would like to bring this up in the KIP discussion
thread and discuss it with folks there as well.
What do you think? Please let me know your thoughts!
> Consider adding metrics for pending transaction markers and oldest
> transaction age
> ----------------------------------------------------------------------------------
>
> Key: KAFKA-20418
> URL: https://issues.apache.org/jira/browse/KAFKA-20418
> Project: Kafka
> Issue Type: Improvement
> Reporter: sanghyeok An
> Assignee: sanghyeok An
> Priority: Minor
> Labels: transaction
>
> When transaction handling becomes slow, it is difficult to tell whether the
> delay is coming from the transaction state log append path, the post-EndTxn
> marker completion path, or transactions remaining in coordinator state longer
> than expected.
> The broker already exposes some transaction-related metrics, but it is still
> hard to answer questions such as:
> * how many transactions are currently waiting for marker completion
> * whether pending marker backlog is growing or aging
> * whether transactions are staying in a given state for unusually long
> periods
> Adding a small set of metrics in this area could improve operability by
> making it easier to identify transaction backlog and long-lived transactions
> in the coordinator. Suggested metrics:
> * pending-marker-count
> * pending-marker-oldest-age-ms
> * oldest-transaction-age-ms{state}
> These metrics could be useful in scenarios such as:
> * transaction completion appears slow even though request handling itself is
> not obviously delayed
> * marker propagation is backed up due to inter-broker issues or broker-side
> delays
> * some transactions remain in ONGOING or PREPARE_* states for much longer
> than expected
> * operators need to distinguish transaction state append issues from marker
> completion issues or long-lived transaction state
>
> There is already internal transaction and pending marker state tracking, so
> exposing related metrics may be feasible and useful for broker operations.
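
As a rough illustration only, the three suggested gauges could be derived from per-transaction timestamps along the lines of the Java sketch below. Every name in it (TxnBacklogSketch, the State enum, the on* callbacks) is hypothetical and is not taken from the Kafka codebase. It also assumes that "age" means the time since a transaction entered its current state, and that an empty set reports an age of 0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are Kafka's actual classes or APIs.
class TxnBacklogSketch {
    // Assumed subset of coordinator states, for illustration only.
    enum State { ONGOING, PREPARE_COMMIT, PREPARE_ABORT }

    private static final class Entry {
        final State state;
        final long enteredMs; // when the transaction entered this state
        Entry(State state, long enteredMs) {
            this.state = state;
            this.enteredMs = enteredMs;
        }
    }

    // transactionalId -> current state and the time it was entered
    private final Map<String, Entry> txns = new HashMap<>();
    // transactionalId -> time its markers were first enqueued
    private final Map<String, Long> pendingMarkers = new HashMap<>();

    void onStateChange(String txnId, State state, long nowMs) {
        txns.put(txnId, new Entry(state, nowMs));
    }

    void onMarkersEnqueued(String txnId, long nowMs) {
        pendingMarkers.putIfAbsent(txnId, nowMs);
    }

    void onComplete(String txnId) {
        txns.remove(txnId);
        pendingMarkers.remove(txnId);
    }

    // pending-marker-count
    int pendingMarkerCount() {
        return pendingMarkers.size();
    }

    // pending-marker-oldest-age-ms (0 when nothing is pending)
    long pendingMarkerOldestAgeMs(long nowMs) {
        long oldest = nowMs;
        for (long enqueuedMs : pendingMarkers.values()) {
            oldest = Math.min(oldest, enqueuedMs);
        }
        return nowMs - oldest;
    }

    // oldest-transaction-age-ms{state} (0 when no transaction is in that state)
    long oldestTransactionAgeMs(State state, long nowMs) {
        long oldest = nowMs;
        for (Entry e : txns.values()) {
            if (e.state == state) {
                oldest = Math.min(oldest, e.enteredMs);
            }
        }
        return nowMs - oldest;
    }
}
```

Passing the clock value in explicitly keeps the sketch deterministic. A real implementation would more likely read the coordinator's existing state under its own locking and register these values as gauges.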
--
This message was sent by Atlassian Jira
(v8.20.10#820010)