[
https://issues.apache.org/jira/browse/KAFKA-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stanislav Kozlovski resolved KAFKA-15147.
-----------------------------------------
Resolution: Fixed
> Measure pending and outstanding Remote Segment operations
> ---------------------------------------------------------
>
> Key: KAFKA-15147
> URL: https://issues.apache.org/jira/browse/KAFKA-15147
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Reporter: Jorge Esteban Quilcate Otoya
> Assignee: Christo Lolov
> Priority: Major
> Labels: tiered-storage
> Fix For: 3.7.0
>
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Upload+and+delete+lag+metrics+in+Tiered+Storage
>
> KAFKA-15833: RemoteCopyLagBytes
> KAFKA-16002: RemoteCopyLagSegments, RemoteDeleteLagBytes,
> RemoteDeleteLagSegments
> KAFKA-16013: ExpiresPerSec
> KAFKA-16014: RemoteLogSizeComputationTime, RemoteLogSizeBytes,
> RemoteLogMetadataCount
> KAFKA-15158: RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec,
> BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec
> ====
> Remote Log Segment operations (copy/delete) are executed by the Remote
> Storage Manager and recorded by the Remote Log Metadata Manager (e.g. the
> default TopicBasedRLMM writes state changes on remote log segments to an
> internal Kafka topic).
> As operations run, fail, and retry, it will be important to know how many
> are pending and outstanding over time so that operators can be alerted.
> The pending count alone is not enough to alert on, as it can oscillate close
> to zero. An additional condition (running time > threshold) needs to apply
> for an operation to be considered outstanding.
> Proposal:
> RemoteLogManager could be extended with 2 concurrent maps
> (pendingSegmentCopies, pendingSegmentDeletes) of type `Map[Uuid, Long]` that
> record, per segmentId, the time when the operation started, and based on
> these expose 2 metrics per operation:
> * pendingSegmentCopies: gauge of the size of the pendingSegmentCopies map
> * outstandingSegmentCopies: loop over pending ops, and if
> now - startedTime > timeout, then outstanding++ (maybe on debug level?)
> Is this a valuable metric to add to Tiered Storage, or is it better solved in
> a custom RLMM implementation?
> Also, does it require a KIP?
> Thanks!
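
The pending/outstanding idea in the quoted proposal could be sketched roughly as follows. This is only an illustration of the original proposal, not the KIP-963 lag metrics that were implemented under the sub-tasks listed above. The class `PendingSegmentOps` and its methods are hypothetical names, and `java.util.UUID` is assumed as a stand-in for Kafka's `Uuid` so that the example is self-contained.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Kafka: tracks when each segment operation started.
// java.util.UUID stands in for Kafka's Uuid type to keep the sketch self-contained.
final class PendingSegmentOps {
    // segmentId -> start time in milliseconds
    private final Map<UUID, Long> startedAtMs = new ConcurrentHashMap<>();

    // Record the start of an operation; a retry keeps the original start time.
    void start(UUID segmentId, long nowMs) {
        startedAtMs.putIfAbsent(segmentId, nowMs);
    }

    // Remove the operation once it has completed.
    void finish(UUID segmentId) {
        startedAtMs.remove(segmentId);
    }

    // Value for a "pending" gauge: number of operations currently in flight.
    int pendingCount() {
        return startedAtMs.size();
    }

    // Value for an "outstanding" gauge: pending operations that have been
    // running for longer than thresholdMs.
    long outstandingCount(long nowMs, long thresholdMs) {
        return startedAtMs.values().stream()
                .filter(started -> nowMs - started > thresholdMs)
                .count();
    }
}
```

One instance per operation type (copy, delete) would mirror the two maps in the proposal. Keeping the first start time on retry is a design assumption here, chosen so that an operation that keeps failing eventually counts as outstanding.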