[ 
https://issues.apache.org/jira/browse/KAFKA-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen updated KAFKA-15147:
------------------------------
    Description: 



KAFKA-15833: RemoteCopyLagBytes 

KAFKA-16002: RemoteCopyLagSegments, RemoteDeleteLagBytes, 
RemoteDeleteLagSegments

KAFKA-16013: ExpiresPerSec

KAFKA-16014: RemoteLogSizeComputationTime, RemoteLogSizeBytes, 
RemoteLogMetadataCount

KAFKA-15158: RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec, 
BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec

====

Remote Log Segment operations (copy/delete) are executed by the Remote Storage 
Manager, and recorded by Remote Log Metadata Manager (e.g. default 
TopicBasedRLMM writes to the internal Kafka topic state changes on remote log 
segments).

As executions run, fail, and retry, it will be important to know how many 
operations are pending and outstanding over time, so that operators can be 
alerted.

Pending operations alone are not enough to alert on, as their count can 
oscillate close to zero. An additional condition (running time > threshold) 
needs to apply to consider an operation outstanding.

Proposal:

RemoteLogManager could be extended with 2 concurrent maps 
(pendingSegmentCopies, pendingSegmentDeletes) of type `Map[Uuid, Long]` that 
record, for each segmentId, the time when the operation started. Based on 
these, expose 2 metrics per operation:
 * pendingSegmentCopies: gauge of the size of the pendingSegmentCopies map
 * outstandingSegmentCopies: loop over pending ops, and if now - startedTime > 
timeout, then outstanding++ (maybe on debug level?)
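
A minimal sketch of the idea above, for one operation type, might look like 
the following. It is only an illustration under stated assumptions: the class 
name PendingSegmentOps, its methods, the injected clock, and the use of 
java.util.UUID in place of Kafka's Uuid are all hypothetical and not existing 
RemoteLogManager code.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical helper (not part of Kafka): tracks in-flight segment operations.
final class PendingSegmentOps {
    // segmentId -> start time in ms. java.util.UUID stands in for Kafka's Uuid.
    private final Map<UUID, Long> startedAtMs = new ConcurrentHashMap<>();
    private final LongSupplier clockMs; // injected so the time source can be faked
    private final long timeoutMs;       // threshold after which an op is "outstanding"

    PendingSegmentOps(LongSupplier clockMs, long timeoutMs) {
        this.clockMs = clockMs;
        this.timeoutMs = timeoutMs;
    }

    // Call when a copy (or delete) of the segment starts.
    void started(UUID segmentId) {
        startedAtMs.put(segmentId, clockMs.getAsLong());
    }

    // Call when the operation completes, successfully or not.
    void finished(UUID segmentId) {
        startedAtMs.remove(segmentId);
    }

    // Gauge value: number of operations currently in flight.
    int pending() {
        return startedAtMs.size();
    }

    // Gauge value: in-flight operations with now - startedTime > timeout.
    long outstanding() {
        long now = clockMs.getAsLong();
        return startedAtMs.values().stream()
                .filter(startedAt -> now - startedAt > timeoutMs)
                .count();
    }
}
```

One instance per operation type (copies, deletes) would back the two gauges. 
Computing outstanding() on read keeps the write path to a single map put or 
remove, at the cost of a scan over pending entries each time the gauge is 
polled.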

Is this a valuable metric to add to Tiered Storage, or is it better solved in 
a custom RLMM implementation?

Also, does it require a KIP?

Thanks!

  was:
Remote Log Segment operations (copy/delete) are executed by the Remote Storage 
Manager, and recorded by Remote Log Metadata Manager (e.g. default 
TopicBasedRLMM writes to the internal Kafka topic state changes on remote log 
segments).

As executions run, fail, and retry, it will be important to know how many 
operations are pending and outstanding over time, so that operators can be 
alerted.

Pending operations alone are not enough to alert on, as their count can 
oscillate close to zero. An additional condition (running time > threshold) 
needs to apply to consider an operation outstanding.

Proposal:

RemoteLogManager could be extended with 2 concurrent maps 
(pendingSegmentCopies, pendingSegmentDeletes) of type `Map[Uuid, Long]` that 
record, for each segmentId, the time when the operation started. Based on 
these, expose 2 metrics per operation:
 * pendingSegmentCopies: gauge of the size of the pendingSegmentCopies map
 * outstandingSegmentCopies: loop over pending ops, and if now - startedTime > 
timeout, then outstanding++ (maybe on debug level?)

Is this a valuable metric to add to Tiered Storage, or is it better solved in 
a custom RLMM implementation?

Also, does it require a KIP?

Thanks!


> Measure pending and outstanding Remote Segment operations
> ---------------------------------------------------------
>
>                 Key: KAFKA-15147
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15147
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Jorge Esteban Quilcate Otoya
>            Assignee: Christo Lolov
>            Priority: Major
>              Labels: tiered-storage
>             Fix For: 3.7.0
>
>
> KAFKA-15833: RemoteCopyLagBytes 
> KAFKA-16002: RemoteCopyLagSegments, RemoteDeleteLagBytes, 
> RemoteDeleteLagSegments
> KAFKA-16013: ExpiresPerSec
> KAFKA-16014: RemoteLogSizeComputationTime, RemoteLogSizeBytes, 
> RemoteLogMetadataCount
> KAFKA-15158: RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec, 
> BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec
> ====
> Remote Log Segment operations (copy/delete) are executed by the Remote 
> Storage Manager, and recorded by Remote Log Metadata Manager (e.g. default 
> TopicBasedRLMM writes to the internal Kafka topic state changes on remote log 
> segments).
> As executions run, fail, and retry, it will be important to know how many 
> operations are pending and outstanding over time, so that operators can be 
> alerted.
> Pending operations alone are not enough to alert on, as their count can 
> oscillate close to zero. An additional condition (running time > threshold) 
> needs to apply to consider an operation outstanding.
> Proposal:
> RemoteLogManager could be extended with 2 concurrent maps 
> (pendingSegmentCopies, pendingSegmentDeletes) of type `Map[Uuid, Long]` that 
> record, for each segmentId, the time when the operation started. Based on 
> these, expose 2 metrics per operation:
>  * pendingSegmentCopies: gauge of the size of the pendingSegmentCopies map
>  * outstandingSegmentCopies: loop over pending ops, and if 
> now - startedTime > timeout, then outstanding++ (maybe on debug level?)
> Is this a valuable metric to add to Tiered Storage, or is it better solved in 
> a custom RLMM implementation?
> Also, does it require a KIP?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
