Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Jorge Esteban Quilcate Otoya Fri, 27 Oct 2023 07:24:11 -0700

Hi Christo,

Thanks for proposing KIP, this metrics will certainly be useful to operate
Kafka Tiered Storage as it becomes production-ready.


1. Given that the scope of the KIPs has grown to cover more metrics, what
do you think about introducing latency metrics for RSM operations?
Copy and delete time metrics are quite obvious/simple on what they
represent; but fetch latency metrics would be helpful as remote fetching
clients directly. e.g. having a "time to first byte" metric could help to
know how much time is introduced by the remote tier to start serving
results to the consumer, or measuring how long it takes to return a
response to consumers.

2. Couple of requests/nits on the metrics table, could you:
- highlight the names (or have them on a separate column, as you prefer) to
make it easier to read? If you choose to have another column, maybe sort
them as "Name, Description, MBean" and adjust the width.
- group the related metrics in separate groups, e.g. Lag, Remote Delete,
Remote Log Aux State, Remote Log Size; so we can elaborate on why these set
of metrics are needed. Maybe adding some examples on usage and how
actionable they are as the ones shared in previous emails would be useful
to keep as part of the KIP.

3. On Lag metrics:
3.1 I would suggest the following renames:
- TotalRemoteRecordsLag -> RemoteCopyLagRecords
- TotalRemoteBytesLag -> RemoteCopyLagBytes
- DeleteRemoteLag -> RemoteDeleteLagRecords
3.2. I agree with Kamal that having a lag based on the number of segments
would be useful to include. Segments could give a faster proxy to
understand whether the lag is meaningful or not. e.g. if the number of
records and bytes are high, but the segment lag is only small (e.g. 1), it
may be ok; but if the number of segments is high, then it can be more
relevant to operators.
3.3. Could we consider having the same metrics for Delete Lag as there are
for Copy Lag? i.e. including RemoteDeleteLagBytes, and (potentially)
RemoteDeleteLag for segments.
3.4. The description of delete lag is unclear to me: I though it was about
the remote segments to be deleted (because of total retention) but not
deleted yet; however from description it seems that it's related to local
segments that are marked for deletion. Is this correct?

4. On Remote Delete metrics:
- Could we also include bytes-based metric as with Copy and Fetch? t would
be useful to know how many bytes are being deleted. If aggregated and
compared with copied bytes, we can get a sense of the amount of data stored
remotely, even if not exact (only considers segment size, not indexes)

5. On RemoteLogAuxState metrics: could you elaborate a bit more on the
purpose of this component and why the metrics proposed are needed?

6. On Total Remote Log Size metrics: similarly, could you elaborate on why
this metric is needed? I'm missing what makes this operation as relevant
(compared to other internals) to have some metrics attached -- maybe if you
could shared scenarios where this metrics would be useful would be helpful.

7. On the metrics naming: not sure the `Total*` prefix is really needed or
adds meaning. When I found it useful is when there are related metric that
are a subset, then the total prefix helps: e.g.
`TotalProduceRequestsPerSec` and `FailedProduceRequestsPerSec`

Cheers,
Jorge.


On Tue, 24 Oct 2023 at 12:24, Christo Lolov <[email protected]> wrote:

> Hello all,
>
> Now that 3.6 has been released, I would like to bring back attention to the
> following KIP for adding metrics to tiered storage targeting 3.7 -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Add+more+metrics+to+Tiered+Storage
> .
>
> Let me know your thoughts about the list of metrics and their granularity!
>
> Best,
> Christo
>
> On Fri, 13 Oct 2023 at 10:14, Christo Lolov <[email protected]>
> wrote:
>
> > Heya Gantigmaa,
> >
> > Apologies for the (very) late reply!
> >
> > Now that 3.6 has been released and reviewers have a bit more time I will
> > be picking up this KIP again. I am more than happy to add useful new
> > metrics to the KIP, I would just ask for a couple of days to review your
> > pull request and I will come back to you.
> >
> > Best,
> > Christo
> >
> > On Mon, 25 Sept 2023 at 10:49, Gantigmaa Selenge <[email protected]>
> > wrote:
> >
> >> Hi Christo,
> >>
> >> Thank you for writing the KIP.
> >>
> >> I recently raised a PR to add metrics for tracking remote segment
> >> deletions
> >> (https://github.com/apache/kafka/pull/14375) but realised those metrics
> >> were not mentioned in the original KIP-405 or KIP-930. Do you think
> these
> >> would make sense to be added to this KIP and get included in the
> >> discussion?
> >>
> >> Regards,
> >> Gantigmaa
> >>
> >> On Wed, Aug 9, 2023 at 1:53 PM Christo Lolov <[email protected]>
> >> wrote:
> >>
> >> > Heya Kamal,
> >> >
> >> > Thank you for going through the KIP and for the question!
> >> >
> >> > I have been thinking about this and as an operator I might find it the
> >> most
> >> > useful to know all three of them actually.
> >> >
> >> > I would find knowing the size in bytes useful to determine how much
> >> disk I
> >> > might need to add temporarily to compensate for the slowdown.
> >> > I would find knowing the number of records useful, because using the
> >> > MessagesInPerSec metric I would be able to determine how old the
> records
> >> > which are facing problems are.
> >> > I would find knowing the number of segments useful because I would be
> >> able
> >> > to correlate this with whether I need to change
> >> > *remote.log.manager.task.interval.ms
> >> > <http://remote.log.manager.task.interval.ms> *to a lower or higher
> >> value.
> >> >
> >> > What are your thoughts on the above? Would you find some of them more
> >> > useful than others?
> >> >
> >> > Best,
> >> > Christo
> >> >
> >> > On Tue, 8 Aug 2023 at 16:43, Kamal Chandraprakash <
> >> > [email protected]> wrote:
> >> >
> >> > > Hi Christo,
> >> > >
> >> > > Thanks for the KIP!
> >> > >
> >> > > The proposed tiered storage metrics are useful. The unit mentioned
> in
> >> the
> >> > > KIP is the number of records.
> >> > > Each topic can have varying amounts of records in a segment
> depending
> >> on
> >> > > the record size.
> >> > >
> >> > > Do you think having the tier-lag by number of segments (or) size of
> >> > > segments in bytes will be useful
> >> > > to the operator?
> >> > >
> >> > > Thanks,
> >> > > Kamal
> >> > >
> >> > > On Tue, Aug 8, 2023 at 8:56 PM Christo Lolov <
> [email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hello all!
> >> > > >
> >> > > > I would like to start a discussion for KIP-963: Upload and delete
> >> lag
> >> > > > metrics in Tiered Storage (
> >> > https://cwiki.apache.org/confluence/x/sZGzDw
> >> > > ).
> >> > > >
> >> > > > The purpose of this KIP is to introduce a couple of metrics to
> track
> >> > lag
> >> > > > with respect to remote storage from the point of view of Kafka.
> >> > > >
> >> > > > Thanks in advance for leaving a review!
> >> > > >
> >> > > > Best,
> >> > > > Christo
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Reply via email to