Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Christo Lolov Mon, 30 Oct 2023 07:07:04 -0700

Heya Jorge,

Thank you for the insightful comments!


1. I see a value in such latency metrics but in my opinion the correct
location for such metrics is in the plugins providing the underlying
functionality. What are your thoughts on the matter?

2. Okay, I will look for and adjust the formatting today/tomorrow!

3.1 Done.
3.2 Sure, I will add this to the KIP later today, the suggestion makes
sense to me. However, my question is, would you still find value in
emitting metrics for all three i.e. RemoteCopyLagRecords,
RemoteCopyLagBytes and RemoteCopyLagSegments or would you only keep
RemoteCopyLagBytes and RemoteCopyLagSegments?
3.3. Yes, RemoteDeleteLagRecords was supposed to be an equivalent of
RemoteCopyLagRecords. Once I have your opinion on 3.2 I will make the
respective changes.
3.4. I envision these metrics to be added to Kafka rather than the plugins.
Today Kafka sends deletes to remote storage but does not know whether those
segments have been deleted immediately when the request has been sent or
have been given to a background process to carry out the actual reclamation
of space. The purpose of this metric is to give an estimate in time which
says "hey, we have called this many segments or bytes to be deleted".

4. I believe this goes down the same line of thinking as what you mentioned
in 3.3 - have I misunderstood something?

5. I have on a number of occasions found I do not have a metric to quickly
point me to what part of tiered storage functionality is experiencing an
issue, in some scenarios a follower failing to build an auxiliary state. An
increase in the number of BuildRemoteLogAuxState requests per second can
uncover problems for specific topics warranting a further investigation,
which I tend to find difficult to judge purely based on parsing log
statements. An increase in the number of errors can quickly zone in on
followers failing as part of tiered storage and point me to look in the
logs specifically for that component.

6. I find it useful start my investigations with respect to tiering
problems by checking the rough size distribution of topics in remote. From
then on I try to correlate whether a historically high-volume topic started
experiencing a decrease in volume due to a decrease in produce traffic to
that topic or due to an increase in lag on local storage due to the broker
slowing down for whatever reason. Besides correlation I would use such a
metric to also confirm whether my rate calculations are correct i.e. if
topic A receives X MB/s and rolls a segment every Y seconds with an upload
rate of Z MB/s do I see that much data actually being written in remote
storage. Do these two scenarios demonstrate the usefulness I would have
from such a metric and do the benefits make sense to you?

7. I agree. I have changed TotalRemoteLogSizeComputationTime,
TotalRemoteLogSizeBytes, and TotalRemoteLogMetadataCount to
RemoteLogSizeComputationTime, RemoteLogSizeBytes and RemoteLogMetadataCount
respectively.

On Fri, 27 Oct 2023 at 15:24, Jorge Esteban Quilcate Otoya <
[email protected]> wrote:

> Hi Christo,
>
> Thanks for proposing KIP, this metrics will certainly be useful to operate
> Kafka Tiered Storage as it becomes production-ready.
>
> 1. Given that the scope of the KIPs has grown to cover more metrics, what
> do you think about introducing latency metrics for RSM operations?
> Copy and delete time metrics are quite obvious/simple on what they
> represent; but fetch latency metrics would be helpful as remote fetching
> clients directly. e.g. having a "time to first byte" metric could help to
> know how much time is introduced by the remote tier to start serving
> results to the consumer, or measuring how long it takes to return a
> response to consumers.
>
> 2. Couple of requests/nits on the metrics table, could you:
> - highlight the names (or have them on a separate column, as you prefer) to
> make it easier to read? If you choose to have another column, maybe sort
> them as "Name, Description, MBean" and adjust the width.
> - group the related metrics in separate groups, e.g. Lag, Remote Delete,
> Remote Log Aux State, Remote Log Size; so we can elaborate on why these set
> of metrics are needed. Maybe adding some examples on usage and how
> actionable they are as the ones shared in previous emails would be useful
> to keep as part of the KIP.
>
> 3. On Lag metrics:
> 3.1 I would suggest the following renames:
> - TotalRemoteRecordsLag -> RemoteCopyLagRecords
> - TotalRemoteBytesLag -> RemoteCopyLagBytes
> - DeleteRemoteLag -> RemoteDeleteLagRecords
> 3.2. I agree with Kamal that having a lag based on the number of segments
> would be useful to include. Segments could give a faster proxy to
> understand whether the lag is meaningful or not. e.g. if the number of
> records and bytes are high, but the segment lag is only small (e.g. 1), it
> may be ok; but if the number of segments is high, then it can be more
> relevant to operators.
> 3.3. Could we consider having the same metrics for Delete Lag as there are
> for Copy Lag? i.e. including RemoteDeleteLagBytes, and (potentially)
> RemoteDeleteLag for segments.
> 3.4. The description of delete lag is unclear to me: I though it was about
> the remote segments to be deleted (because of total retention) but not
> deleted yet; however from description it seems that it's related to local
> segments that are marked for deletion. Is this correct?
>
> 4. On Remote Delete metrics:
> - Could we also include bytes-based metric as with Copy and Fetch? t would
> be useful to know how many bytes are being deleted. If aggregated and
> compared with copied bytes, we can get a sense of the amount of data stored
> remotely, even if not exact (only considers segment size, not indexes)
>
> 5. On RemoteLogAuxState metrics: could you elaborate a bit more on the
> purpose of this component and why the metrics proposed are needed?
>
> 6. On Total Remote Log Size metrics: similarly, could you elaborate on why
> this metric is needed? I'm missing what makes this operation as relevant
> (compared to other internals) to have some metrics attached -- maybe if you
> could shared scenarios where this metrics would be useful would be helpful.
>
> 7. On the metrics naming: not sure the `Total*` prefix is really needed or
> adds meaning. When I found it useful is when there are related metric that
> are a subset, then the total prefix helps: e.g.
> `TotalProduceRequestsPerSec` and `FailedProduceRequestsPerSec`
>
> Cheers,
> Jorge.
>
>
> On Tue, 24 Oct 2023 at 12:24, Christo Lolov <[email protected]>
> wrote:
>
> > Hello all,
> >
> > Now that 3.6 has been released, I would like to bring back attention to
> the
> > following KIP for adding metrics to tiered storage targeting 3.7 -
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Add+more+metrics+to+Tiered+Storage
> > .
> >
> > Let me know your thoughts about the list of metrics and their
> granularity!
> >
> > Best,
> > Christo
> >
> > On Fri, 13 Oct 2023 at 10:14, Christo Lolov <[email protected]>
> > wrote:
> >
> > > Heya Gantigmaa,
> > >
> > > Apologies for the (very) late reply!
> > >
> > > Now that 3.6 has been released and reviewers have a bit more time I
> will
> > > be picking up this KIP again. I am more than happy to add useful new
> > > metrics to the KIP, I would just ask for a couple of days to review
> your
> > > pull request and I will come back to you.
> > >
> > > Best,
> > > Christo
> > >
> > > On Mon, 25 Sept 2023 at 10:49, Gantigmaa Selenge <[email protected]>
> > > wrote:
> > >
> > >> Hi Christo,
> > >>
> > >> Thank you for writing the KIP.
> > >>
> > >> I recently raised a PR to add metrics for tracking remote segment
> > >> deletions
> > >> (https://github.com/apache/kafka/pull/14375) but realised those
> metrics
> > >> were not mentioned in the original KIP-405 or KIP-930. Do you think
> > these
> > >> would make sense to be added to this KIP and get included in the
> > >> discussion?
> > >>
> > >> Regards,
> > >> Gantigmaa
> > >>
> > >> On Wed, Aug 9, 2023 at 1:53 PM Christo Lolov <[email protected]>
> > >> wrote:
> > >>
> > >> > Heya Kamal,
> > >> >
> > >> > Thank you for going through the KIP and for the question!
> > >> >
> > >> > I have been thinking about this and as an operator I might find it
> the
> > >> most
> > >> > useful to know all three of them actually.
> > >> >
> > >> > I would find knowing the size in bytes useful to determine how much
> > >> disk I
> > >> > might need to add temporarily to compensate for the slowdown.
> > >> > I would find knowing the number of records useful, because using the
> > >> > MessagesInPerSec metric I would be able to determine how old the
> > records
> > >> > which are facing problems are.
> > >> > I would find knowing the number of segments useful because I would
> be
> > >> able
> > >> > to correlate this with whether I need to change
> > >> > *remote.log.manager.task.interval.ms
> > >> > <http://remote.log.manager.task.interval.ms> *to a lower or higher
> > >> value.
> > >> >
> > >> > What are your thoughts on the above? Would you find some of them
> more
> > >> > useful than others?
> > >> >
> > >> > Best,
> > >> > Christo
> > >> >
> > >> > On Tue, 8 Aug 2023 at 16:43, Kamal Chandraprakash <
> > >> > [email protected]> wrote:
> > >> >
> > >> > > Hi Christo,
> > >> > >
> > >> > > Thanks for the KIP!
> > >> > >
> > >> > > The proposed tiered storage metrics are useful. The unit mentioned
> > in
> > >> the
> > >> > > KIP is the number of records.
> > >> > > Each topic can have varying amounts of records in a segment
> > depending
> > >> on
> > >> > > the record size.
> > >> > >
> > >> > > Do you think having the tier-lag by number of segments (or) size
> of
> > >> > > segments in bytes will be useful
> > >> > > to the operator?
> > >> > >
> > >> > > Thanks,
> > >> > > Kamal
> > >> > >
> > >> > > On Tue, Aug 8, 2023 at 8:56 PM Christo Lolov <
> > [email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > Hello all!
> > >> > > >
> > >> > > > I would like to start a discussion for KIP-963: Upload and
> delete
> > >> lag
> > >> > > > metrics in Tiered Storage (
> > >> > https://cwiki.apache.org/confluence/x/sZGzDw
> > >> > > ).
> > >> > > >
> > >> > > > The purpose of this KIP is to introduce a couple of metrics to
> > track
> > >> > lag
> > >> > > > with respect to remote storage from the point of view of Kafka.
> > >> > > >
> > >> > > > Thanks in advance for leaving a review!
> > >> > > >
> > >> > > > Best,
> > >> > > > Christo
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Reply via email to