Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Luke Chen Thu, 09 Nov 2023 00:51:42 -0800

Hi Christo,

Thanks for the KIP!


Some comments:
1. I agree with Kamal that a metric to cover the time taken to read data
from remote storage is helpful.

2. I can see there are some metrics are only on topic level, but some are
on partition level.
Could you explain why some of them are only on topic level?
Like RemoteLogSizeComputationTime, it's different from partition to
partition, will it be better to be exposed as partition metric?

3. `RemoteLogSizeBytes` metric hanging.
To compute the RemoteLogSizeBytes, we need to wait until all records in the
metadata topic loaded.
What will happen if it takes long to load the data from metadata topic?
Should we instead return -1 or something to indicate it's still loading?

Thanks.
Luke

On Fri, Nov 3, 2023 at 1:53 AM Kamal Chandraprakash <
[email protected]> wrote:

> Hi Christo,
>
> Thanks for expanding the scope of the KIP!  We should also cover the time
> taken to
> read data from remote storage. This will give our users a fair idea about
> the P99, P95,
> and P50 Fetch latency to read data from remote storage.
>
> The Fetch API request metrics currently provides a breakdown of the time
> spent on each item:
>
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L517
> Should we also provide `RemoteStorageTimeMs` item (only for FETCH API) so
> that users can
> understand the overall and per-step time taken?
>
> Regarding the Remote deletion metrics, should we also emit a metric to
> expose the oldest segment time?
> Users can configure the topic retention either by size (or) time. If time
> is configured, then emitting
> the oldest segment time allows the user to configure an alert on top of it
> and act accordingly.
>
> On Wed, Nov 1, 2023 at 7:07 PM Jorge Esteban Quilcate Otoya <
> [email protected]> wrote:
>
> > Thanks, Christo!
> >
> > 1. Agree. Having a further look into how many latency metrics are
> included
> > on the broker side there are only a few of them (e.g. request lifecycle)
> —
> > but seems mostly delegated to clients, or plugin in this case, to measure
> > this.
> >
> > 3.2. Personally, I find the record-based lag less useful as records can't
> > be relied as a stable unit of measure. So, if we can keep bytes- and
> > segment-based lag, LGTM.
> > 3.4.  Agree, these metrics should be on the broker side. Though if plugin
> > decides to take deletion as a background process, then it should have
> it's
> > own metrics. That's why I was thinking the calculation should be fairly
> > similar to the CopyLag: "these segments are available for deletion but
> > haven't been deleted yet"
> > 3.5. For lag metrics: could we add an explanation on how each lag will be
> > calculated, e.g. using which values, from which components, under which
> > circumstances do we expect these values to increase/decrease, etc. This
> > will clarify 3.4. and make it easier to agree and eventually test.
> >
> > 4. Sorry I wasn't clear. I meant similar to `RemoteCopyBytesPerSec` and
> > `RemoteFetchBytesPerSec`, we could consider to include
> > `RemoteDeleteBytesPerSec`.
> >
> > 5. and 6. Thanks for the explanation! It surely benefits to have these as
> > part of the set of metrics.
> >
> > Cheers,
> > Jorge.
> >
> > On Mon, 30 Oct 2023 at 16:07, Christo Lolov <[email protected]>
> > wrote:
> >
> > > Heya Jorge,
> > >
> > > Thank you for the insightful comments!
> > >
> > > 1. I see a value in such latency metrics but in my opinion the correct
> > > location for such metrics is in the plugins providing the underlying
> > > functionality. What are your thoughts on the matter?
> > >
> > > 2. Okay, I will look for and adjust the formatting today/tomorrow!
> > >
> > > 3.1 Done.
> > > 3.2 Sure, I will add this to the KIP later today, the suggestion makes
> > > sense to me. However, my question is, would you still find value in
> > > emitting metrics for all three i.e. RemoteCopyLagRecords,
> > > RemoteCopyLagBytes and RemoteCopyLagSegments or would you only keep
> > > RemoteCopyLagBytes and RemoteCopyLagSegments?
> > > 3.3. Yes, RemoteDeleteLagRecords was supposed to be an equivalent of
> > > RemoteCopyLagRecords. Once I have your opinion on 3.2 I will make the
> > > respective changes.
> > > 3.4. I envision these metrics to be added to Kafka rather than the
> > plugins.
> > > Today Kafka sends deletes to remote storage but does not know whether
> > those
> > > segments have been deleted immediately when the request has been sent
> or
> > > have been given to a background process to carry out the actual
> > reclamation
> > > of space. The purpose of this metric is to give an estimate in time
> which
> > > says "hey, we have called this many segments or bytes to be deleted".
> > >
> > > 4. I believe this goes down the same line of thinking as what you
> > mentioned
> > > in 3.3 - have I misunderstood something?
> > >
> > > 5. I have on a number of occasions found I do not have a metric to
> > quickly
> > > point me to what part of tiered storage functionality is experiencing
> an
> > > issue, in some scenarios a follower failing to build an auxiliary
> state.
> > An
> > > increase in the number of BuildRemoteLogAuxState requests per second
> can
> > > uncover problems for specific topics warranting a further
> investigation,
> > > which I tend to find difficult to judge purely based on parsing log
> > > statements. An increase in the number of errors can quickly zone in on
> > > followers failing as part of tiered storage and point me to look in the
> > > logs specifically for that component.
> > >
> > > 6. I find it useful start my investigations with respect to tiering
> > > problems by checking the rough size distribution of topics in remote.
> > From
> > > then on I try to correlate whether a historically high-volume topic
> > started
> > > experiencing a decrease in volume due to a decrease in produce traffic
> to
> > > that topic or due to an increase in lag on local storage due to the
> > broker
> > > slowing down for whatever reason. Besides correlation I would use such
> a
> > > metric to also confirm whether my rate calculations are correct i.e. if
> > > topic A receives X MB/s and rolls a segment every Y seconds with an
> > upload
> > > rate of Z MB/s do I see that much data actually being written in remote
> > > storage. Do these two scenarios demonstrate the usefulness I would have
> > > from such a metric and do the benefits make sense to you?
> > >
> > > 7. I agree. I have changed TotalRemoteLogSizeComputationTime,
> > > TotalRemoteLogSizeBytes, and TotalRemoteLogMetadataCount to
> > > RemoteLogSizeComputationTime, RemoteLogSizeBytes and
> > RemoteLogMetadataCount
> > > respectively.
> > >
> > > On Fri, 27 Oct 2023 at 15:24, Jorge Esteban Quilcate Otoya <
> > > [email protected]> wrote:
> > >
> > > > Hi Christo,
> > > >
> > > > Thanks for proposing KIP, this metrics will certainly be useful to
> > > operate
> > > > Kafka Tiered Storage as it becomes production-ready.
> > > >
> > > > 1. Given that the scope of the KIPs has grown to cover more metrics,
> > what
> > > > do you think about introducing latency metrics for RSM operations?
> > > > Copy and delete time metrics are quite obvious/simple on what they
> > > > represent; but fetch latency metrics would be helpful as remote
> > fetching
> > > > clients directly. e.g. having a "time to first byte" metric could
> help
> > to
> > > > know how much time is introduced by the remote tier to start serving
> > > > results to the consumer, or measuring how long it takes to return a
> > > > response to consumers.
> > > >
> > > > 2. Couple of requests/nits on the metrics table, could you:
> > > > - highlight the names (or have them on a separate column, as you
> > prefer)
> > > to
> > > > make it easier to read? If you choose to have another column, maybe
> > sort
> > > > them as "Name, Description, MBean" and adjust the width.
> > > > - group the related metrics in separate groups, e.g. Lag, Remote
> > Delete,
> > > > Remote Log Aux State, Remote Log Size; so we can elaborate on why
> these
> > > set
> > > > of metrics are needed. Maybe adding some examples on usage and how
> > > > actionable they are as the ones shared in previous emails would be
> > useful
> > > > to keep as part of the KIP.
> > > >
> > > > 3. On Lag metrics:
> > > > 3.1 I would suggest the following renames:
> > > > - TotalRemoteRecordsLag -> RemoteCopyLagRecords
> > > > - TotalRemoteBytesLag -> RemoteCopyLagBytes
> > > > - DeleteRemoteLag -> RemoteDeleteLagRecords
> > > > 3.2. I agree with Kamal that having a lag based on the number of
> > segments
> > > > would be useful to include. Segments could give a faster proxy to
> > > > understand whether the lag is meaningful or not. e.g. if the number
> of
> > > > records and bytes are high, but the segment lag is only small (e.g.
> 1),
> > > it
> > > > may be ok; but if the number of segments is high, then it can be more
> > > > relevant to operators.
> > > > 3.3. Could we consider having the same metrics for Delete Lag as
> there
> > > are
> > > > for Copy Lag? i.e. including RemoteDeleteLagBytes, and (potentially)
> > > > RemoteDeleteLag for segments.
> > > > 3.4. The description of delete lag is unclear to me: I though it was
> > > about
> > > > the remote segments to be deleted (because of total retention) but
> not
> > > > deleted yet; however from description it seems that it's related to
> > local
> > > > segments that are marked for deletion. Is this correct?
> > > >
> > > > 4. On Remote Delete metrics:
> > > > - Could we also include bytes-based metric as with Copy and Fetch? t
> > > would
> > > > be useful to know how many bytes are being deleted. If aggregated and
> > > > compared with copied bytes, we can get a sense of the amount of data
> > > stored
> > > > remotely, even if not exact (only considers segment size, not
> indexes)
> > > >
> > > > 5. On RemoteLogAuxState metrics: could you elaborate a bit more on
> the
> > > > purpose of this component and why the metrics proposed are needed?
> > > >
> > > > 6. On Total Remote Log Size metrics: similarly, could you elaborate
> on
> > > why
> > > > this metric is needed? I'm missing what makes this operation as
> > relevant
> > > > (compared to other internals) to have some metrics attached -- maybe
> if
> > > you
> > > > could shared scenarios where this metrics would be useful would be
> > > helpful.
> > > >
> > > > 7. On the metrics naming: not sure the `Total*` prefix is really
> needed
> > > or
> > > > adds meaning. When I found it useful is when there are related metric
> > > that
> > > > are a subset, then the total prefix helps: e.g.
> > > > `TotalProduceRequestsPerSec` and `FailedProduceRequestsPerSec`
> > > >
> > > > Cheers,
> > > > Jorge.
> > > >
> > > >
> > > > On Tue, 24 Oct 2023 at 12:24, Christo Lolov <[email protected]>
> > > > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > Now that 3.6 has been released, I would like to bring back
> attention
> > to
> > > > the
> > > > > following KIP for adding metrics to tiered storage targeting 3.7 -
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Add+more+metrics+to+Tiered+Storage
> > > > > .
> > > > >
> > > > > Let me know your thoughts about the list of metrics and their
> > > > granularity!
> > > > >
> > > > > Best,
> > > > > Christo
> > > > >
> > > > > On Fri, 13 Oct 2023 at 10:14, Christo Lolov <
> [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Heya Gantigmaa,
> > > > > >
> > > > > > Apologies for the (very) late reply!
> > > > > >
> > > > > > Now that 3.6 has been released and reviewers have a bit more
> time I
> > > > will
> > > > > > be picking up this KIP again. I am more than happy to add useful
> > new
> > > > > > metrics to the KIP, I would just ask for a couple of days to
> review
> > > > your
> > > > > > pull request and I will come back to you.
> > > > > >
> > > > > > Best,
> > > > > > Christo
> > > > > >
> > > > > > On Mon, 25 Sept 2023 at 10:49, Gantigmaa Selenge <
> > > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > >> Hi Christo,
> > > > > >>
> > > > > >> Thank you for writing the KIP.
> > > > > >>
> > > > > >> I recently raised a PR to add metrics for tracking remote
> segment
> > > > > >> deletions
> > > > > >> (https://github.com/apache/kafka/pull/14375) but realised those
> > > > metrics
> > > > > >> were not mentioned in the original KIP-405 or KIP-930. Do you
> > think
> > > > > these
> > > > > >> would make sense to be added to this KIP and get included in the
> > > > > >> discussion?
> > > > > >>
> > > > > >> Regards,
> > > > > >> Gantigmaa
> > > > > >>
> > > > > >> On Wed, Aug 9, 2023 at 1:53 PM Christo Lolov <
> > > [email protected]>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Heya Kamal,
> > > > > >> >
> > > > > >> > Thank you for going through the KIP and for the question!
> > > > > >> >
> > > > > >> > I have been thinking about this and as an operator I might
> find
> > it
> > > > the
> > > > > >> most
> > > > > >> > useful to know all three of them actually.
> > > > > >> >
> > > > > >> > I would find knowing the size in bytes useful to determine how
> > > much
> > > > > >> disk I
> > > > > >> > might need to add temporarily to compensate for the slowdown.
> > > > > >> > I would find knowing the number of records useful, because
> using
> > > the
> > > > > >> > MessagesInPerSec metric I would be able to determine how old
> the
> > > > > records
> > > > > >> > which are facing problems are.
> > > > > >> > I would find knowing the number of segments useful because I
> > would
> > > > be
> > > > > >> able
> > > > > >> > to correlate this with whether I need to change
> > > > > >> > *remote.log.manager.task.interval.ms
> > > > > >> > <http://remote.log.manager.task.interval.ms> *to a lower or
> > > higher
> > > > > >> value.
> > > > > >> >
> > > > > >> > What are your thoughts on the above? Would you find some of
> them
> > > > more
> > > > > >> > useful than others?
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Christo
> > > > > >> >
> > > > > >> > On Tue, 8 Aug 2023 at 16:43, Kamal Chandraprakash <
> > > > > >> > [email protected]> wrote:
> > > > > >> >
> > > > > >> > > Hi Christo,
> > > > > >> > >
> > > > > >> > > Thanks for the KIP!
> > > > > >> > >
> > > > > >> > > The proposed tiered storage metrics are useful. The unit
> > > mentioned
> > > > > in
> > > > > >> the
> > > > > >> > > KIP is the number of records.
> > > > > >> > > Each topic can have varying amounts of records in a segment
> > > > > depending
> > > > > >> on
> > > > > >> > > the record size.
> > > > > >> > >
> > > > > >> > > Do you think having the tier-lag by number of segments (or)
> > size
> > > > of
> > > > > >> > > segments in bytes will be useful
> > > > > >> > > to the operator?
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Kamal
> > > > > >> > >
> > > > > >> > > On Tue, Aug 8, 2023 at 8:56 PM Christo Lolov <
> > > > > [email protected]>
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Hello all!
> > > > > >> > > >
> > > > > >> > > > I would like to start a discussion for KIP-963: Upload and
> > > > delete
> > > > > >> lag
> > > > > >> > > > metrics in Tiered Storage (
> > > > > >> > https://cwiki.apache.org/confluence/x/sZGzDw
> > > > > >> > > ).
> > > > > >> > > >
> > > > > >> > > > The purpose of this KIP is to introduce a couple of
> metrics
> > to
> > > > > track
> > > > > >> > lag
> > > > > >> > > > with respect to remote storage from the point of view of
> > > Kafka.
> > > > > >> > > >
> > > > > >> > > > Thanks in advance for leaving a review!
> > > > > >> > > >
> > > > > >> > > > Best,
> > > > > >> > > > Christo
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Reply via email to