Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Christo Lolov Fri, 17 Nov 2023 08:46:13 -0800

Heya all!

I have updated the KIP so please have another read through when you have
the time. I know we are cutting it a bit close, but I would be grateful if
I could start a vote early next week in order to get this in 3.7.


re: Satish

104. I envision that ideally we will compare this metric with
*RemoteCopyLagSegments*, yes.

re: Jorge

I have been thinking about your suggestion for RemoteDeleteBytesPerSec. The
*BytesPerSec make sense on the Copy and Read paths because there we
actually write a whole segment or read it, but on the delete path we tend
to just mark it for deletion, as such we don't really have deleted bytes
per sec. Or am I misunderstanding why you want this metric added?

I have also thought a bit more about the LocalDeleteLag and your
description. If I understand correctly you propose this metric to monitor
the segments expired due to local retention, have the .deleted suffix, but
haven't yet been actually deleted by the LogCleaner. This will serve as a
proxy for how much data which we should be serving from remote but we are
serving from local? However, I believe that the moment we add the .delete
suffix we stop serving traffic from those segments, hence we will be
serving requests for those from remote. Am I wrong?

Best,
Christo

On Thu, 16 Nov 2023 at 08:45, Satish Duggana <[email protected]>
wrote:

> Thanks Christo for your reply.
>
> 101 and 102 We have conclusion on them.
>
> 103. I am not strongly opinionated on this. I am fine if it is helpful
> for your scenarios.
>
> 104. It seems you want to compare this metric with the number of
> segments that are copied. Do you have such a metric now?
>
> Kamal and Luke,
> I agree some of the metrics are needed outside of RSM layer in remote
> fetch path. Can we take those fine grained remote fetch flow sequence
> metrics separately later?
>
> Thanks,
> Satish.
>
> On Tue, 14 Nov 2023 at 22:07, Christo Lolov <[email protected]>
> wrote:
> >
> > Heya everyone,
> >
> > Apologies for the delay in my response and thank you very much for all
> your
> > comments! I will start answering in reverse:
> >
> > *re: Satish*
> >
> > 101. I am happy to scope down this KIP and start off by emitting those
> > metrics on a topic level. I had a preference to emit them on a partition
> > level because I have ran into situations where data wasn't evenly spread
> > across partitions and not having that granularity made it harder to
> > discover.
> >
> > 102. Fair enough, others have expressed the same preference. I will scope
> > down the KIP to only bytes-based and segment-based metrics.
> >
> > 103. I agree that we could do this, but I personally prefer this to be a
> > metric. At the very least a newcomer might not know to look for the log
> > line, while most metric-collection systems allow you to explore the whole
> > namespace. For example, I really dislike that while log loading happens
> > Kafka emits log lines of "X/Y logs loaded" rather than just show me the
> > progress via a metric. If you are strongly against this, however, I am
> > happy to scope down on this as well.
> >
> > 104. Ideally we have only one metadata in remote storage for every
> segment
> > of the correct lineage. Due to leadership changes, however, this is not
> > always the case. I envisioned that exposing such a metric will showcase
> if
> > there are problems with too many metadata records not part of the correct
> > lineage of a log.
> >
> > *re: Luke*
> >
> > 1. I am a bit conflicted on this one. As discussed earlier with Jorge, in
> > my head such metrics are better left to plugin implementations. If you
> and
> > Kamal feel strongly about this being included I will add it to the KIP.
> >
> > 2. After running tiered storage in production for a while I ran into
> > problems where a partition-level metric would have allowed me to zone in
> on
> > the problem sooner. I tried balancing this with not exposing everything
> on
> > a partition level so not to explode the cardinality too much (point 101
> > from Satish). I haven't ran into a situation where knowing the
> > RemoteLogSizeComputationTime on a partition level helped me, but this
> > doesn't mean there isn't one.
> >
> > 3. I was thinking that the metric can be emitted while reading of those
> > records is happening i.e. if it takes a long time then it will just
> > gradually increase as we read. What do you think?
> >
> > *re: Jorge*
> >
> > 3.5. Sure, I will aim to add my thoughts to the KIP
> >
> > 4. Let me check and come back to you on this one. I have a vague memory
> > this wasn't as easy to calculate, but if it is, I will include
> > RemoteDeleteBytesPerSec as well.
> >
> > 7. Yes, this is I believe a better explanation than the one I have in the
> > KIP, so I will aim to update it with your one. Thank you! LocalDeleteLag
> > makes sense to me as well, I will aim to include it.
> >
> > *re: Kamal*
> >
> > 1. I can add this to the KIP, but similar to what I have mentioned
> earlier,
> > I believe these are better left to plugin implementations, no?
> >
> > 2. Yeah, this makes sense.
> >
> > Best,
> > Christo
> >
> > On Fri, 10 Nov 2023 at 09:33, Satish Duggana <[email protected]>
> > wrote:
> >
> > > Thanks Christo for the KIP and the interesting discussion.
> > >
> > > 101. Adding metrics at partition level may increase the cardinality of
> > > these metrics. We should be cautious of that and see whether they are
> > > really needed. RLM related operations do not generally affect based on
> > > partition(s) but it is mostly because of the remote storage or broker
> > > level issues.
> > >
> > > 102. I am not sure whether the records metric is much useful when we
> > > have other bytes and segments related metrics available. If needed,
> > > records level information can be derived once we have segments/bytes
> > > metrics.
> > >
> > > 103. Regarding RemoteLogSizeComputationTime, we can add logs for
> > > debugging purposes to print the required duration while computing size
> > > instead of generating a metric. If there is any degradation in remote
> > > log size computation, it will have an effect on RLM task leading to
> > > remote log copy and delete lags.
> > >
> > > RLMM and RSM implementations can always add more metrics for
> > > observability based on the respective implementations.
> > >
> > > 104. What is the purpose of RemoteLogMetadataCount as a metric?
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Fri, 10 Nov 2023 at 04:10, Jorge Esteban Quilcate Otoya
> > > <[email protected]> wrote:
> > > >
> > > > Hi Christo,
> > > >
> > > > I'd like to add another suggestion:
> > > >
> > > > 7. Adding on TS lag formulas, my understanding is that per pertition:
> > > > - RemoteCopyLag: difference between: latest local segment candidate
> for
> > > > upload - latest remote segment
> > > >   - Represents how Remote Log Manager task is handling backlog of
> > > segments.
> > > >   - Ideally, this lag is zero -- grows when upload is slower than the
> > > > increase on candidate segments to upload
> > > >
> > > > - RemoteDeleteLag: difference between: latest remote candidate
> segment to
> > > > keep based on retention - oldest remote segment
> > > >   - Represents how many segments Remote Log Manager task is missing
> to
> > > > delete at a given point in time
> > > >   - Ideally, this lag is zero -- grows when retention condition
> changes
> > > but
> > > > RLM task is not able to schedule deletion yet.
> > > >
> > > > Is my understanding of these lags correct?
> > > >
> > > > I'd like to also consider an additional lag:
> > > > - LocalDeleteLag: difference between: latest local candidate segment
> to
> > > > keep based on local retention - oldest local segment
> > > >   - Represents how many segments are still available locally when
> they
> > > are
> > > > candidate for deletion. This usually happens when log cleaner has not
> > > been
> > > > scheduled yet. It's important because it represents how much data is
> > > stored
> > > > locally when it could be removed, and it represents how much data
> that
> > > can
> > > > be sourced from remote tier is still been sourced from local tier.
> > > >   - Ideally, this lag returns to zero when log cleaner runs; but
> could be
> > > > growing if there are issues uploading data (other lag) or removing
> data
> > > > internally.
> > > >
> > > > Thanks,
> > > > Jorge.
> > > >
> > > > On Thu, 9 Nov 2023 at 10:51, Luke Chen <[email protected]> wrote:
> > > >
> > > > > Hi Christo,
> > > > >
> > > > > Thanks for the KIP!
> > > > >
> > > > > Some comments:
> > > > > 1. I agree with Kamal that a metric to cover the time taken to read
> > > data
> > > > > from remote storage is helpful.
> > > > >
> > > > > 2. I can see there are some metrics are only on topic level, but
> some
> > > are
> > > > > on partition level.
> > > > > Could you explain why some of them are only on topic level?
> > > > > Like RemoteLogSizeComputationTime, it's different from partition to
> > > > > partition, will it be better to be exposed as partition metric?
> > > > >
> > > > > 3. `RemoteLogSizeBytes` metric hanging.
> > > > > To compute the RemoteLogSizeBytes, we need to wait until all
> records
> > > in the
> > > > > metadata topic loaded.
> > > > > What will happen if it takes long to load the data from metadata
> topic?
> > > > > Should we instead return -1 or something to indicate it's still
> > > loading?
> > > > >
> > > > > Thanks.
> > > > > Luke
> > > > >
> > > > > On Fri, Nov 3, 2023 at 1:53 AM Kamal Chandraprakash <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Hi Christo,
> > > > > >
> > > > > > Thanks for expanding the scope of the KIP!  We should also cover
> the
> > > time
> > > > > > taken to
> > > > > > read data from remote storage. This will give our users a fair
> idea
> > > about
> > > > > > the P99, P95,
> > > > > > and P50 Fetch latency to read data from remote storage.
> > > > > >
> > > > > > The Fetch API request metrics currently provides a breakdown of
> the
> > > time
> > > > > > spent on each item:
> > > > > >
> > > > > >
> > > > >
> > >
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L517
> > > > > > Should we also provide `RemoteStorageTimeMs` item (only for FETCH
> > > API) so
> > > > > > that users can
> > > > > > understand the overall and per-step time taken?
> > > > > >
> > > > > > Regarding the Remote deletion metrics, should we also emit a
> metric
> > > to
> > > > > > expose the oldest segment time?
> > > > > > Users can configure the topic retention either by size (or)
> time. If
> > > time
> > > > > > is configured, then emitting
> > > > > > the oldest segment time allows the user to configure an alert on
> top
> > > of
> > > > > it
> > > > > > and act accordingly.
> > > > > >
> > > > > > On Wed, Nov 1, 2023 at 7:07 PM Jorge Esteban Quilcate Otoya <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Thanks, Christo!
> > > > > > >
> > > > > > > 1. Agree. Having a further look into how many latency metrics
> are
> > > > > > included
> > > > > > > on the broker side there are only a few of them (e.g. request
> > > > > lifecycle)
> > > > > > —
> > > > > > > but seems mostly delegated to clients, or plugin in this case,
> to
> > > > > measure
> > > > > > > this.
> > > > > > >
> > > > > > > 3.2. Personally, I find the record-based lag less useful as
> records
> > > > > can't
> > > > > > > be relied as a stable unit of measure. So, if we can keep
> bytes-
> > > and
> > > > > > > segment-based lag, LGTM.
> > > > > > > 3.4.  Agree, these metrics should be on the broker side.
> Though if
> > > > > plugin
> > > > > > > decides to take deletion as a background process, then it
> should
> > > have
> > > > > > it's
> > > > > > > own metrics. That's why I was thinking the calculation should
> be
> > > fairly
> > > > > > > similar to the CopyLag: "these segments are available for
> deletion
> > > but
> > > > > > > haven't been deleted yet"
> > > > > > > 3.5. For lag metrics: could we add an explanation on how each
> lag
> > > will
> > > > > be
> > > > > > > calculated, e.g. using which values, from which components,
> under
> > > which
> > > > > > > circumstances do we expect these values to increase/decrease,
> etc.
> > > This
> > > > > > > will clarify 3.4. and make it easier to agree and eventually
> test.
> > > > > > >
> > > > > > > 4. Sorry I wasn't clear. I meant similar to
> > > `RemoteCopyBytesPerSec` and
> > > > > > > `RemoteFetchBytesPerSec`, we could consider to include
> > > > > > > `RemoteDeleteBytesPerSec`.
> > > > > > >
> > > > > > > 5. and 6. Thanks for the explanation! It surely benefits to
> have
> > > these
> > > > > as
> > > > > > > part of the set of metrics.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jorge.
> > > > > > >
> > > > > > > On Mon, 30 Oct 2023 at 16:07, Christo Lolov <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Heya Jorge,
> > > > > > > >
> > > > > > > > Thank you for the insightful comments!
> > > > > > > >
> > > > > > > > 1. I see a value in such latency metrics but in my opinion
> the
> > > > > correct
> > > > > > > > location for such metrics is in the plugins providing the
> > > underlying
> > > > > > > > functionality. What are your thoughts on the matter?
> > > > > > > >
> > > > > > > > 2. Okay, I will look for and adjust the formatting
> > > today/tomorrow!
> > > > > > > >
> > > > > > > > 3.1 Done.
> > > > > > > > 3.2 Sure, I will add this to the KIP later today, the
> suggestion
> > > > > makes
> > > > > > > > sense to me. However, my question is, would you still find
> value
> > > in
> > > > > > > > emitting metrics for all three i.e. RemoteCopyLagRecords,
> > > > > > > > RemoteCopyLagBytes and RemoteCopyLagSegments or would you
> only
> > > keep
> > > > > > > > RemoteCopyLagBytes and RemoteCopyLagSegments?
> > > > > > > > 3.3. Yes, RemoteDeleteLagRecords was supposed to be an
> > > equivalent of
> > > > > > > > RemoteCopyLagRecords. Once I have your opinion on 3.2 I will
> > > make the
> > > > > > > > respective changes.
> > > > > > > > 3.4. I envision these metrics to be added to Kafka rather
> than
> > > the
> > > > > > > plugins.
> > > > > > > > Today Kafka sends deletes to remote storage but does not know
> > > whether
> > > > > > > those
> > > > > > > > segments have been deleted immediately when the request has
> been
> > > sent
> > > > > > or
> > > > > > > > have been given to a background process to carry out the
> actual
> > > > > > > reclamation
> > > > > > > > of space. The purpose of this metric is to give an estimate
> in
> > > time
> > > > > > which
> > > > > > > > says "hey, we have called this many segments or bytes to be
> > > deleted".
> > > > > > > >
> > > > > > > > 4. I believe this goes down the same line of thinking as
> what you
> > > > > > > mentioned
> > > > > > > > in 3.3 - have I misunderstood something?
> > > > > > > >
> > > > > > > > 5. I have on a number of occasions found I do not have a
> metric
> > > to
> > > > > > > quickly
> > > > > > > > point me to what part of tiered storage functionality is
> > > experiencing
> > > > > > an
> > > > > > > > issue, in some scenarios a follower failing to build an
> auxiliary
> > > > > > state.
> > > > > > > An
> > > > > > > > increase in the number of BuildRemoteLogAuxState requests per
> > > second
> > > > > > can
> > > > > > > > uncover problems for specific topics warranting a further
> > > > > > investigation,
> > > > > > > > which I tend to find difficult to judge purely based on
> parsing
> > > log
> > > > > > > > statements. An increase in the number of errors can quickly
> zone
> > > in
> > > > > on
> > > > > > > > followers failing as part of tiered storage and point me to
> look
> > > in
> > > > > the
> > > > > > > > logs specifically for that component.
> > > > > > > >
> > > > > > > > 6. I find it useful start my investigations with respect to
> > > tiering
> > > > > > > > problems by checking the rough size distribution of topics in
> > > remote.
> > > > > > > From
> > > > > > > > then on I try to correlate whether a historically high-volume
> > > topic
> > > > > > > started
> > > > > > > > experiencing a decrease in volume due to a decrease in
> produce
> > > > > traffic
> > > > > > to
> > > > > > > > that topic or due to an increase in lag on local storage due
> to
> > > the
> > > > > > > broker
> > > > > > > > slowing down for whatever reason. Besides correlation I
> would use
> > > > > such
> > > > > > a
> > > > > > > > metric to also confirm whether my rate calculations are
> correct
> > > i.e.
> > > > > if
> > > > > > > > topic A receives X MB/s and rolls a segment every Y seconds
> with
> > > an
> > > > > > > upload
> > > > > > > > rate of Z MB/s do I see that much data actually being
> written in
> > > > > remote
> > > > > > > > storage. Do these two scenarios demonstrate the usefulness I
> > > would
> > > > > have
> > > > > > > > from such a metric and do the benefits make sense to you?
> > > > > > > >
> > > > > > > > 7. I agree. I have changed TotalRemoteLogSizeComputationTime,
> > > > > > > > TotalRemoteLogSizeBytes, and TotalRemoteLogMetadataCount to
> > > > > > > > RemoteLogSizeComputationTime, RemoteLogSizeBytes and
> > > > > > > RemoteLogMetadataCount
> > > > > > > > respectively.
> > > > > > > >
> > > > > > > > On Fri, 27 Oct 2023 at 15:24, Jorge Esteban Quilcate Otoya <
> > > > > > > > [email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Christo,
> > > > > > > > >
> > > > > > > > > Thanks for proposing KIP, this metrics will certainly be
> > > useful to
> > > > > > > > operate
> > > > > > > > > Kafka Tiered Storage as it becomes production-ready.
> > > > > > > > >
> > > > > > > > > 1. Given that the scope of the KIPs has grown to cover more
> > > > > metrics,
> > > > > > > what
> > > > > > > > > do you think about introducing latency metrics for RSM
> > > operations?
> > > > > > > > > Copy and delete time metrics are quite obvious/simple on
> what
> > > they
> > > > > > > > > represent; but fetch latency metrics would be helpful as
> remote
> > > > > > > fetching
> > > > > > > > > clients directly. e.g. having a "time to first byte" metric
> > > could
> > > > > > help
> > > > > > > to
> > > > > > > > > know how much time is introduced by the remote tier to
> start
> > > > > serving
> > > > > > > > > results to the consumer, or measuring how long it takes to
> > > return a
> > > > > > > > > response to consumers.
> > > > > > > > >
> > > > > > > > > 2. Couple of requests/nits on the metrics table, could you:
> > > > > > > > > - highlight the names (or have them on a separate column,
> as
> > > you
> > > > > > > prefer)
> > > > > > > > to
> > > > > > > > > make it easier to read? If you choose to have another
> column,
> > > maybe
> > > > > > > sort
> > > > > > > > > them as "Name, Description, MBean" and adjust the width.
> > > > > > > > > - group the related metrics in separate groups, e.g. Lag,
> > > Remote
> > > > > > > Delete,
> > > > > > > > > Remote Log Aux State, Remote Log Size; so we can elaborate
> on
> > > why
> > > > > > these
> > > > > > > > set
> > > > > > > > > of metrics are needed. Maybe adding some examples on usage
> and
> > > how
> > > > > > > > > actionable they are as the ones shared in previous emails
> > > would be
> > > > > > > useful
> > > > > > > > > to keep as part of the KIP.
> > > > > > > > >
> > > > > > > > > 3. On Lag metrics:
> > > > > > > > > 3.1 I would suggest the following renames:
> > > > > > > > > - TotalRemoteRecordsLag -> RemoteCopyLagRecords
> > > > > > > > > - TotalRemoteBytesLag -> RemoteCopyLagBytes
> > > > > > > > > - DeleteRemoteLag -> RemoteDeleteLagRecords
> > > > > > > > > 3.2. I agree with Kamal that having a lag based on the
> number
> > > of
> > > > > > > segments
> > > > > > > > > would be useful to include. Segments could give a faster
> proxy
> > > to
> > > > > > > > > understand whether the lag is meaningful or not. e.g. if
> the
> > > number
> > > > > > of
> > > > > > > > > records and bytes are high, but the segment lag is only
> small
> > > (e.g.
> > > > > > 1),
> > > > > > > > it
> > > > > > > > > may be ok; but if the number of segments is high, then it
> can
> > > be
> > > > > more
> > > > > > > > > relevant to operators.
> > > > > > > > > 3.3. Could we consider having the same metrics for Delete
> Lag
> > > as
> > > > > > there
> > > > > > > > are
> > > > > > > > > for Copy Lag? i.e. including RemoteDeleteLagBytes, and
> > > > > (potentially)
> > > > > > > > > RemoteDeleteLag for segments.
> > > > > > > > > 3.4. The description of delete lag is unclear to me: I
> though
> > > it
> > > > > was
> > > > > > > > about
> > > > > > > > > the remote segments to be deleted (because of total
> retention)
> > > but
> > > > > > not
> > > > > > > > > deleted yet; however from description it seems that it's
> > > related to
> > > > > > > local
> > > > > > > > > segments that are marked for deletion. Is this correct?
> > > > > > > > >
> > > > > > > > > 4. On Remote Delete metrics:
> > > > > > > > > - Could we also include bytes-based metric as with Copy and
> > > Fetch?
> > > > > t
> > > > > > > > would
> > > > > > > > > be useful to know how many bytes are being deleted. If
> > > aggregated
> > > > > and
> > > > > > > > > compared with copied bytes, we can get a sense of the
> amount of
> > > > > data
> > > > > > > > stored
> > > > > > > > > remotely, even if not exact (only considers segment size,
> not
> > > > > > indexes)
> > > > > > > > >
> > > > > > > > > 5. On RemoteLogAuxState metrics: could you elaborate a bit
> > > more on
> > > > > > the
> > > > > > > > > purpose of this component and why the metrics proposed are
> > > needed?
> > > > > > > > >
> > > > > > > > > 6. On Total Remote Log Size metrics: similarly, could you
> > > elaborate
> > > > > > on
> > > > > > > > why
> > > > > > > > > this metric is needed? I'm missing what makes this
> operation as
> > > > > > > relevant
> > > > > > > > > (compared to other internals) to have some metrics
> attached --
> > > > > maybe
> > > > > > if
> > > > > > > > you
> > > > > > > > > could shared scenarios where this metrics would be useful
> > > would be
> > > > > > > > helpful.
> > > > > > > > >
> > > > > > > > > 7. On the metrics naming: not sure the `Total*` prefix is
> > > really
> > > > > > needed
> > > > > > > > or
> > > > > > > > > adds meaning. When I found it useful is when there are
> related
> > > > > metric
> > > > > > > > that
> > > > > > > > > are a subset, then the total prefix helps: e.g.
> > > > > > > > > `TotalProduceRequestsPerSec` and
> `FailedProduceRequestsPerSec`
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Jorge.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, 24 Oct 2023 at 12:24, Christo Lolov <
> > > > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hello all,
> > > > > > > > > >
> > > > > > > > > > Now that 3.6 has been released, I would like to bring
> back
> > > > > > attention
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > following KIP for adding metrics to tiered storage
> targeting
> > > 3.7
> > > > > -
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Add+more+metrics+to+Tiered+Storage
> > > > > > > > > > .
> > > > > > > > > >
> > > > > > > > > > Let me know your thoughts about the list of metrics and
> their
> > > > > > > > > granularity!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Christo
> > > > > > > > > >
> > > > > > > > > > On Fri, 13 Oct 2023 at 10:14, Christo Lolov <
> > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Heya Gantigmaa,
> > > > > > > > > > >
> > > > > > > > > > > Apologies for the (very) late reply!
> > > > > > > > > > >
> > > > > > > > > > > Now that 3.6 has been released and reviewers have a bit
> > > more
> > > > > > time I
> > > > > > > > > will
> > > > > > > > > > > be picking up this KIP again. I am more than happy to
> add
> > > > > useful
> > > > > > > new
> > > > > > > > > > > metrics to the KIP, I would just ask for a couple of
> days
> > > to
> > > > > > review
> > > > > > > > > your
> > > > > > > > > > > pull request and I will come back to you.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Christo
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 25 Sept 2023 at 10:49, Gantigmaa Selenge <
> > > > > > > > [email protected]>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Hi Christo,
> > > > > > > > > > >>
> > > > > > > > > > >> Thank you for writing the KIP.
> > > > > > > > > > >>
> > > > > > > > > > >> I recently raised a PR to add metrics for tracking
> remote
> > > > > > segment
> > > > > > > > > > >> deletions
> > > > > > > > > > >> (https://github.com/apache/kafka/pull/14375) but
> realised
> > > > > those
> > > > > > > > > metrics
> > > > > > > > > > >> were not mentioned in the original KIP-405 or
> KIP-930. Do
> > > you
> > > > > > > think
> > > > > > > > > > these
> > > > > > > > > > >> would make sense to be added to this KIP and get
> included
> > > in
> > > > > the
> > > > > > > > > > >> discussion?
> > > > > > > > > > >>
> > > > > > > > > > >> Regards,
> > > > > > > > > > >> Gantigmaa
> > > > > > > > > > >>
> > > > > > > > > > >> On Wed, Aug 9, 2023 at 1:53 PM Christo Lolov <
> > > > > > > > [email protected]>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Heya Kamal,
> > > > > > > > > > >> >
> > > > > > > > > > >> > Thank you for going through the KIP and for the
> > > question!
> > > > > > > > > > >> >
> > > > > > > > > > >> > I have been thinking about this and as an operator I
> > > might
> > > > > > find
> > > > > > > it
> > > > > > > > > the
> > > > > > > > > > >> most
> > > > > > > > > > >> > useful to know all three of them actually.
> > > > > > > > > > >> >
> > > > > > > > > > >> > I would find knowing the size in bytes useful to
> > > determine
> > > > > how
> > > > > > > > much
> > > > > > > > > > >> disk I
> > > > > > > > > > >> > might need to add temporarily to compensate for the
> > > > > slowdown.
> > > > > > > > > > >> > I would find knowing the number of records useful,
> > > because
> > > > > > using
> > > > > > > > the
> > > > > > > > > > >> > MessagesInPerSec metric I would be able to determine
> > > how old
> > > > > > the
> > > > > > > > > > records
> > > > > > > > > > >> > which are facing problems are.
> > > > > > > > > > >> > I would find knowing the number of segments useful
> > > because I
> > > > > > > would
> > > > > > > > > be
> > > > > > > > > > >> able
> > > > > > > > > > >> > to correlate this with whether I need to change
> > > > > > > > > > >> > *remote.log.manager.task.interval.ms
> > > > > > > > > > >> > <http://remote.log.manager.task.interval.ms> *to a
> > > lower or
> > > > > > > > higher
> > > > > > > > > > >> value.
> > > > > > > > > > >> >
> > > > > > > > > > >> > What are your thoughts on the above? Would you find
> > > some of
> > > > > > them
> > > > > > > > > more
> > > > > > > > > > >> > useful than others?
> > > > > > > > > > >> >
> > > > > > > > > > >> > Best,
> > > > > > > > > > >> > Christo
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Tue, 8 Aug 2023 at 16:43, Kamal Chandraprakash <
> > > > > > > > > > >> > [email protected]> wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > Hi Christo,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks for the KIP!
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > The proposed tiered storage metrics are useful.
> The
> > > unit
> > > > > > > > mentioned
> > > > > > > > > > in
> > > > > > > > > > >> the
> > > > > > > > > > >> > > KIP is the number of records.
> > > > > > > > > > >> > > Each topic can have varying amounts of records in
> a
> > > > > segment
> > > > > > > > > > depending
> > > > > > > > > > >> on
> > > > > > > > > > >> > > the record size.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Do you think having the tier-lag by number of
> segments
> > > > > (or)
> > > > > > > size
> > > > > > > > > of
> > > > > > > > > > >> > > segments in bytes will be useful
> > > > > > > > > > >> > > to the operator?
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks,
> > > > > > > > > > >> > > Kamal
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > On Tue, Aug 8, 2023 at 8:56 PM Christo Lolov <
> > > > > > > > > > [email protected]>
> > > > > > > > > > >> > > wrote:
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > Hello all!
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > I would like to start a discussion for KIP-963:
> > > Upload
> > > > > and
> > > > > > > > > delete
> > > > > > > > > > >> lag
> > > > > > > > > > >> > > > metrics in Tiered Storage (
> > > > > > > > > > >> > https://cwiki.apache.org/confluence/x/sZGzDw
> > > > > > > > > > >> > > ).
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > The purpose of this KIP is to introduce a
> couple of
> > > > > > metrics
> > > > > > > to
> > > > > > > > > > track
> > > > > > > > > > >> > lag
> > > > > > > > > > >> > > > with respect to remote storage from the point of
> > > view of
> > > > > > > > Kafka.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > Thanks in advance for leaving a review!
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > Best,
> > > > > > > > > > >> > > > Christo
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
>

Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

Reply via email to