Re: [DISCUSS] KIP-971 Expose replication-offset-lag MirrorMaker2 metric

Mickael Maison Mon, 10 Feb 2025 10:41:50 -0800

Hi,

I realized I did not include the MBeans attribute in my previous
email. I meant the records-lag attribute from
kafka.consumer:type=consumer-fetch-manager-metrics,partition="{partition}",topic="{topic}",client-id="{client-id}".
Quickly looking at the code, that metric seems to come from the same
code path as consumer.currentLag().


Thanks,
Mickael

On Mon, Feb 10, 2025 at 7:30 PM Dániel Urbán <urb.dani...@gmail.com> wrote:
>
> Hi,
> I'm very late to this discussion, but I saw the part about the performance
> impact of fetching the LEOs - I was curious how that was exactly
> experienced? Was it using consumer.endOffsets()?
> There is also a consumer.currentLag() method, which uses locally-cached LEO
> info. It might not be as precise as frequent end offset queries, and it
> does not guarantee a return value on each call, but in my experience it
> doesn't cause the same overhead as the .endOffsets() call, so it might be
> worth a shot.
> Daniel
>
>
> Mickael Maison <mickael.mai...@gmail.com> ezt írta (időpont: 2025. febr.
> 10., H, 16:44):
>
> > Hi,
> >
> > I looked at this again today and I was wondering how if would compare
> > against the lag reported by consumers. Could we use the
> >
> > kafka.consumer:type=consumer-fetch-manager-metrics,partition="{partition}",topic="{topic}",client-id="{client-id}"
> > metric that already exists? If not, let's add this to the rejected
> > alternatives and explain why.
> >
> > Thanks,
> > Mickael
> >
> > On Mon, Jul 29, 2024 at 4:26 PM Mickael Maison <mickael.mai...@gmail.com>
> > wrote:
> > >
> > > Hi,
> > >
> > > > My thinking is that any partition could go stale if there are no
> > records
> > > being produced into it.
> > >
> > > 1. Why would the value become stale if there are no new records? The
> > > lag should stay the same, no?
> > >
> > > > If enough of such partitions are present and are owned by a single MM
> > task, an OOM could happen.
> > >
> > > 2. We already have a dozen of metrics per partition [0]. Why do you
> > > think adding a few more would cause OutOfMemory errors?
> > > Each task should only emit metrics for partitions it owns.
> > >
> > > > Regarding the scenario where the TTL value is lower than the refresh
> > interval - I believe that this is an edge that we need to document and
> > prevent against, for example either failing to start on such a combination
> > or resorting to a default value that would satisfy the constraint and
> > logging an error.
> > >
> > > 3. Can you add the behavior you propose in the KIP?
> > >
> > > Thanks,
> > > Mickael
> > >
> > > 0:
> > https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceMetrics.java#L71-L106
> > >
> > > On Wed, May 22, 2024 at 9:18 PM Elxan Eminov <elxanemino...@gmail.com>
> > wrote:
> > > >
> > > > Hey Mickael,
> > > > Just checking to see if you have any thoughts on this.
> > > > thanks!
> > > >
> > > > On Thu, 11 Apr 2024 at 15:11, Elxan Eminov <elxanemino...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Mickael!
> > > > > Any thoughts on this?
> > > > > Thanks!
> > > > >
> > > > > On Wed, 3 Apr 2024 at 13:21, Elxan Eminov <elxanemino...@gmail.com>
> > wrote:
> > > > >
> > > > >> Hi Mickael,
> > > > >> Thanks for your response and apologies for a huge delay in mine.
> > > > >>
> > > > >> My thinking is that any partition could go stale if there are no
> > records
> > > > >> being produced into it. If enough of such partitions are present
> > and are
> > > > >> owned by a single MM task, an OOM could happen.
> > > > >>
> > > > >> Regarding the scenario where the TTL value is lower than the refresh
> > > > >> interval - I believe that this is an edge that we need to document
> > and
> > > > >> prevent against, for example either failing to start on such a
> > combination
> > > > >> or resorting to a default value that would satisfy the constraint
> > and
> > > > >> logging an error.
> > > > >>
> > > > >> Thanks,
> > > > >> Elkhan
> > > > >>
> > > > >> On Thu, 8 Feb 2024 at 14:17, Mickael Maison <
> > mickael.mai...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>>
> > > > >>> Thanks for the updates.
> > > > >>> I'm wondering whether we really need the ttl eviction mechanism.
> > The
> > > > >>> motivation is to "avoid storing stale LRO entries which can cause
> > an
> > > > >>> eventual OOM error". How could it contain stake entries? I would
> > > > >>> expect its cache to only contain entries for partitions assigned to
> > > > >>> the task that owns it. Also what is the expected behavior if
> > there's
> > > > >>> no available LRO in the cache? If we keep this mechanism what
> > happens
> > > > >>> if its value is lower than
> > > > >>> replication.record.lag.metric.refresh.interval?
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Mickael
> > > > >>>
> > > > >>> On Mon, Feb 5, 2024 at 5:23 PM Elxan Eminov <
> > elxanemino...@gmail.com>
> > > > >>> wrote:
> > > > >>> >
> > > > >>> > Hi Mickael!
> > > > >>> > Any further thoughts on this?
> > > > >>> >
> > > > >>> > Thanks,
> > > > >>> > Elkhan
> > > > >>> >
> > > > >>> > On Thu, 18 Jan 2024 at 11:53, Mickael Maison <
> > mickael.mai...@gmail.com
> > > > >>> >
> > > > >>> > wrote:
> > > > >>> >
> > > > >>> > > Hi Elxan,
> > > > >>> > >
> > > > >>> > > Thanks for the updates.
> > > > >>> > >
> > > > >>> > > We used dots to separate words in configuration names, so I
> > think
> > > > >>> > > replication.offset.lag.metric.last-replicated-offset.ttl
> > should be
> > > > >>> > > named replication.offset.lag.metric.last.replicated.offset.ttl
> > > > >>> > > instead.
> > > > >>> > >
> > > > >>> > > About the names of the metrics, fair enough if you prefer
> > keeping the
> > > > >>> > > replication prefix. Out of the alternatives you mentioned, I
> > think I
> > > > >>> > > prefer replication-record-lag. I think the metrics and
> > configuration
> > > > >>> > > names should match too. Let's see what the others think about
> > it.
> > > > >>> > >
> > > > >>> > > Thanks,
> > > > >>> > > Mickael
> > > > >>> > >
> > > > >>> > > On Mon, Jan 15, 2024 at 9:50 PM Elxan Eminov <
> > > > >>> elxanemino...@gmail.com>
> > > > >>> > > wrote:
> > > > >>> > > >
> > > > >>> > > > Apologies, forgot to reply on your last comment about the
> > metric
> > > > >>> name.
> > > > >>> > > > I believe both replication-lag and record-lag are a little
> > too
> > > > >>> abstract -
> > > > >>> > > > what do you think about either leaving it as
> > > > >>> replication-offset-lag or
> > > > >>> > > > renaming to replication-record-lag?
> > > > >>> > > >
> > > > >>> > > > Thanks
> > > > >>> > > >
> > > > >>> > > > On Wed, 10 Jan 2024 at 15:31, Mickael Maison <
> > > > >>> mickael.mai...@gmail.com>
> > > > >>> > > > wrote:
> > > > >>> > > >
> > > > >>> > > > > Hi Elxan,
> > > > >>> > > > >
> > > > >>> > > > > Thanks for the KIP, it looks like a useful addition.
> > > > >>> > > > >
> > > > >>> > > > > Can you add to the KIP the default value you propose for
> > > > >>> > > > > replication.lag.metric.refresh.interval? In MirrorMaker
> > most
> > > > >>> interval
> > > > >>> > > > > configs can be set to -1 to disable them, will it be the
> > case
> > > > >>> for this
> > > > >>> > > > > new feature or will this setting only accept positive
> > values?
> > > > >>> > > > > I also wonder if replication-lag, or record-lag would be
> > clearer
> > > > >>> names
> > > > >>> > > > > instead of replication-offset-lag, WDYT?
> > > > >>> > > > >
> > > > >>> > > > > Thanks,
> > > > >>> > > > > Mickael
> > > > >>> > > > >
> > > > >>> > > > > On Wed, Jan 3, 2024 at 6:15 PM Elxan Eminov <
> > > > >>> elxanemino...@gmail.com>
> > > > >>> > > > > wrote:
> > > > >>> > > > > >
> > > > >>> > > > > > Hi all,
> > > > >>> > > > > > Here is the vote thread:
> > > > >>> > > > > >
> > > > >>> https://lists.apache.org/thread/ftlnolcrh858dry89sjg06mdcdj9mrqv
> > > > >>> > > > > >
> > > > >>> > > > > > Cheers!
> > > > >>> > > > > >
> > > > >>> > > > > > On Wed, 27 Dec 2023 at 11:23, Elxan Eminov <
> > > > >>> elxanemino...@gmail.com>
> > > > >>> > > > > wrote:
> > > > >>> > > > > >
> > > > >>> > > > > > > Hi all,
> > > > >>> > > > > > > I've updated the KIP with the details we discussed in
> > this
> > > > >>> thread.
> > > > >>> > > > > > > I'll call in a vote after the holidays if everything
> > looks
> > > > >>> good.
> > > > >>> > > > > > > Thanks!
> > > > >>> > > > > > >
> > > > >>> > > > > > > On Sat, 26 Aug 2023 at 15:49, Elxan Eminov <
> > > > >>> > > elxanemino...@gmail.com>
> > > > >>> > > > > > > wrote:
> > > > >>> > > > > > >
> > > > >>> > > > > > >> Relatively minor change with a new metric for MM2
> > > > >>> > > > > > >>
> > > > >>> > > > > > >>
> > > > >>> > > > >
> > > > >>> > >
> > > > >>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-971%3A+Expose+replication-offset-lag+MirrorMaker2+metric
> > > > >>> > > > > > >>
> > > > >>> > > > > > >
> > > > >>> > > > >
> > > > >>> > >
> > > > >>>
> > > > >>
> >

Re: [DISCUSS] KIP-971 Expose replication-offset-lag MirrorMaker2 metric

Reply via email to