Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-06-12 Thread Federico Valeri
Hi Luke, thanks for the KIP.

I think the "dir" key is missing from the "remainingLogsToRecover" ObjectName.

kafka.log:type=LogManager,name=remainingLogsToRecover,dir=([-._\/\w\d\s]+)
kafka.log:type=LogManager,name=remainingSegmentsToRecover,dir=([-._\/\w\d\s]+),threadNum=([0-9]+)

Example:

Broker configuration:
log.dirs=/tmp/log1,/tmp/log2
num.recovery.threads.per.data.dir=2

Registered ObjectNames:
kafka.log:type=LogManager,name=remainingLogsToRecover,dir=/tmp/log1
kafka.log:type=LogManager,name=remainingLogsToRecover,dir=/tmp/log2
kafka.log:type=LogManager,name=remainingSegmentsToRecover,dir=/tmp/log1,threadNum=0
kafka.log:type=LogManager,name=remainingSegmentsToRecover,dir=/tmp/log1,threadNum=1
kafka.log:type=LogManager,name=remainingSegmentsToRecover,dir=/tmp/log2,threadNum=0
kafka.log:type=LogManager,name=remainingSegmentsToRecover,dir=/tmp/log2,threadNum=1
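For monitoring scripts that scrape these names as flat strings, the tag values can be recovered with the patterns above. A minimal sketch (a real JMX client would read `dir` and `threadNum` as ObjectName key properties instead; the function name here is made up):

```python
import re

# Regexes mirroring the ObjectName patterns in the KIP; the capture groups
# extract the "dir" and "threadNum" tag values.
LOGS_RE = re.compile(
    r"kafka\.log:type=LogManager,name=remainingLogsToRecover,"
    r"dir=([-._/\w\d\s]+)")
SEGMENTS_RE = re.compile(
    r"kafka\.log:type=LogManager,name=remainingSegmentsToRecover,"
    r"dir=([-._/\w\d\s]+),threadNum=([0-9]+)")

def parse_segments_name(object_name):
    """Return (dir, thread_num) from a remainingSegmentsToRecover ObjectName."""
    m = SEGMENTS_RE.fullmatch(object_name)
    if m is None:
        raise ValueError("unexpected ObjectName: " + object_name)
    return m.group(1), int(m.group(2))

print(parse_segments_name(
    "kafka.log:type=LogManager,name=remainingSegmentsToRecover,"
    "dir=/tmp/log1,threadNum=0"))
# -> ('/tmp/log1', 0)
```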

On Sat, Jun 4, 2022 at 4:42 AM Luke Chen  wrote:
>
> Hi Jun,
>
> Thanks for the comment.
>
> I've updated the KIP as:
> 1. remainingLogsToRecover*y* -> remainingLogsToRecover
> 2. remainingSegmentsToRecover*y* -> remainingSegmentsToRecover
> 3. The description of remainingSegmentsToRecover: The remaining segments
> for the current log assigned to the recovery thread.
>
> Thank you.
> Luke
>
> On Sat, Jun 4, 2022 at 12:54 AM Jun Rao  wrote:
>
> > Hi, Luke,
> >
> > Thanks for the explanation.
> >
> > 10. It makes sense to me now. Instead of using a longer name, perhaps we
> > could keep the current name, but make the description clear that it's the
> > remaining segments for the current log assigned to a thread. Also, would it
> > be better to use ToRecover instead of ToRecovery?
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Jun 3, 2022 at 1:18 AM Luke Chen  wrote:
> >
> > > Hi Jun,
> > >
> > > > how do we implement kafka.log
> > >
> > >
> > :type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
> > > which tracks at the segment level?
> > >
> > > It looks like the name is misleading.
> > > Suppose we have 2 log recovery threads
> > > (num.recovery.threads.per.data.dir=2),
> > > and 10 logs to iterate through
> > > As mentioned before, we don't know how many segments in each log until
> > the
> > > log is iterated(loaded)
> > > So, when thread-1 iterates logA, it gets the segments to recover, and
> > > expose the number to `remainingSegmentsToRecovery` metric.
> > > And the same for thread-2 iterating logB.
> > > That is, the metric showed in `remainingSegmentsToRecovery` is actually
> > the
> > > remaining segments to recover "in a specific log".
> > >
> > > Maybe I should rename it: remainingSegmentsToRecovery ->
> > > remainingSegmentsToRecoverInCurrentLog
> > > WDYT?
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Fri, Jun 3, 2022 at 1:27 AM Jun Rao  wrote:
> > >
> > > > Hi, Luke,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 10. You are saying it's difficult to track the number of segments to
> > > > recover. But how do we
> > > > implement
> > > >
> > >
> > kafka.log:type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
> > > > which tracks at the segment level?
> > > >
> > > > Jun
> > > >
> > > > On Thu, Jun 2, 2022 at 3:39 AM Luke Chen  wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thanks for the comment.
> > > > >
> > > > > Yes, I've tried to work on this way to track the number of remaining
> > > > > segments, but it will change the design in UnifiedLog, so I only
> > track
> > > > the
> > > > > logs number.
> > > > > Currently, we will load all segments and recover those segments if
> > > needed
> > > > > "during creating UnifiedLog instance". And also get the log offsets
> > > here
> > > > > <
> > > > >
> > > >
> > >
> > https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/UnifiedLog.scala#L1819-L1842
> > > > > >
> > > > > .
> > > > > That is, if we want to get all segments to be recovered before
> > running
> > > > log
> > > > > recovery, we need to break the logic in UnifiedLog, to create a
> > partial
> > > > > UnifiedLog instance, and add more info to it later, which I think is
> > > just
> > > > > making the codes more complicated.
> > > > >
> > > > >
> > > > > Thank you.
> > > > > Luke
> > > > >
> > > > >
> > > > >
> > > > > On Thu, May 26, 2022 at 2:57 AM Jun Rao 
> > > > wrote:
> > > > >
> > > > > > Hi, Luke,
> > > > > >
> > > > > > Thanks for the KIP. Just one comment.
> > > > > >
> > > > > > 10. For kafka.log:type=LogManager,name=remainingLogsToRecovery,
> > could
> > > > we
> > > > > > instead track the number of remaining segments? This monitors the
> > > > > progress
> > > > > > at a finer granularity and is also consistent with the thread level
> > > > > metric.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Wed, May 25, 2022 at 7:47 AM Tom Bentley 
> > > > wrote:
> > > > > >
> > > > > > > Thanks Luke! LGTM.
> > > > > > >
> > > > > > > On Sun, 22 May 2022 at 05:18, Luke Chen 
> > 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-06-03 Thread Luke Chen
Hi Jun,

Thanks for the comment.

I've updated the KIP as:
1. remainingLogsToRecover*y* -> remainingLogsToRecover
2. remainingSegmentsToRecover*y* -> remainingSegmentsToRecover
3. The description of remainingSegmentsToRecover: The remaining segments
for the current log assigned to the recovery thread.

Thank you.
Luke

On Sat, Jun 4, 2022 at 12:54 AM Jun Rao  wrote:

> Hi, Luke,
>
> Thanks for the explanation.
>
> 10. It makes sense to me now. Instead of using a longer name, perhaps we
> could keep the current name, but make the description clear that it's the
> remaining segments for the current log assigned to a thread. Also, would it
> be better to use ToRecover instead of ToRecovery?
>
> Thanks,
>
> Jun
>
> On Fri, Jun 3, 2022 at 1:18 AM Luke Chen  wrote:
>
> > Hi Jun,
> >
> > > how do we implement kafka.log
> >
> >
> :type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
> > which tracks at the segment level?
> >
> > It looks like the name is misleading.
> > Suppose we have 2 log recovery threads
> > (num.recovery.threads.per.data.dir=2),
> > and 10 logs to iterate through
> > As mentioned before, we don't know how many segments in each log until
> the
> > log is iterated(loaded)
> > So, when thread-1 iterates logA, it gets the segments to recover, and
> > expose the number to `remainingSegmentsToRecovery` metric.
> > And the same for thread-2 iterating logB.
> > That is, the metric showed in `remainingSegmentsToRecovery` is actually
> the
> > remaining segments to recover "in a specific log".
> >
> > Maybe I should rename it: remainingSegmentsToRecovery ->
> > remainingSegmentsToRecoverInCurrentLog
> > WDYT?
> >
> > Thank you.
> > Luke
> >
> > On Fri, Jun 3, 2022 at 1:27 AM Jun Rao  wrote:
> >
> > > Hi, Luke,
> > >
> > > Thanks for the reply.
> > >
> > > 10. You are saying it's difficult to track the number of segments to
> > > recover. But how do we
> > > implement
> > >
> >
> kafka.log:type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
> > > which tracks at the segment level?
> > >
> > > Jun
> > >
> > > On Thu, Jun 2, 2022 at 3:39 AM Luke Chen  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thanks for the comment.
> > > >
> > > > Yes, I've tried to work on this way to track the number of remaining
> > > > segments, but it will change the design in UnifiedLog, so I only
> track
> > > the
> > > > logs number.
> > > > Currently, we will load all segments and recover those segments if
> > needed
> > > > "during creating UnifiedLog instance". And also get the log offsets
> > here
> > > > <
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/UnifiedLog.scala#L1819-L1842
> > > > >
> > > > .
> > > > That is, if we want to get all segments to be recovered before
> running
> > > log
> > > > recovery, we need to break the logic in UnifiedLog, to create a
> partial
> > > > UnifiedLog instance, and add more info to it later, which I think is
> > just
> > > > making the codes more complicated.
> > > >
> > > >
> > > > Thank you.
> > > > Luke
> > > >
> > > >
> > > >
> > > > On Thu, May 26, 2022 at 2:57 AM Jun Rao 
> > > wrote:
> > > >
> > > > > Hi, Luke,
> > > > >
> > > > > Thanks for the KIP. Just one comment.
> > > > >
> > > > > 10. For kafka.log:type=LogManager,name=remainingLogsToRecovery,
> could
> > > we
> > > > > instead track the number of remaining segments? This monitors the
> > > > progress
> > > > > at a finer granularity and is also consistent with the thread level
> > > > metric.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Wed, May 25, 2022 at 7:47 AM Tom Bentley 
> > > wrote:
> > > > >
> > > > > > Thanks Luke! LGTM.
> > > > > >
> > > > > > On Sun, 22 May 2022 at 05:18, Luke Chen 
> wrote:
> > > > > >
> > > > > > > Hi Tom and Raman,
> > > > > > >
> > > > > > > Thanks for your comments.
> > > > > > >
> > > > > > > > 1. There's not a JIRA for this KIP (or the JIRA link needs
> > > > updating).
> > > > > > > 2. Similarly the link to this discussion thread needs updating.
> > > > > > > > Please update the links to JIRA and the discussion thread.
> > > > > > >
> > > > > > > Yes, thanks for the reminder. I've updated the KIP.
> > > > > > >
> > > > > > > > 3. I wonder whether we need to keep these metrics (with value
> > 0)
> > > > once
> > > > > > the
> > > > > > > broker enters the running state. Do you see it as valuable? A
> > > benefit
> > > > > of
> > > > > > > removing the metrics would be a reduction on storage required
> for
> > > > > metric
> > > > > > > stores which are recording these metrics.
> > > > > > >
> > > > > > > Yes, removing the metrics after log recovery completed is a
> good
> > > > idea.
> > > > > > > Updated the KIP.
> > > > > > >
> > > > > > > > 4. I think the KIP's public interfaces section could be a bit
> > > > > clearer.
> > > > > > > Previous KIPs which added metrics usually used a table, with
> the
> > > > MBean
> > > > > > > name, 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-06-03 Thread Jun Rao
Hi, Luke,

Thanks for the explanation.

10. It makes sense to me now. Instead of using a longer name, perhaps we
could keep the current name, but make the description clear that it's the
remaining segments for the current log assigned to a thread. Also, would it
be better to use ToRecover instead of ToRecovery?

Thanks,

Jun

On Fri, Jun 3, 2022 at 1:18 AM Luke Chen  wrote:

> Hi Jun,
>
> > how do we implement kafka.log
>
> :type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
> which tracks at the segment level?
>
> It looks like the name is misleading.
> Suppose we have 2 log recovery threads
> (num.recovery.threads.per.data.dir=2),
> and 10 logs to iterate through
> As mentioned before, we don't know how many segments in each log until the
> log is iterated(loaded)
> So, when thread-1 iterates logA, it gets the segments to recover, and
> expose the number to `remainingSegmentsToRecovery` metric.
> And the same for thread-2 iterating logB.
> That is, the metric showed in `remainingSegmentsToRecovery` is actually the
> remaining segments to recover "in a specific log".
>
> Maybe I should rename it: remainingSegmentsToRecovery ->
> remainingSegmentsToRecoverInCurrentLog
> WDYT?
>
> Thank you.
> Luke
>
> On Fri, Jun 3, 2022 at 1:27 AM Jun Rao  wrote:
>
> > Hi, Luke,
> >
> > Thanks for the reply.
> >
> > 10. You are saying it's difficult to track the number of segments to
> > recover. But how do we
> > implement
> >
> kafka.log:type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
> > which tracks at the segment level?
> >
> > Jun
> >
> > On Thu, Jun 2, 2022 at 3:39 AM Luke Chen  wrote:
> >
> > > Hi Jun,
> > >
> > > Thanks for the comment.
> > >
> > > Yes, I've tried to work on this way to track the number of remaining
> > > segments, but it will change the design in UnifiedLog, so I only track
> > the
> > > logs number.
> > > Currently, we will load all segments and recover those segments if
> needed
> > > "during creating UnifiedLog instance". And also get the log offsets
> here
> > > <
> > >
> >
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/UnifiedLog.scala#L1819-L1842
> > > >
> > > .
> > > That is, if we want to get all segments to be recovered before running
> > log
> > > recovery, we need to break the logic in UnifiedLog, to create a partial
> > > UnifiedLog instance, and add more info to it later, which I think is
> just
> > > making the codes more complicated.
> > >
> > >
> > > Thank you.
> > > Luke
> > >
> > >
> > >
> > > On Thu, May 26, 2022 at 2:57 AM Jun Rao 
> > wrote:
> > >
> > > > Hi, Luke,
> > > >
> > > > Thanks for the KIP. Just one comment.
> > > >
> > > > 10. For kafka.log:type=LogManager,name=remainingLogsToRecovery, could
> > we
> > > > instead track the number of remaining segments? This monitors the
> > > progress
> > > > at a finer granularity and is also consistent with the thread level
> > > metric.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Wed, May 25, 2022 at 7:47 AM Tom Bentley 
> > wrote:
> > > >
> > > > > Thanks Luke! LGTM.
> > > > >
> > > > > On Sun, 22 May 2022 at 05:18, Luke Chen  wrote:
> > > > >
> > > > > > Hi Tom and Raman,
> > > > > >
> > > > > > Thanks for your comments.
> > > > > >
> > > > > > > 1. There's not a JIRA for this KIP (or the JIRA link needs
> > > updating).
> > > > > > 2. Similarly the link to this discussion thread needs updating.
> > > > > > > Please update the links to JIRA and the discussion thread.
> > > > > >
> > > > > > Yes, thanks for the reminder. I've updated the KIP.
> > > > > >
> > > > > > > 3. I wonder whether we need to keep these metrics (with value
> 0)
> > > once
> > > > > the
> > > > > > broker enters the running state. Do you see it as valuable? A
> > benefit
> > > > of
> > > > > > removing the metrics would be a reduction on storage required for
> > > > metric
> > > > > > stores which are recording these metrics.
> > > > > >
> > > > > > Yes, removing the metrics after log recovery completed is a good
> > > idea.
> > > > > > Updated the KIP.
> > > > > >
> > > > > > > 4. I think the KIP's public interfaces section could be a bit
> > > > clearer.
> > > > > > Previous KIPs which added metrics usually used a table, with the
> > > MBean
> > > > > > name, metric type and description. See KIP-551 for example (or
> > > KIP-748,
> > > > > > KIP-608). Similarly you could use a table in the proposed changes
> > > > section
> > > > > > rather than describing the tree you'd see in an MBean console.
> > > > > >
> > > > > > Good point! Updated the KIP to use a table to list the MBean
> name,
> > > > metric
> > > > > > type and descriptions.
> > > > > >
> > > > > >
> > > > > > Thank you.
> > > > > > Luke
> > > > > >
> > > > > > On Fri, May 20, 2022 at 9:13 AM Raman Verma
> > > >  > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Luke,
> > > > > > >
> > > > > > > The change is useful and simple. Thanks.
> > > > > > > Please 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-06-03 Thread Luke Chen
Hi Jun,

> how do we implement kafka.log
:type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
which tracks at the segment level?

It looks like the name is misleading.
Suppose we have 2 log recovery threads (num.recovery.threads.per.data.dir=2)
and 10 logs to iterate through.
As mentioned before, we don't know how many segments are in each log until
the log is iterated (loaded).
So, when thread-1 iterates logA, it gets the segments to recover and
exposes the number via the `remainingSegmentsToRecovery` metric.
The same goes for thread-2 iterating logB.
That is, the value shown in `remainingSegmentsToRecovery` is actually the
number of remaining segments to recover "in a specific log".

Maybe I should rename it: remainingSegmentsToRecovery ->
remainingSegmentsToRecoverInCurrentLog
WDYT?

Thank you.
Luke
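The per-thread behavior described above can be sketched as follows (the names `remaining_segments` and `recover_log` are invented for illustration; this is not Kafka's implementation):

```python
# Hypothetical per-thread gauge: threadNum -> remaining segments in the log
# that thread is currently recovering.
remaining_segments = {}

def recover_log(thread_num, segments):
    # The segment count is only known once this log has been loaded, so the
    # gauge is (re)set per log, then decremented segment by segment.
    remaining_segments[thread_num] = len(segments)
    for _segment in segments:
        # ... actual recovery work would happen here ...
        remaining_segments[thread_num] -= 1

# thread-1 recovers logA (3 segments), thread-2 recovers logB (5 segments)
recover_log(0, ["segA-%d" % i for i in range(3)])
recover_log(1, ["segB-%d" % i for i in range(5)])
print(remaining_segments)  # -> {0: 0, 1: 0}
```

The gauge is always scoped to the thread's *current* log, which is exactly why the original name read as if it tracked a global total.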

On Fri, Jun 3, 2022 at 1:27 AM Jun Rao  wrote:

> Hi, Luke,
>
> Thanks for the reply.
>
> 10. You are saying it's difficult to track the number of segments to
> recover. But how do we
> implement
> kafka.log:type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
> which tracks at the segment level?
>
> Jun
>
> On Thu, Jun 2, 2022 at 3:39 AM Luke Chen  wrote:
>
> > Hi Jun,
> >
> > Thanks for the comment.
> >
> > Yes, I've tried to work on this way to track the number of remaining
> > segments, but it will change the design in UnifiedLog, so I only track
> the
> > logs number.
> > Currently, we will load all segments and recover those segments if needed
> > "during creating UnifiedLog instance". And also get the log offsets here
> > <
> >
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/UnifiedLog.scala#L1819-L1842
> > >
> > .
> > That is, if we want to get all segments to be recovered before running
> log
> > recovery, we need to break the logic in UnifiedLog, to create a partial
> > UnifiedLog instance, and add more info to it later, which I think is just
> > making the codes more complicated.
> >
> >
> > Thank you.
> > Luke
> >
> >
> >
> > On Thu, May 26, 2022 at 2:57 AM Jun Rao 
> wrote:
> >
> > > Hi, Luke,
> > >
> > > Thanks for the KIP. Just one comment.
> > >
> > > 10. For kafka.log:type=LogManager,name=remainingLogsToRecovery, could
> we
> > > instead track the number of remaining segments? This monitors the
> > progress
> > > at a finer granularity and is also consistent with the thread level
> > metric.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, May 25, 2022 at 7:47 AM Tom Bentley 
> wrote:
> > >
> > > > Thanks Luke! LGTM.
> > > >
> > > > On Sun, 22 May 2022 at 05:18, Luke Chen  wrote:
> > > >
> > > > > Hi Tom and Raman,
> > > > >
> > > > > Thanks for your comments.
> > > > >
> > > > > > 1. There's not a JIRA for this KIP (or the JIRA link needs
> > updating).
> > > > > 2. Similarly the link to this discussion thread needs updating.
> > > > > > Please update the links to JIRA and the discussion thread.
> > > > >
> > > > > Yes, thanks for the reminder. I've updated the KIP.
> > > > >
> > > > > > 3. I wonder whether we need to keep these metrics (with value 0)
> > once
> > > > the
> > > > > broker enters the running state. Do you see it as valuable? A
> benefit
> > > of
> > > > > removing the metrics would be a reduction on storage required for
> > > metric
> > > > > stores which are recording these metrics.
> > > > >
> > > > > Yes, removing the metrics after log recovery completed is a good
> > idea.
> > > > > Updated the KIP.
> > > > >
> > > > > > 4. I think the KIP's public interfaces section could be a bit
> > > clearer.
> > > > > Previous KIPs which added metrics usually used a table, with the
> > MBean
> > > > > name, metric type and description. See KIP-551 for example (or
> > KIP-748,
> > > > > KIP-608). Similarly you could use a table in the proposed changes
> > > section
> > > > > rather than describing the tree you'd see in an MBean console.
> > > > >
> > > > > Good point! Updated the KIP to use a table to list the MBean name,
> > > metric
> > > > > type and descriptions.
> > > > >
> > > > >
> > > > > Thank you.
> > > > > Luke
> > > > >
> > > > > On Fri, May 20, 2022 at 9:13 AM Raman Verma
> > >  > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Luke,
> > > > > >
> > > > > > The change is useful and simple. Thanks.
> > > > > > Please update the links to JIRA and the discussion thread.
> > > > > >
> > > > > > Best Regards,
> > > > > > Raman Verma
> > > > > >
> > > > > > On Thu, May 19, 2022 at 8:57 AM Tom Bentley  >
> > > > wrote:
> > > > > > >
> > > > > > > Hi Luke,
> > > > > > >
> > > > > > > Thanks for the KIP. I think the idea makes sense and would
> > provide
> > > > > useful
> > > > > > > observability of log recovery. I have a few comments.
> > > > > > >
> > > > > > > 1. There's not a JIRA for this KIP (or the JIRA link needs
> > > updating).
> > > > > > > 2. Similarly the link to this discussion thread needs updating.
> > > > > > > 3. I wonder whether we need to keep these metrics (with value

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-06-02 Thread Jun Rao
Hi, Luke,

Thanks for the reply.

10. You are saying it's difficult to track the number of segments to
recover. But how do we
implement 
kafka.log:type=LogManager,name=remainingSegmentsToRecovery,dir=([-._\/\w\d\s]+),threadNum=([0-9]+),
which tracks at the segment level?

Jun

On Thu, Jun 2, 2022 at 3:39 AM Luke Chen  wrote:

> Hi Jun,
>
> Thanks for the comment.
>
> Yes, I've tried to work on this way to track the number of remaining
> segments, but it will change the design in UnifiedLog, so I only track the
> logs number.
> Currently, we will load all segments and recover those segments if needed
> "during creating UnifiedLog instance". And also get the log offsets here
> <
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/UnifiedLog.scala#L1819-L1842
> >
> .
> That is, if we want to get all segments to be recovered before running log
> recovery, we need to break the logic in UnifiedLog, to create a partial
> UnifiedLog instance, and add more info to it later, which I think is just
> making the codes more complicated.
>
>
> Thank you.
> Luke
>
>
>
> On Thu, May 26, 2022 at 2:57 AM Jun Rao  wrote:
>
> > Hi, Luke,
> >
> > Thanks for the KIP. Just one comment.
> >
> > 10. For kafka.log:type=LogManager,name=remainingLogsToRecovery, could we
> > instead track the number of remaining segments? This monitors the
> progress
> > at a finer granularity and is also consistent with the thread level
> metric.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, May 25, 2022 at 7:47 AM Tom Bentley  wrote:
> >
> > > Thanks Luke! LGTM.
> > >
> > > On Sun, 22 May 2022 at 05:18, Luke Chen  wrote:
> > >
> > > > Hi Tom and Raman,
> > > >
> > > > Thanks for your comments.
> > > >
> > > > > 1. There's not a JIRA for this KIP (or the JIRA link needs
> updating).
> > > > 2. Similarly the link to this discussion thread needs updating.
> > > > > Please update the links to JIRA and the discussion thread.
> > > >
> > > > Yes, thanks for the reminder. I've updated the KIP.
> > > >
> > > > > 3. I wonder whether we need to keep these metrics (with value 0)
> once
> > > the
> > > > broker enters the running state. Do you see it as valuable? A benefit
> > of
> > > > removing the metrics would be a reduction on storage required for
> > metric
> > > > stores which are recording these metrics.
> > > >
> > > > Yes, removing the metrics after log recovery completed is a good
> idea.
> > > > Updated the KIP.
> > > >
> > > > > 4. I think the KIP's public interfaces section could be a bit
> > clearer.
> > > > Previous KIPs which added metrics usually used a table, with the
> MBean
> > > > name, metric type and description. See KIP-551 for example (or
> KIP-748,
> > > > KIP-608). Similarly you could use a table in the proposed changes
> > section
> > > > rather than describing the tree you'd see in an MBean console.
> > > >
> > > > Good point! Updated the KIP to use a table to list the MBean name,
> > metric
> > > > type and descriptions.
> > > >
> > > >
> > > > Thank you.
> > > > Luke
> > > >
> > > > On Fri, May 20, 2022 at 9:13 AM Raman Verma
> >  > > >
> > > > wrote:
> > > >
> > > > > Hi Luke,
> > > > >
> > > > > The change is useful and simple. Thanks.
> > > > > Please update the links to JIRA and the discussion thread.
> > > > >
> > > > > Best Regards,
> > > > > Raman Verma
> > > > >
> > > > > On Thu, May 19, 2022 at 8:57 AM Tom Bentley 
> > > wrote:
> > > > > >
> > > > > > Hi Luke,
> > > > > >
> > > > > > Thanks for the KIP. I think the idea makes sense and would
> provide
> > > > useful
> > > > > > observability of log recovery. I have a few comments.
> > > > > >
> > > > > > 1. There's not a JIRA for this KIP (or the JIRA link needs
> > updating).
> > > > > > 2. Similarly the link to this discussion thread needs updating.
> > > > > > 3. I wonder whether we need to keep these metrics (with value 0)
> > once
> > > > the
> > > > > > broker enters the running state. Do you see it as valuable? A
> > benefit
> > > > of
> > > > > > removing the metrics would be a reduction on storage required for
> > > > metric
> > > > > > stores which are recording these metrics.
> > > > > > 4. I think the KIP's public interfaces section could be a bit
> > > clearer.
> > > > > > Previous KIPs which added metrics usually used a table, with the
> > > MBean
> > > > > > name, metric type and description. See KIP-551 for example (or
> > > KIP-748,
> > > > > > KIP-608). Similarly you could use a table in the proposed changes
> > > > section
> > > > > > rather than describing the tree you'd see in an MBean console.
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Wed, 11 May 2022 at 09:08, Luke Chen 
> wrote:
> > > > > >
> > > > > > > > And if people start using RemainingLogs and RemainingSegments
> > and
> > > > > then
> > > > > > > REALLY FEEL like they need RemainingBytes, then we can always
> add
> > > it
> > > > > in the
> > > > > > > future.
> > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-06-02 Thread Luke Chen
Hi Jun,

Thanks for the comment.

Yes, I've tried tracking the number of remaining segments this way, but it
would change the design of UnifiedLog, so I only track the number of logs.
Currently, we load all segments and recover them if needed "during creation
of the UnifiedLog instance", and we also get the log offsets here
<https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/UnifiedLog.scala#L1819-L1842>.
That is, if we wanted to know all the segments to be recovered before
running log recovery, we would need to break the logic in UnifiedLog to
create a partial UnifiedLog instance and add more info to it later, which I
think just makes the code more complicated.


Thank you.
Luke
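To illustrate the alternative being ruled out here: knowing the total segment count before recovery starts would mean pre-scanning every log directory for segment files before any UnifiedLog is constructed, roughly like this (a hypothetical sketch, not the Kafka code path):

```python
import os
import tempfile

def count_segments(log_dir):
    """Count *.log segment files under a log directory tree."""
    total = 0
    for _root, _dirs, files in os.walk(log_dir):
        total += sum(1 for f in files if f.endswith(".log"))
    return total

# Demo against a throwaway directory shaped like <log.dir>/<topic-partition>/
with tempfile.TemporaryDirectory() as log_dir:
    partition_dir = os.path.join(log_dir, "topic-0")
    os.makedirs(partition_dir)
    for name in ("00000000000000000000.log", "00000000000000000100.log"):
        open(os.path.join(partition_dir, name), "w").close()
    print(count_segments(log_dir))  # -> 2
```

Such a pre-scan duplicates work that UnifiedLog creation already does, which is the complication the message above is avoiding.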



On Thu, May 26, 2022 at 2:57 AM Jun Rao  wrote:

> Hi, Luke,
>
> Thanks for the KIP. Just one comment.
>
> 10. For kafka.log:type=LogManager,name=remainingLogsToRecovery, could we
> instead track the number of remaining segments? This monitors the progress
> at a finer granularity and is also consistent with the thread level metric.
>
> Thanks,
>
> Jun
>
> On Wed, May 25, 2022 at 7:47 AM Tom Bentley  wrote:
>
> > Thanks Luke! LGTM.
> >
> > On Sun, 22 May 2022 at 05:18, Luke Chen  wrote:
> >
> > > Hi Tom and Raman,
> > >
> > > Thanks for your comments.
> > >
> > > > 1. There's not a JIRA for this KIP (or the JIRA link needs updating).
> > > 2. Similarly the link to this discussion thread needs updating.
> > > > Please update the links to JIRA and the discussion thread.
> > >
> > > Yes, thanks for the reminder. I've updated the KIP.
> > >
> > > > 3. I wonder whether we need to keep these metrics (with value 0) once
> > the
> > > broker enters the running state. Do you see it as valuable? A benefit
> of
> > > removing the metrics would be a reduction on storage required for
> metric
> > > stores which are recording these metrics.
> > >
> > > Yes, removing the metrics after log recovery completed is a good idea.
> > > Updated the KIP.
> > >
> > > > 4. I think the KIP's public interfaces section could be a bit
> clearer.
> > > Previous KIPs which added metrics usually used a table, with the MBean
> > > name, metric type and description. See KIP-551 for example (or KIP-748,
> > > KIP-608). Similarly you could use a table in the proposed changes
> section
> > > rather than describing the tree you'd see in an MBean console.
> > >
> > > Good point! Updated the KIP to use a table to list the MBean name,
> metric
> > > type and descriptions.
> > >
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Fri, May 20, 2022 at 9:13 AM Raman Verma
>  > >
> > > wrote:
> > >
> > > > Hi Luke,
> > > >
> > > > The change is useful and simple. Thanks.
> > > > Please update the links to JIRA and the discussion thread.
> > > >
> > > > Best Regards,
> > > > Raman Verma
> > > >
> > > > On Thu, May 19, 2022 at 8:57 AM Tom Bentley 
> > wrote:
> > > > >
> > > > > Hi Luke,
> > > > >
> > > > > Thanks for the KIP. I think the idea makes sense and would provide
> > > useful
> > > > > observability of log recovery. I have a few comments.
> > > > >
> > > > > 1. There's not a JIRA for this KIP (or the JIRA link needs
> updating).
> > > > > 2. Similarly the link to this discussion thread needs updating.
> > > > > 3. I wonder whether we need to keep these metrics (with value 0)
> once
> > > the
> > > > > broker enters the running state. Do you see it as valuable? A
> benefit
> > > of
> > > > > removing the metrics would be a reduction on storage required for
> > > metric
> > > > > stores which are recording these metrics.
> > > > > 4. I think the KIP's public interfaces section could be a bit
> > clearer.
> > > > > Previous KIPs which added metrics usually used a table, with the
> > MBean
> > > > > name, metric type and description. See KIP-551 for example (or
> > KIP-748,
> > > > > KIP-608). Similarly you could use a table in the proposed changes
> > > section
> > > > > rather than describing the tree you'd see in an MBean console.
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Tom
> > > > >
> > > > > On Wed, 11 May 2022 at 09:08, Luke Chen  wrote:
> > > > >
> > > > > > > And if people start using RemainingLogs and RemainingSegments
> and
> > > > then
> > > > > > REALLY FEEL like they need RemainingBytes, then we can always add
> > it
> > > > in the
> > > > > > future.
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > Thanks James!
> > > > > > Luke
> > > > > >
> > > > > > On Wed, May 11, 2022 at 3:57 PM James Cheng <
> wushuja...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi Luke,
> > > > > > >
> > > > > > > Thanks for the detailed explanation. I agree that the current
> > > > proposal of
> > > > > > > RemainingLogs and RemainingSegments will greatly improve the
> > > > situation,
> > > > > > and
> > > > > > > that we can go ahead with the KIP as is.
> > > > > > >
> > > > > > > If RemainingBytes were straight-forward to implement, then I’d
> > like
> > > > to
> > > > > > > have it. But we can live without it for now. And if 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-25 Thread Jun Rao
Hi, Luke,

Thanks for the KIP. Just one comment.

10. For kafka.log:type=LogManager,name=remainingLogsToRecovery, could we
instead track the number of remaining segments? This monitors the progress
at a finer granularity and is also consistent with the thread level metric.

Thanks,

Jun

On Wed, May 25, 2022 at 7:47 AM Tom Bentley  wrote:

> Thanks Luke! LGTM.
>
> On Sun, 22 May 2022 at 05:18, Luke Chen  wrote:
>
> > Hi Tom and Raman,
> >
> > Thanks for your comments.
> >
> > > 1. There's not a JIRA for this KIP (or the JIRA link needs updating).
> > 2. Similarly the link to this discussion thread needs updating.
> > > Please update the links to JIRA and the discussion thread.
> >
> > Yes, thanks for the reminder. I've updated the KIP.
> >
> > > 3. I wonder whether we need to keep these metrics (with value 0) once
> the
> > broker enters the running state. Do you see it as valuable? A benefit of
> > removing the metrics would be a reduction on storage required for metric
> > stores which are recording these metrics.
> >
> > Yes, removing the metrics after log recovery completed is a good idea.
> > Updated the KIP.
> >
> > > 4. I think the KIP's public interfaces section could be a bit clearer.
> > Previous KIPs which added metrics usually used a table, with the MBean
> > name, metric type and description. See KIP-551 for example (or KIP-748,
> > KIP-608). Similarly you could use a table in the proposed changes section
> > rather than describing the tree you'd see in an MBean console.
> >
> > Good point! Updated the KIP to use a table to list the MBean name, metric
> > type and descriptions.
> >
> >
> > Thank you.
> > Luke
> >
> > On Fri, May 20, 2022 at 9:13 AM Raman Verma  >
> > wrote:
> >
> > > Hi Luke,
> > >
> > > The change is useful and simple. Thanks.
> > > Please update the links to JIRA and the discussion thread.
> > >
> > > Best Regards,
> > > Raman Verma
> > >
> > > On Thu, May 19, 2022 at 8:57 AM Tom Bentley 
> wrote:
> > > >
> > > > Hi Luke,
> > > >
> > > > Thanks for the KIP. I think the idea makes sense and would provide
> > useful
> > > > observability of log recovery. I have a few comments.
> > > >
> > > > 1. There's not a JIRA for this KIP (or the JIRA link needs updating).
> > > > 2. Similarly the link to this discussion thread needs updating.
> > > > 3. I wonder whether we need to keep these metrics (with value 0) once
> > the
> > > > broker enters the running state. Do you see it as valuable? A benefit
> > of
> > > > removing the metrics would be a reduction in storage required for
> > metric
> > > > stores which are recording these metrics.
> > > > 4. I think the KIP's public interfaces section could be a bit
> clearer.
> > > > Previous KIPs which added metrics usually used a table, with the
> MBean
> > > > name, metric type and description. See KIP-551 for example (or
> KIP-748,
> > > > KIP-608). Similarly you could use a table in the proposed changes
> > section
> > > > rather than describing the tree you'd see in an MBean console.
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > > On Wed, 11 May 2022 at 09:08, Luke Chen  wrote:
> > > >
> > > > > > And if people start using RemainingLogs and RemainingSegments and
> > > then
> > > > > REALLY FEEL like they need RemainingBytes, then we can always add
> it
> > > in the
> > > > > future.
> > > > >
> > > > > +1
> > > > >
> > > > > Thanks James!
> > > > > Luke
> > > > >
> > > > > On Wed, May 11, 2022 at 3:57 PM James Cheng 
> > > wrote:
> > > > >
> > > > > > Hi Luke,
> > > > > >
> > > > > > Thanks for the detailed explanation. I agree that the current
> > > proposal of
> > > > > > RemainingLogs and RemainingSegments will greatly improve the
> > > situation,
> > > > > and
> > > > > > that we can go ahead with the KIP as is.
> > > > > >
> > > > > > If RemainingBytes were straight-forward to implement, then I’d
> like
> > > to
> > > > > > have it. But we can live without it for now. And if people start
> > > using
> > > > > > RemainingLogs and RemainingSegments and then REALLY FEEL like
> they
> > > need
> > > > > > RemainingBytes, then we can always add it in the future.
> > > > > >
> > > > > > Thanks Luke, for the detailed explanation, and for responding to
> my
> > > > > > feedback!
> > > > > >
> > > > > > -James
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > > > On May 10, 2022, at 6:48 AM, Luke Chen 
> > wrote:
> > > > > > >
> > > > > > > Hi James and all,
> > > > > > >
> > > > > > > I checked again and I can see that when creating a UnifiedLog, we
> > > > > > > expect the logs/indexes/snapshots to be in a good state.
> > > > > > > So, I don't think we should break the current design to expose
> > the
> > > > > > > `RemainingBytesToRecovery`
> > > > > > > metric.
> > > > > > >
> > > > > > > If there are no other comments, I'll start a vote within this
> > week.
> > > > > > >
> > > > > > > Thank you.
> > > > > > > Luke
> > > > > > >
> > > > > > >> On Fri, May 6, 2022 at 6:00 PM Luke Chen 
> > > wrote:
> > > > 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-25 Thread Tom Bentley
Thanks Luke! LGTM.

On Sun, 22 May 2022 at 05:18, Luke Chen  wrote:

> Hi Tom and Raman,
>
> Thanks for your comments.
>
> > 1. There's not a JIRA for this KIP (or the JIRA link needs updating).
> 2. Similarly the link to this discussion thread needs updating.
> > Please update the links to JIRA and the discussion thread.
>
> Yes, thanks for the reminder. I've updated the KIP.
>
> > 3. I wonder whether we need to keep these metrics (with value 0) once the
> broker enters the running state. Do you see it as valuable? A benefit of
> removing the metrics would be a reduction in storage required for metric
> stores which are recording these metrics.
>
> Yes, removing the metrics after log recovery completed is a good idea.
> Updated the KIP.
>
> > 4. I think the KIP's public interfaces section could be a bit clearer.
> Previous KIPs which added metrics usually used a table, with the MBean
> name, metric type and description. See KIP-551 for example (or KIP-748,
> KIP-608). Similarly you could use a table in the proposed changes section
> rather than describing the tree you'd see in an MBean console.
>
> Good point! Updated the KIP to use a table to list the MBean name, metric
> type and descriptions.
>
>
> Thank you.
> Luke
>
> On Fri, May 20, 2022 at 9:13 AM Raman Verma 
> wrote:
>
> > Hi Luke,
> >
> > The change is useful and simple. Thanks.
> > Please update the links to JIRA and the discussion thread.
> >
> > Best Regards,
> > Raman Verma
> >
> > On Thu, May 19, 2022 at 8:57 AM Tom Bentley  wrote:
> > >
> > > Hi Luke,
> > >
> > > Thanks for the KIP. I think the idea makes sense and would provide
> useful
> > > observability of log recovery. I have a few comments.
> > >
> > > 1. There's not a JIRA for this KIP (or the JIRA link needs updating).
> > > 2. Similarly the link to this discussion thread needs updating.
> > > 3. I wonder whether we need to keep these metrics (with value 0) once
> the
> > > broker enters the running state. Do you see it as valuable? A benefit
> of
> > > removing the metrics would be a reduction in storage required for
> metric
> > > stores which are recording these metrics.
> > > 4. I think the KIP's public interfaces section could be a bit clearer.
> > > Previous KIPs which added metrics usually used a table, with the MBean
> > > name, metric type and description. See KIP-551 for example (or KIP-748,
> > > KIP-608). Similarly you could use a table in the proposed changes
> section
> > > rather than describing the tree you'd see in an MBean console.
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> > > On Wed, 11 May 2022 at 09:08, Luke Chen  wrote:
> > >
> > > > > And if people start using RemainingLogs and RemainingSegments and
> > then
> > > > REALLY FEEL like they need RemainingBytes, then we can always add it
> > in the
> > > > future.
> > > >
> > > > +1
> > > >
> > > > Thanks James!
> > > > Luke
> > > >
> > > > On Wed, May 11, 2022 at 3:57 PM James Cheng 
> > wrote:
> > > >
> > > > > Hi Luke,
> > > > >
> > > > > Thanks for the detailed explanation. I agree that the current
> > proposal of
> > > > > RemainingLogs and RemainingSegments will greatly improve the
> > situation,
> > > > and
> > > > > that we can go ahead with the KIP as is.
> > > > >
> > > > > If RemainingBytes were straight-forward to implement, then I’d like
> > to
> > > > > have it. But we can live without it for now. And if people start
> > using
> > > > > RemainingLogs and RemainingSegments and then REALLY FEEL like they
> > need
> > > > > RemainingBytes, then we can always add it in the future.
> > > > >
> > > > > Thanks Luke, for the detailed explanation, and for responding to my
> > > > > feedback!
> > > > >
> > > > > -James
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On May 10, 2022, at 6:48 AM, Luke Chen 
> wrote:
> > > > > >
> > > > > > Hi James and all,
> > > > > >
> > > > > > I checked again and I can see that when creating a UnifiedLog, we
> > > > > > expect the logs/indexes/snapshots to be in a good state.
> > > > > > So, I don't think we should break the current design to expose
> the
> > > > > > `RemainingBytesToRecovery`
> > > > > > metric.
> > > > > >
> > > > > > If there are no other comments, I'll start a vote within this
> week.
> > > > > >
> > > > > > Thank you.
> > > > > > Luke
> > > > > >
> > > > > >> On Fri, May 6, 2022 at 6:00 PM Luke Chen 
> > wrote:
> > > > > >>
> > > > > >> Hi James,
> > > > > >>
> > > > > >> Thanks for your input.
> > > > > >>
> > > > > >> For the `RemainingBytesToRecovery` metric proposal, I think there's
> > > > > >> one thing I didn't make clear.
> > > > > >> Currently, when the log manager starts up, we'll try to load all logs
> > > > > >> (segments), and during the log loading, we'll try to recover
> logs
> > if
> > > > > >> necessary.
> > > > > >> And the log loading does use a "thread pool", as you thought.
> > > > > >>
> > > > > >> So, here's the problem:
> > > > > >> All segments in each log folder (partition) will be loaded in
> > each 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-21 Thread Luke Chen
Hi Tom and Raman,

Thanks for your comments.

> 1. There's not a JIRA for this KIP (or the JIRA link needs updating).
2. Similarly the link to this discussion thread needs updating.
> Please update the links to JIRA and the discussion thread.

Yes, thanks for the reminder. I've updated the KIP.

> 3. I wonder whether we need to keep these metrics (with value 0) once the
broker enters the running state. Do you see it as valuable? A benefit of
removing the metrics would be a reduction in storage required for metric
stores which are recording these metrics.

Yes, removing the metrics after log recovery completed is a good idea.
Updated the KIP.
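
A minimal sketch of that register-then-remove lifecycle (illustrative only: the dict stands in for the JMX MBean server, the helper names are invented, and the ObjectName string follows the KIP's proposed pattern):

```python
# Illustrative only: a dict standing in for the MBean server, with gauges
# registered for the recovery phase and removed once the broker is running.

mbean_registry = {}

def register_recovery_gauges(log_dirs):
    for d in log_dirs:
        name = f"kafka.log:type=LogManager,name=remainingLogsToRecover,dir={d}"
        mbean_registry[name] = 0        # gauge value updated during recovery

def remove_recovery_gauges(log_dirs):
    # Dropping the gauges after recovery avoids metric stores scraping
    # permanent zeros for the lifetime of the broker.
    for d in log_dirs:
        name = f"kafka.log:type=LogManager,name=remainingLogsToRecover,dir={d}"
        mbean_registry.pop(name, None)

register_recovery_gauges(["/tmp/log1", "/tmp/log2"])
gauges_during_recovery = len(mbean_registry)
remove_recovery_gauges(["/tmp/log1", "/tmp/log2"])
```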

> 4. I think the KIP's public interfaces section could be a bit clearer.
Previous KIPs which added metrics usually used a table, with the MBean
name, metric type and description. See KIP-551 for example (or KIP-748,
KIP-608). Similarly you could use a table in the proposed changes section
rather than describing the tree you'd see in an MBean console.

Good point! Updated the KIP to use a table to list the MBean name, metric
type and descriptions.


Thank you.
Luke

On Fri, May 20, 2022 at 9:13 AM Raman Verma 
wrote:

> Hi Luke,
>
> The change is useful and simple. Thanks.
> Please update the links to JIRA and the discussion thread.
>
> Best Regards,
> Raman Verma
>
> On Thu, May 19, 2022 at 8:57 AM Tom Bentley  wrote:
> >
> > Hi Luke,
> >
> > Thanks for the KIP. I think the idea makes sense and would provide useful
> > observability of log recovery. I have a few comments.
> >
> > 1. There's not a JIRA for this KIP (or the JIRA link needs updating).
> > 2. Similarly the link to this discussion thread needs updating.
> > 3. I wonder whether we need to keep these metrics (with value 0) once the
> > broker enters the running state. Do you see it as valuable? A benefit of
> > removing the metrics would be a reduction in storage required for metric
> > stores which are recording these metrics.
> > 4. I think the KIP's public interfaces section could be a bit clearer.
> > Previous KIPs which added metrics usually used a table, with the MBean
> > name, metric type and description. See KIP-551 for example (or KIP-748,
> > KIP-608). Similarly you could use a table in the proposed changes section
> > rather than describing the tree you'd see in an MBean console.
> >
> > Kind regards,
> >
> > Tom
> >
> > On Wed, 11 May 2022 at 09:08, Luke Chen  wrote:
> >
> > > > And if people start using RemainingLogs and RemainingSegments and
> then
> > > REALLY FEEL like they need RemainingBytes, then we can always add it
> in the
> > > future.
> > >
> > > +1
> > >
> > > Thanks James!
> > > Luke
> > >
> > > On Wed, May 11, 2022 at 3:57 PM James Cheng 
> wrote:
> > >
> > > > Hi Luke,
> > > >
> > > > Thanks for the detailed explanation. I agree that the current
> proposal of
> > > > RemainingLogs and RemainingSegments will greatly improve the
> situation,
> > > and
> > > > that we can go ahead with the KIP as is.
> > > >
> > > > If RemainingBytes were straight-forward to implement, then I’d like
> to
> > > > have it. But we can live without it for now. And if people start
> using
> > > > RemainingLogs and RemainingSegments and then REALLY FEEL like they
> need
> > > > RemainingBytes, then we can always add it in the future.
> > > >
> > > > Thanks Luke, for the detailed explanation, and for responding to my
> > > > feedback!
> > > >
> > > > -James
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On May 10, 2022, at 6:48 AM, Luke Chen  wrote:
> > > > >
> > > > > Hi James and all,
> > > > >
> > > > > I checked again and I can see that when creating a UnifiedLog, we
> > > > > expect the logs/indexes/snapshots to be in a good state.
> > > > > So, I don't think we should break the current design to expose the
> > > > > `RemainingBytesToRecovery`
> > > > > metric.
> > > > >
> > > > > If there are no other comments, I'll start a vote within this week.
> > > > >
> > > > > Thank you.
> > > > > Luke
> > > > >
> > > > >> On Fri, May 6, 2022 at 6:00 PM Luke Chen 
> wrote:
> > > > >>
> > > > >> Hi James,
> > > > >>
> > > > >> Thanks for your input.
> > > > >>
> > > > >> For the `RemainingBytesToRecovery` metric proposal, I think there's
> > > > >> one thing I didn't make clear.
> > > > >> Currently, when the log manager starts up, we'll try to load all logs
> > > > >> (segments), and during the log loading, we'll try to recover logs
> if
> > > > >> necessary.
> > > > >> And the log loading does use a "thread pool", as you thought.
> > > > >>
> > > > >> So, here's the problem:
> > > > >> All segments in each log folder (partition) will be loaded in each log
> > > > >> recovery thread, and not until they are loaded can we know how many
> > > > >> segments (or how many bytes) need to be recovered.
> > > > >> That means, if we have 10 partition logs in one broker, and we
> have 2
> > > > log
> > > > >> recovery threads (num.recovery.threads.per.data.dir=2), before the
> > > > >> threads load the segments in 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-19 Thread Raman Verma
Hi Luke,

The change is useful and simple. Thanks.
Please update the links to JIRA and the discussion thread.

Best Regards,
Raman Verma

On Thu, May 19, 2022 at 8:57 AM Tom Bentley  wrote:
>
> Hi Luke,
>
> Thanks for the KIP. I think the idea makes sense and would provide useful
> observability of log recovery. I have a few comments.
>
> 1. There's not a JIRA for this KIP (or the JIRA link needs updating).
> 2. Similarly the link to this discussion thread needs updating.
> 3. I wonder whether we need to keep these metrics (with value 0) once the
> broker enters the running state. Do you see it as valuable? A benefit of
> removing the metrics would be a reduction in storage required for metric
> stores which are recording these metrics.
> 4. I think the KIP's public interfaces section could be a bit clearer.
> Previous KIPs which added metrics usually used a table, with the MBean
> name, metric type and description. See KIP-551 for example (or KIP-748,
> KIP-608). Similarly you could use a table in the proposed changes section
> rather than describing the tree you'd see in an MBean console.
>
> Kind regards,
>
> Tom
>
> On Wed, 11 May 2022 at 09:08, Luke Chen  wrote:
>
> > > And if people start using RemainingLogs and RemainingSegments and then
> > REALLY FEEL like they need RemainingBytes, then we can always add it in the
> > future.
> >
> > +1
> >
> > Thanks James!
> > Luke
> >
> > On Wed, May 11, 2022 at 3:57 PM James Cheng  wrote:
> >
> > > Hi Luke,
> > >
> > > Thanks for the detailed explanation. I agree that the current proposal of
> > > RemainingLogs and RemainingSegments will greatly improve the situation,
> > and
> > > that we can go ahead with the KIP as is.
> > >
> > > If RemainingBytes were straight-forward to implement, then I’d like to
> > > have it. But we can live without it for now. And if people start using
> > > RemainingLogs and RemainingSegments and then REALLY FEEL like they need
> > > RemainingBytes, then we can always add it in the future.
> > >
> > > Thanks Luke, for the detailed explanation, and for responding to my
> > > feedback!
> > >
> > > -James
> > >
> > > Sent from my iPhone
> > >
> > > > On May 10, 2022, at 6:48 AM, Luke Chen  wrote:
> > > >
> > > > Hi James and all,
> > > >
> > > > I checked again and I can see that when creating a UnifiedLog, we expect
> > > > the logs/indexes/snapshots to be in a good state.
> > > > So, I don't think we should break the current design to expose the
> > > > `RemainingBytesToRecovery`
> > > > metric.
> > > >
> > > > If there are no other comments, I'll start a vote within this week.
> > > >
> > > > Thank you.
> > > > Luke
> > > >
> > > >> On Fri, May 6, 2022 at 6:00 PM Luke Chen  wrote:
> > > >>
> > > >> Hi James,
> > > >>
> > > >> Thanks for your input.
> > > >>
> > > >> For the `RemainingBytesToRecovery` metric proposal, I think there's
> > > >> one thing I didn't make clear.
> > > >> Currently, when the log manager starts up, we'll try to load all logs
> > > >> (segments), and during the log loading, we'll try to recover logs if
> > > >> necessary.
> > > >> And the log loading does use a "thread pool", as you thought.
> > > >>
> > > >> So, here's the problem:
> > > >> All segments in each log folder (partition) will be loaded in each log
> > > >> recovery thread, and not until they are loaded can we know how many
> > > >> segments (or how many bytes) need to be recovered.
> > > >> That means, if we have 10 partition logs in one broker, and we have 2
> > > log
> > > >> recovery threads (num.recovery.threads.per.data.dir=2), before the
> > > >> threads load the segments in each log, we only know how many logs
> > > >> (partitions) we have in the broker (i.e. RemainingLogsToRecover
> > metric).
> > > >> We cannot know how many segments/bytes need to be recovered until each
> > > >> thread starts to load the segments under one log (partition).
> > > >>
> > > >> So, the example in the KIP shows:
> > > >> Currently, there are still 5 logs (partitions) needed to recover under
> > > >> the /tmp/log1 dir. And there are 2 threads doing the job, where one thread
> > > >> has 1 segment needed to recover, and the other one has 3 segments needed
> > > >> to recover.
> > > >>
> > > >>   - kafka.log
> > > >>  - LogManager
> > > >> - RemainingLogsToRecover
> > > >>- /tmp/log1 => 5← there are 5 logs under
> > > >>/tmp/log1 needed to be recovered
> > > >>- /tmp/log2 => 0
> > > >> - RemainingSegmentsToRecover
> > > >>- /tmp/log1 ← 2 threads are doing log
> > > >>recovery for /tmp/log1
> > > >>- 0 => 1 ← there is 1 segment needed to be
> > > >>   recovered for thread 0
> > > >>   - 1 => 3
> > > >>   - /tmp/log2
> > > >>   - 0 => 0
> > > >>   - 1 => 0
> > > >>
> > > >>
> > > >> So, after a while, the metrics might look like this:
> > > >> It 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-19 Thread Tom Bentley
Hi Luke,

Thanks for the KIP. I think the idea makes sense and would provide useful
observability of log recovery. I have a few comments.

1. There's not a JIRA for this KIP (or the JIRA link needs updating).
2. Similarly the link to this discussion thread needs updating.
3. I wonder whether we need to keep these metrics (with value 0) once the
broker enters the running state. Do you see it as valuable? A benefit of
removing the metrics would be a reduction in storage required for metric
stores which are recording these metrics.
4. I think the KIP's public interfaces section could be a bit clearer.
Previous KIPs which added metrics usually used a table, with the MBean
name, metric type and description. See KIP-551 for example (or KIP-748,
KIP-608). Similarly you could use a table in the proposed changes section
rather than describing the tree you'd see in an MBean console.

Kind regards,

Tom

On Wed, 11 May 2022 at 09:08, Luke Chen  wrote:

> > And if people start using RemainingLogs and RemainingSegments and then
> REALLY FEEL like they need RemainingBytes, then we can always add it in the
> future.
>
> +1
>
> Thanks James!
> Luke
>
> On Wed, May 11, 2022 at 3:57 PM James Cheng  wrote:
>
> > Hi Luke,
> >
> > Thanks for the detailed explanation. I agree that the current proposal of
> > RemainingLogs and RemainingSegments will greatly improve the situation,
> and
> > that we can go ahead with the KIP as is.
> >
> > If RemainingBytes were straight-forward to implement, then I’d like to
> > have it. But we can live without it for now. And if people start using
> > RemainingLogs and RemainingSegments and then REALLY FEEL like they need
> > RemainingBytes, then we can always add it in the future.
> >
> > Thanks Luke, for the detailed explanation, and for responding to my
> > feedback!
> >
> > -James
> >
> > Sent from my iPhone
> >
> > > On May 10, 2022, at 6:48 AM, Luke Chen  wrote:
> > >
> > > Hi James and all,
> > >
> > > I checked again and I can see that when creating a UnifiedLog, we expect
> > > the logs/indexes/snapshots to be in a good state.
> > > So, I don't think we should break the current design to expose the
> > > `RemainingBytesToRecovery`
> > > metric.
> > >
> > > If there are no other comments, I'll start a vote within this week.
> > >
> > > Thank you.
> > > Luke
> > >
> > >> On Fri, May 6, 2022 at 6:00 PM Luke Chen  wrote:
> > >>
> > >> Hi James,
> > >>
> > >> Thanks for your input.
> > >>
> > >> For the `RemainingBytesToRecovery` metric proposal, I think there's
> > >> one thing I didn't make clear.
> > >> Currently, when the log manager starts up, we'll try to load all logs
> > >> (segments), and during the log loading, we'll try to recover logs if
> > >> necessary.
> > >> And the log loading does use a "thread pool", as you thought.
> > >>
> > >> So, here's the problem:
> > >> All segments in each log folder (partition) will be loaded in each log
> > >> recovery thread, and not until they are loaded can we know how many
> > >> segments (or how many bytes) need to be recovered.
> > >> That means, if we have 10 partition logs in one broker, and we have 2
> > log
> > >> recovery threads (num.recovery.threads.per.data.dir=2), before the
> > >> threads load the segments in each log, we only know how many logs
> > >> (partitions) we have in the broker (i.e. RemainingLogsToRecover
> metric).
> > >> We cannot know how many segments/bytes need to be recovered until each
> > >> thread starts to load the segments under one log (partition).
> > >>
> > >> So, the example in the KIP shows:
> > >> Currently, there are still 5 logs (partitions) needed to recover under
> > >> the /tmp/log1 dir. And there are 2 threads doing the job, where one thread
> > >> has 1 segment needed to recover, and the other one has 3 segments needed
> > >> to recover.
> > >>
> > >>   - kafka.log
> > >>  - LogManager
> > >> - RemainingLogsToRecover
> > >>- /tmp/log1 => 5← there are 5 logs under
> > >>/tmp/log1 needed to be recovered
> > >>- /tmp/log2 => 0
> > >> - RemainingSegmentsToRecover
> > >>- /tmp/log1 ← 2 threads are doing log
> > >>recovery for /tmp/log1
> > >>- 0 => 1 ← there is 1 segment needed to be
> > >>   recovered for thread 0
> > >>   - 1 => 3
> > >>   - /tmp/log2
> > >>   - 0 => 0
> > >>   - 1 => 0
> > >>
> > >>
> > >> So, after a while, the metrics might look like this:
> > >> It says that now there are only 3 logs needed to recover in /tmp/log1, and
> > >> thread 0 has 9000 segments left, and thread 1 has 5 segments left
> > >> (which implies the threads already completed 2 log recoveries in the
> > >> period)
> > >>
> > >>   - kafka.log
> > >>  - LogManager
> > >> - RemainingLogsToRecover
> > >>- /tmp/log1 => 3← there are 3 logs under
> > >>/tmp/log1 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-11 Thread Luke Chen
> And if people start using RemainingLogs and RemainingSegments and then
REALLY FEEL like they need RemainingBytes, then we can always add it in the
future.

+1

Thanks James!
Luke

On Wed, May 11, 2022 at 3:57 PM James Cheng  wrote:

> Hi Luke,
>
> Thanks for the detailed explanation. I agree that the current proposal of
> RemainingLogs and RemainingSegments will greatly improve the situation, and
> that we can go ahead with the KIP as is.
>
> If RemainingBytes were straight-forward to implement, then I’d like to
> have it. But we can live without it for now. And if people start using
> RemainingLogs and RemainingSegments and then REALLY FEEL like they need
> RemainingBytes, then we can always add it in the future.
>
> Thanks Luke, for the detailed explanation, and for responding to my
> feedback!
>
> -James
>
> Sent from my iPhone
>
> > On May 10, 2022, at 6:48 AM, Luke Chen  wrote:
> >
> > Hi James and all,
> >
> > I checked again and I can see that when creating a UnifiedLog, we expect
> > the logs/indexes/snapshots to be in a good state.
> > So, I don't think we should break the current design to expose the
> > `RemainingBytesToRecovery`
> > metric.
> >
> > If there are no other comments, I'll start a vote within this week.
> >
> > Thank you.
> > Luke
> >
> >> On Fri, May 6, 2022 at 6:00 PM Luke Chen  wrote:
> >>
> >> Hi James,
> >>
> >> Thanks for your input.
> >>
> >> For the `RemainingBytesToRecovery` metric proposal, I think there's one
> >> thing I didn't make clear.
> >> Currently, when the log manager starts up, we'll try to load all logs
> >> (segments), and during the log loading, we'll try to recover logs if
> >> necessary.
> >> And the log loading does use a "thread pool", as you thought.
> >>
> >> So, here's the problem:
> >> All segments in each log folder (partition) will be loaded in each log
> >> recovery thread, and not until they are loaded can we know how many
> >> segments (or how many bytes) need to be recovered.
> >> That means, if we have 10 partition logs in one broker, and we have 2
> log
> >> recovery threads (num.recovery.threads.per.data.dir=2), before the
> >> threads load the segments in each log, we only know how many logs
> >> (partitions) we have in the broker (i.e. RemainingLogsToRecover metric).
> >> We cannot know how many segments/bytes need to be recovered until each
> >> thread starts to load the segments under one log (partition).
> >>
> >> So, the example in the KIP shows:
> >> Currently, there are still 5 logs (partitions) needed to recover under
> >> the /tmp/log1 dir. And there are 2 threads doing the job, where one thread
> >> has 1 segment needed to recover, and the other one has 3 segments needed
> >> to recover.
> >>
> >>   - kafka.log
> >>  - LogManager
> >> - RemainingLogsToRecover
> >>- /tmp/log1 => 5← there are 5 logs under
> >>/tmp/log1 needed to be recovered
> >>- /tmp/log2 => 0
> >> - RemainingSegmentsToRecover
> >>- /tmp/log1 ← 2 threads are doing log
> >>recovery for /tmp/log1
> >>- 0 => 1 ← there is 1 segment needed to be
> >>   recovered for thread 0
> >>   - 1 => 3
> >>   - /tmp/log2
> >>   - 0 => 0
> >>   - 1 => 0
> >>
> >>
> >> So, after a while, the metrics might look like this:
> >> It says that now there are only 3 logs needed to recover in /tmp/log1, and
> >> thread 0 has 9000 segments left, and thread 1 has 5 segments left
> >> (which implies the threads already completed 2 log recoveries in the
> >> period)
> >>
> >>   - kafka.log
> >>  - LogManager
> >> - RemainingLogsToRecover
> >>- /tmp/log1 => 3← there are 3 logs under
> >>/tmp/log1 needed to be recovered
> >>- /tmp/log2 => 0
> >> - RemainingSegmentsToRecover
> >>- /tmp/log1 ← 2 threads are doing log
> >>recovery for /tmp/log1
> >>- 0 => 9000 ← there are 9000 segments needed to be
> >>   recovered for thread 0
> >>   - 1 => 5
> >>   - /tmp/log2
> >>   - 0 => 0
> >>   - 1 => 0
> >>
> >>
> >> That said, the `RemainingBytesToRecovery` metric is difficult to achieve
> >> as you expected. I think the current proposal with
> `RemainingLogsToRecover`
> >> and `RemainingSegmentsToRecover` should already provide enough info for
> >> the log recovery progress.
> >>
> >> I've also updated the KIP example to make it clear.
> >>
> >>
> >> Thank you.
> >> Luke
> >>
> >>
> >>> On Thu, May 5, 2022 at 3:31 AM James Cheng 
> wrote:
> >>>
> >>> Hi Luke,
> >>>
> >>> Thanks for adding RemainingSegmentsToRecovery.
> >>>
> >>> Another thought: different topics can have different segment sizes. I
> >>> don't know how common it is, but it is possible. Some topics might want
> >>> small segment sizes to more granular 

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-11 Thread James Cheng
Hi Luke,

Thanks for the detailed explanation. I agree that the current proposal of 
RemainingLogs and RemainingSegments will greatly improve the situation, and 
that we can go ahead with the KIP as is.

If RemainingBytes were straight-forward to implement, then I’d like to have it. 
But we can live without it for now. And if people start using RemainingLogs and 
RemainingSegments and then REALLY FEEL like they need RemainingBytes, then we 
can always add it in the future.

Thanks Luke, for the detailed explanation, and for responding to my feedback!

-James

Sent from my iPhone

> On May 10, 2022, at 6:48 AM, Luke Chen  wrote:
> 
> Hi James and all,
> 
> I checked again and I can see that when creating a UnifiedLog, we expect
> the logs/indexes/snapshots to be in a good state.
> So, I don't think we should break the current design to expose the
> `RemainingBytesToRecovery`
> metric.
> 
> If there are no other comments, I'll start a vote within this week.
> 
> Thank you.
> Luke
> 
>> On Fri, May 6, 2022 at 6:00 PM Luke Chen  wrote:
>> 
>> Hi James,
>> 
>> Thanks for your input.
>> 
>> For the `RemainingBytesToRecovery` metric proposal, I think there's one
>> thing I didn't make clear.
>> Currently, when the log manager starts up, we'll try to load all logs
>> (segments), and during the log loading, we'll try to recover logs if
>> necessary.
>> And the log loading does use a "thread pool", as you thought.
>> 
>> So, here's the problem:
>> All segments in each log folder (partition) will be loaded in each log
>> recovery thread, and not until they are loaded can we know how many segments
>> (or how many bytes) need to be recovered.
>> That means, if we have 10 partition logs in one broker, and we have 2 log
>> recovery threads (num.recovery.threads.per.data.dir=2), before the
>> threads load the segments in each log, we only know how many logs
>> (partitions) we have in the broker (i.e. RemainingLogsToRecover metric).
>> We cannot know how many segments/bytes need to be recovered until each thread
>> starts to load the segments under one log (partition).
>> 
>> So, the example in the KIP shows:
>> Currently, there are still 5 logs (partitions) needed to recover under
>> the /tmp/log1 dir. And there are 2 threads doing the job, where one thread
>> has 1 segment needed to recover, and the other one has 3 segments needed
>> to recover.
>> 
>>   - kafka.log
>>  - LogManager
>> - RemainingLogsToRecover
>>- /tmp/log1 => 5← there are 5 logs under
>>/tmp/log1 needed to be recovered
>>- /tmp/log2 => 0
>> - RemainingSegmentsToRecover
>>- /tmp/log1 ← 2 threads are doing log
>>recovery for /tmp/log1
>>- 0 => 1 ← there is 1 segment needed to be
>>   recovered for thread 0
>>   - 1 => 3
>>   - /tmp/log2
>>   - 0 => 0
>>   - 1 => 0
>> 
>> 
>> So, after a while, the metrics might look like this:
>> It says that now there are only 3 logs needed to recover in /tmp/log1, and
>> thread 0 has 9000 segments left, and thread 1 has 5 segments left
>> (which implies the threads already completed 2 log recoveries in the
>> period)
>> 
>>   - kafka.log
>>  - LogManager
>> - RemainingLogsToRecover
>>- /tmp/log1 => 3← there are 3 logs under
>>/tmp/log1 needed to be recovered
>>- /tmp/log2 => 0
>> - RemainingSegmentsToRecover
>>- /tmp/log1 ← 2 threads are doing log
>>recovery for /tmp/log1
>>- 0 => 9000 ← there are 9000 segments needed to be
>>   recovered for thread 0
>>   - 1 => 5
>>   - /tmp/log2
>>   - 0 => 0
>>   - 1 => 0
>> 
>> 
>> That said, the `RemainingBytesToRecovery` metric is difficult to achieve
>> as you expected. I think the current proposal with `RemainingLogsToRecover`
>> and `RemainingSegmentsToRecover` should already provide enough info for
>> the log recovery progress.
>> 
>> I've also updated the KIP example to make it clear.
>> 
>> 
>> Thank you.
>> Luke
>> 
>> 
>>> On Thu, May 5, 2022 at 3:31 AM James Cheng  wrote:
>>> 
>>> Hi Luke,
>>> 
>>> Thanks for adding RemainingSegmentsToRecovery.
>>> 
>>> Another thought: different topics can have different segment sizes. I
>>> don't know how common it is, but it is possible. Some topics might want
>>> small segment sizes to allow more granular expiration of data.
>>> 
>>> The downside of RemainingLogsToRecovery and RemainingSegmentsToRecovery
>>> is that the rate at which they decrement depends on the configuration and
>>> patterns of the topics and partitions and segment sizes. If someone is
>>> monitoring those metrics, they might see times where the metric decrements
>>> slowly, followed by a burst where it decrements quickly.
>>> 
>>> What about RemainingBytesToRecovery? This would not depend on the

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-10 Thread Luke Chen
Hi James and all,

I checked again, and I can see that when creating a UnifiedLog, we expect
the logs/indexes/snapshots to be in a good state.
So, I don't think we should break the current design to expose the
`RemainingBytesToRecovery`
metric.

If there are no other comments, I'll start a vote within this week.

Thank you.
Luke

On Fri, May 6, 2022 at 6:00 PM Luke Chen  wrote:

> Hi James,
>
> Thanks for your input.
>
> For the `RemainingBytesToRecovery` metric proposal, I think there's one
> thing I didn't make clear.
> Currently, when the log manager starts up, we'll try to load all logs
> (segments), and during the log loading, we'll try to recover logs if
> necessary.
> And the log loading uses a thread pool, as you thought.
>
> So, here's the problem:
> All segments in each log folder (partition) will be loaded by a log
> recovery thread, and only after a log is loaded can we know how many
> segments (or how many bytes) need to be recovered.
> That means, if we have 10 partition logs in one broker, and we have 2 log
> recovery threads (num.recovery.threads.per.data.dir=2), then before the
> threads load the segments in each log, we only know how many logs
> (partitions) we have in the broker (i.e. the RemainingLogsToRecover
> metric). We cannot know how many segments/bytes need to be recovered until
> each thread starts to load the segments under one log (partition).
>
> So, the example in the KIP shows:
> Currently, there are still 5 logs (partitions) that need to be recovered
> under the /tmp/log1 dir. And there are 2 threads doing the job, where one
> thread has 1 segment left to recover, and the other one has 3 segments
> left to recover.
>
>- kafka.log
>   - LogManager
>  - RemainingLogsToRecover
> - /tmp/log1 => 5← there are 5 logs under
> /tmp/log1 needed to be recovered
> - /tmp/log2 => 0
>  - RemainingSegmentsToRecover
> - /tmp/log1 ← 2 threads are doing log
> recovery for /tmp/log1
> - 0 => 1 ← there is 1 segment to be
>recovered for thread 0
>- 1 => 3
>- /tmp/log2
>- 0 => 0
>- 1 => 0
>
>
> So, after a while, the metrics might look like this:
> That is, now there are only 3 logs left to recover in /tmp/log1, and
> thread 0 has 9000 segments left, and thread 1 has 5 segments left
> (which should imply the threads already completed 2 log recoveries in
> that period)
>
>- kafka.log
>   - LogManager
>  - RemainingLogsToRecover
> - /tmp/log1 => 3← there are 3 logs under
> /tmp/log1 needed to be recovered
> - /tmp/log2 => 0
>  - RemainingSegmentsToRecover
> - /tmp/log1 ← 2 threads are doing log
> recovery for /tmp/log1
> - 0 => 9000 ← there are 9000 segments needed to be
>recovered for thread 0
>- 1 => 5
>- /tmp/log2
>- 0 => 0
>- 1 => 0
>
>
> That said, the `RemainingBytesToRecovery` metric is difficult to achieve
> as you expected. I think the current proposal with `RemainingLogsToRecover`
> and `RemainingSegmentsToRecover` should already provide enough info for
> the log recovery progress.
>
> I've also updated the KIP example to make it clear.
>
>
> Thank you.
> Luke
>
>
> On Thu, May 5, 2022 at 3:31 AM James Cheng  wrote:
>
>> Hi Luke,
>>
>> Thanks for adding RemainingSegmentsToRecovery.
>>
>> Another thought: different topics can have different segment sizes. I
>> don't know how common it is, but it is possible. Some topics might want
>> small segment sizes to allow more granular expiration of data.
>>
>> The downside of RemainingLogsToRecovery and RemainingSegmentsToRecovery
>> is that the rate at which they decrement depends on the configuration and
>> patterns of the topics and partitions and segment sizes. If someone is
>> monitoring those metrics, they might see times where the metric decrements
>> slowly, followed by a burst where it decrements quickly.
>>
>> What about RemainingBytesToRecovery? This would not depend on the
>> configuration of the topic or of the data. It would actually be a pretty
>> good metric, because I think that this metric would change at a constant
>> rate (based on the disk I/O speed that the broker allocates to recovery).
>> Because it changes at a constant rate, you would be able to use the
>> rate-of-change to predict when it hits zero, which will let you know when
>> the broker is going to start up. Like, I would imagine if we graphed
>> RemainingBytesToRecovery that we'd see a fairly straight line that is
>> decrementing at a steady rate towards zero.
>>
>> What do you think about adding RemainingBytesToRecovery?
>>
>> Or, what would you think about making the primary metric be
>> RemainingBytesToRecovery, and getting rid of the others?
>>

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-06 Thread Luke Chen
Hi James,

Thanks for your input.

For the `RemainingBytesToRecovery` metric proposal, I think there's one
thing I didn't make clear.
Currently, when the log manager starts up, we'll try to load all logs
(segments), and during the log loading, we'll try to recover logs if
necessary.
And the log loading uses a thread pool, as you thought.

So, here's the problem:
All segments in each log folder (partition) will be loaded by a log
recovery thread, and only after a log is loaded can we know how many
segments (or how many bytes) need to be recovered.
That means, if we have 10 partition logs in one broker, and we have 2 log
recovery threads (num.recovery.threads.per.data.dir=2), then before the
threads load the segments in each log, we only know how many logs
(partitions) we have in the broker (i.e. the RemainingLogsToRecover
metric). We cannot know how many segments/bytes need to be recovered until
each thread starts to load the segments under one log (partition).
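To make the constraint concrete, here is a minimal sketch with hypothetical names (RecoverySketch, recoverDir, and the gauge maps are all illustrative, not Kafka's actual classes): each recovery thread pulls logs from a shared queue, and its per-thread remaining-segments gauge can only be populated once the thread has actually loaded a log and seen its segment count.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class RecoverySketch {
    // dir -> remaining logs; "dir-threadNum" -> remaining segments for the
    // log currently assigned to that thread.
    static final Map<String, Integer> remainingLogs = new ConcurrentHashMap<>();
    static final Map<String, Integer> remainingSegments = new ConcurrentHashMap<>();

    static void recoverDir(String dir, List<Integer> segmentsPerLog, int numThreads)
            throws InterruptedException {
        // The log count is known up front ...
        remainingLogs.put(dir, segmentsPerLog.size());
        BlockingQueue<Integer> logs = new LinkedBlockingQueue<>(segmentsPerLog);
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int t = 0; t < numThreads; t++) {
            String gaugeKey = dir + "-" + t;
            pool.submit(() -> {
                Integer segments;
                while ((segments = logs.poll()) != null) {
                    // ... but only here does the thread learn the segment
                    // count, so only now can the per-thread gauge be set.
                    for (int s = segments; s >= 0; s--) {
                        remainingSegments.put(gaugeKey, s);
                    }
                    remainingLogs.merge(dir, -1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        recoverDir("/tmp/log1", List.of(1, 3, 2, 4, 2), 2);
        System.out.println(remainingLogs);     // {/tmp/log1=0} once done
        System.out.println(remainingSegments);
    }
}
```

The same structure explains why a per-thread bytes gauge is hard: the bytes, like the segment counts, are simply not known before the load begins.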

So, the example in the KIP shows:
Currently, there are still 5 logs (partitions) that need to be recovered
under the /tmp/log1 dir. And there are 2 threads doing the job, where one
thread has 1 segment left to recover, and the other one has 3 segments
left to recover.

   - kafka.log
  - LogManager
 - RemainingLogsToRecover
- /tmp/log1 => 5← there are 5 logs under /tmp/log1
needed to be recovered
- /tmp/log2 => 0
 - RemainingSegmentsToRecover
- /tmp/log1 ← 2 threads are doing log
recovery for /tmp/log1
    - 0 => 1 ← there is 1 segment to be
   recovered for thread 0
   - 1 => 3
   - /tmp/log2
   - 0 => 0
   - 1 => 0


So, after a while, the metrics might look like this:
That is, now there are only 3 logs left to recover in /tmp/log1, and
thread 0 has 9000 segments left, and thread 1 has 5 segments left (which
should imply the threads already completed 2 log recoveries in that period)

   - kafka.log
  - LogManager
 - RemainingLogsToRecover
- /tmp/log1 => 3← there are 3 logs under /tmp/log1
needed to be recovered
- /tmp/log2 => 0
 - RemainingSegmentsToRecover
- /tmp/log1 ← 2 threads are doing log
recovery for /tmp/log1
- 0 => 9000 ← there are 9000 segments needed to be
   recovered for thread 0
   - 1 => 5
   - /tmp/log2
   - 0 => 0
   - 1 => 0
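These gauges would surface over JMX under ObjectNames of the shape discussed in this thread. As a self-contained illustration (plain javax.management, not Kafka's actual Yammer-based registration; the class and attribute names here are assumptions), this sketch registers one such gauge and reads it back the way a monitoring client would:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class GaugeNamingSketch {
    // A minimal read-only gauge, following the standard MBean convention
    // (class Foo + interface FooMBean).
    public interface RecoveryGaugeMBean { long getValue(); }

    public static class RecoveryGauge implements RecoveryGaugeMBean {
        final AtomicLong value = new AtomicLong();
        public long getValue() { return value.get(); }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // ObjectName shape proposed in this thread: dir and threadNum as keys.
        ObjectName name = new ObjectName(
            "kafka.log:type=LogManager,name=remainingSegmentsToRecover"
            + ",dir=/tmp/log1,threadNum=0");
        RecoveryGauge gauge = new RecoveryGauge();
        gauge.value.set(9000);
        if (!server.isRegistered(name)) {
            server.registerMBean(gauge, name);
        }
        // What a JMX monitoring client would do after connecting.
        long remaining = (long) server.getAttribute(name, "Value");
        System.out.println(name + " => " + remaining);
    }
}
```

Note that `/` is legal in an unquoted ObjectName value, so `dir=/tmp/log1` works without quoting.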


That said, the `RemainingBytesToRecovery` metric is difficult to achieve as
you expected. I think the current proposal with `RemainingLogsToRecover`
and `RemainingSegmentsToRecover` should already provide enough info for the
log recovery progress.

I've also updated the KIP example to make it clear.


Thank you.
Luke


On Thu, May 5, 2022 at 3:31 AM James Cheng  wrote:

> Hi Luke,
>
> Thanks for adding RemainingSegmentsToRecovery.
>
> Another thought: different topics can have different segment sizes. I
> don't know how common it is, but it is possible. Some topics might want
> small segment sizes to allow more granular expiration of data.
>
> The downside of RemainingLogsToRecovery and RemainingSegmentsToRecovery is
> that the rate at which they decrement depends on the configuration and
> patterns of the topics and partitions and segment sizes. If someone is
> monitoring those metrics, they might see times where the metric decrements
> slowly, followed by a burst where it decrements quickly.
>
> What about RemainingBytesToRecovery? This would not depend on the
> configuration of the topic or of the data. It would actually be a pretty
> good metric, because I think that this metric would change at a constant
> rate (based on the disk I/O speed that the broker allocates to recovery).
> Because it changes at a constant rate, you would be able to use the
> rate-of-change to predict when it hits zero, which will let you know when
> the broker is going to start up. Like, I would imagine if we graphed
> RemainingBytesToRecovery that we'd see a fairly straight line that is
> decrementing at a steady rate towards zero.
>
> What do you think about adding RemainingBytesToRecovery?
>
> Or, what would you think about making the primary metric be
> RemainingBytesToRecovery, and getting rid of the others?
>
> I don't know if I personally would rather have all 3 metrics, or would
> just use RemainingBytesToRecovery. I too would like more community input
> on which of those metrics would be useful to people.
>
> About the JMX metrics, you said that if
> num.recovery.threads.per.data.dir=2, that there might be a separate
> RemainingSegmentsToRecovery counter for each thread. Is that actually how
> the data is structured within the Kafka recovery threads? Does each thread
> get a fixed set of partitions, or is there just one big pool of partitions

Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-04 Thread James Cheng
Hi Luke,

Thanks for adding RemainingSegmentsToRecovery.

Another thought: different topics can have different segment sizes. I don't 
know how common it is, but it is possible. Some topics might want small 
segment sizes to allow more granular expiration of data.

The downside of RemainingLogsToRecovery and RemainingSegmentsToRecovery is 
that the rate at which they decrement depends on the configuration and 
patterns of the topics, partitions, and segment sizes. If someone is 
monitoring those metrics, they might see times where the metric decrements 
slowly, followed by a burst where it decrements quickly.

What about RemainingBytesToRecovery? This would not depend on the configuration 
of the topic or of the data. It would actually be a pretty good metric, because 
I think that this metric would change at a constant rate (based on the disk I/O 
speed that the broker allocates to recovery). Because it changes at a constant 
rate, you would be able to use the rate-of-change to predict when it hits zero, 
which will let you know when the broker is going to start up. Like, I would 
imagine if we graphed RemainingBytesToRecovery that we'd see a fairly straight 
line that is decrementing at a steady rate towards zero.
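The extrapolation this would enable is simple to sketch (RemainingBytesToRecovery is hypothetical here, and the class and method names below are illustrative): sample the gauge twice, derive the recovery rate, and project when it hits zero.

```java
public class RecoveryEta {
    /**
     * Projected seconds until recovery finishes, from two samples of a
     * remaining-bytes gauge taken at times t0 and t1 (in seconds).
     */
    static double etaSeconds(double t0, long bytes0, double t1, long bytes1) {
        // Bytes recovered per second between the two samples.
        double bytesPerSecond = (bytes0 - bytes1) / (t1 - t0);
        return bytes1 / bytesPerSecond;
    }

    public static void main(String[] args) {
        // Samples 60 s apart: 90 GB -> 84 GB remaining, i.e. recovering at
        // 100 MB/s, so roughly 840 s (14 minutes) of recovery left.
        System.out.println(etaSeconds(0, 90_000_000_000L, 60, 84_000_000_000L));
    }
}
```

The same projection is much less reliable for log or segment counts, since those decrement unevenly, which is exactly the point above.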

What do you think about adding RemainingBytesToRecovery? 

Or, what would you think about making the primary metric be 
RemainingBytesToRecovery, and getting rid of the others?

I don't know if I personally would rather have all 3 metrics, or would just 
use RemainingBytesToRecovery. I too would like more community input on which 
of those metrics would be useful to people.

About the JMX metrics, you said that if num.recovery.threads.per.data.dir=2, 
that there might be a separate RemainingSegmentsToRecovery counter for each 
thread. Is that actually how the data is structured within the Kafka recovery 
threads? Does each thread get a fixed set of partitions, or is there just one 
big pool of partitions that the threads all work on?

As a more concrete example:
* If I have 9 small partitions and 1 big partition, and 
num.recovery.threads.per.data.dir=2
Does each thread get 5 partitions, which means one thread will finish much 
sooner than the other?
OR
Do both threads just work on the set of 10 partitions, which means likely 1 
thread will be busy with the big partition, while the other one ends up plowing 
through the 9 small partitions?

If each thread gets assigned 5 partitions, then it would make sense that each 
thread has its own counter.
If the threads work on a single pool of 10 partitions, then it would probably 
mean that the counter is on the pool of partitions itself, and not on each 
thread.
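The "one big pool" alternative can be sketched like this (hypothetical, not the actual Kafka recovery code): both threads drain a shared queue of 9 small logs and 1 big one, so the remaining-logs counter naturally lives at the pool level rather than on either thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedPoolSketch {
    // Pool-level gauge: no thread owns a fixed set of partitions.
    static final AtomicInteger remainingLogs = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> logs = new LinkedBlockingQueue<>();
        for (int i = 0; i < 9; i++) logs.add(1); // 9 small logs, cost 1 each
        logs.add(50);                            // one much bigger log
        remainingLogs.set(logs.size());

        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int t = 0; t < 2; t++) {
            pool.submit(() -> {
                Integer cost;
                while ((cost = logs.poll()) != null) {
                    try {
                        Thread.sleep(cost); // stand-in for recovery work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    remainingLogs.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // Typically one thread plows through the small logs while the other
        // handles the big one; the shared gauge still drains to 0.
        System.out.println("remainingLogs = " + remainingLogs.get());
    }
}
```

With fixed per-thread assignment the same gauge would instead live on each thread, which is the distinction the question above is getting at.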

-James

> On May 4, 2022, at 5:55 AM, Luke Chen  wrote:
> 
> Hi devs,
> 
> If there are no other comments, I'll start a vote tomorrow.
> 
> Thank you.
> Luke
> 
> On Sun, May 1, 2022 at 5:08 PM Luke Chen  wrote:
> 
>> Hi James,
>> 
>> Sorry for the late reply.
>> 
>> Yes, this is a good point, to know how many segments need to be recovered if
>> there are some large partitions.
>> I've updated the KIP, to add a `*RemainingSegmentsToRecover*` metric for
>> each log recovery thread, to show the value.
>> The example in the Proposed section here
>> 
>> shows what it will look like.
>> 
>> Thanks for the suggestion.
>> 
>> Thank you.
>> Luke
>> 
>> 
>> 
>> On Sat, Apr 23, 2022 at 8:54 AM James Cheng  wrote:
>> 
>>> The KIP describes RemainingLogsToRecovery, which seems to be the number
>>> of partitions in each log.dir.
>>> 
>>> We have some partitions which are much much larger than others. Those
>>> large partitions have many many more segments than others.
>>> 
>>> Is there a way the metric can reflect partition size? Could it be
>>> RemainingSegmentsToRecover? Or even RemainingBytesToRecover?
>>> 
>>> -James
>>> 
>>> Sent from my iPhone
>>> 
 On Apr 20, 2022, at 2:01 AM, Luke Chen  wrote:
 
 Hi all,
 
 I'd like to propose a KIP to expose a metric for log recovery progress.
 This metric would let the admins have a way to monitor the log recovery
 progress.
 Details can be found here:
 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-831%3A+Add+metric+for+log+recovery+progress
 
 Any feedback is appreciated.
 
 Thank you.
 Luke
>>> 
>> 



Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-04 Thread Luke Chen
Hi devs,

If there are no other comments, I'll start a vote tomorrow.

Thank you.
Luke

On Sun, May 1, 2022 at 5:08 PM Luke Chen  wrote:

> Hi James,
>
> Sorry for the late reply.
>
> Yes, this is a good point, to know how many segments need to be recovered if
> there are some large partitions.
> I've updated the KIP, to add a `*RemainingSegmentsToRecover*` metric for
> each log recovery thread, to show the value.
> The example in the Proposed section here
> 
> shows what it will look like.
>
> Thanks for the suggestion.
>
> Thank you.
> Luke
>
>
>
> On Sat, Apr 23, 2022 at 8:54 AM James Cheng  wrote:
>
>> The KIP describes RemainingLogsToRecovery, which seems to be the number
>> of partitions in each log.dir.
>>
>> We have some partitions which are much much larger than others. Those
>> large partitions have many many more segments than others.
>>
>> Is there a way the metric can reflect partition size? Could it be
>> RemainingSegmentsToRecover? Or even RemainingBytesToRecover?
>>
>> -James
>>
>> Sent from my iPhone
>>
>> > On Apr 20, 2022, at 2:01 AM, Luke Chen  wrote:
>> >
>> > Hi all,
>> >
>> > I'd like to propose a KIP to expose a metric for log recovery progress.
>> > This metric would let the admins have a way to monitor the log recovery
>> > progress.
>> > Details can be found here:
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-831%3A+Add+metric+for+log+recovery+progress
>> >
>> > Any feedback is appreciated.
>> >
>> > Thank you.
>> > Luke
>>
>


Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-05-01 Thread Luke Chen
Hi James,

Sorry for the late reply.

Yes, this is a good point, to know how many segments need to be recovered if
there are some large partitions.
I've updated the KIP, to add a `*RemainingSegmentsToRecover*` metric for
each log recovery thread, to show the value.
The example in the Proposed section here

shows what it will look like.

Thanks for the suggestion.

Thank you.
Luke



On Sat, Apr 23, 2022 at 8:54 AM James Cheng  wrote:

> The KIP describes RemainingLogsToRecovery, which seems to be the number of
> partitions in each log.dir.
>
> We have some partitions which are much much larger than others. Those
> large partitions have many many more segments than others.
>
> Is there a way the metric can reflect partition size? Could it be
> RemainingSegmentsToRecover? Or even RemainingBytesToRecover?
>
> -James
>
> Sent from my iPhone
>
> > On Apr 20, 2022, at 2:01 AM, Luke Chen  wrote:
> >
> > Hi all,
> >
> > I'd like to propose a KIP to expose a metric for log recovery progress.
> > This metric would let the admins have a way to monitor the log recovery
> > progress.
> > Details can be found here:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-831%3A+Add+metric+for+log+recovery+progress
> >
> > Any feedback is appreciated.
> >
> > Thank you.
> > Luke
>


Re: [DISCUSS] KIP-831: Add metric for log recovery progress

2022-04-22 Thread James Cheng
The KIP describes RemainingLogsToRecovery, which seems to be the number of 
partitions in each log.dir. 

We have some partitions which are much much larger than others. Those large 
partitions have many many more segments than others. 

Is there a way the metric can reflect partition size? Could it be 
RemainingSegmentsToRecover? Or even RemainingBytesToRecover?

-James

Sent from my iPhone

> On Apr 20, 2022, at 2:01 AM, Luke Chen  wrote:
> 
> Hi all,
> 
> I'd like to propose a KIP to expose a metric for log recovery progress.
> This metric would let the admins have a way to monitor the log recovery
> progress.
> Details can be found here:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-831%3A+Add+metric+for+log+recovery+progress
> 
> Any feedback is appreciated.
> 
> Thank you.
> Luke


[DISCUSS] KIP-831: Add metric for log recovery progress

2022-04-20 Thread Luke Chen
Hi all,

I'd like to propose a KIP to expose a metric for log recovery progress.
This metric would let the admins have a way to monitor the log recovery
progress.
Details can be found here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-831%3A+Add+metric+for+log+recovery+progress

Any feedback is appreciated.

Thank you.
Luke