Re: Propose a KIP to report "REAL" broker/consumer fetch latency?

2021-07-27 Thread Kai Huang
Hi Ming, I am interested in the proposed capability to diagnose Kafka
latency issues and would like to continue the discussion. Do you mind if I
take over this discussion thread and follow up with the community?

On 2021/04/25 17:33:10, Ming Liu  wrote: 
> The idea I am trying right now is:
> 1. Add waitTimeMS to FetchResponse.
> 2. If the fetch has to wait in purgatory due to either
> replica.fetch.wait.max.ms or fetch.min.bytes, the broker fills in
> waitTimeMS in the FetchResponse.
> 3. In the updateRequestMetrics() function, we special-case the Fetch
> response and subtract waitTimeMS from RemoteTime and TotalTime.
> Let me know if you have any suggestions or feedback. I would like to
> propose a KIP for that change.
> 
> 
> On Sat, Apr 24, 2021 at 6:09 PM Israel Ekpo  wrote:
> 
> > Hi Ming
> >
> > This would be a useful metric from a monitoring perspective especially
> > when troubleshooting or diagnosing issues.
> >
> > Are you looking to modify the Admin API for this capability to be added?
> > The metrics for quorum controllers, brokers, replicas and consumers may
> > need to be reported differently
> >
> > I am interested in this capability as well.
> >
> > Maybe there is something in the current Admin API that is not obvious yet
> > so I will need to investigate first and will get back to you with my
> > thoughts/suggestions.
> >
> > Thanks for bringing this up
> >
> > Cheers
> >
> >
> >
> > On Sat, Apr 24, 2021 at 1:21 PM Ming Liu  wrote:
> >
> >> Hi All,
> >>  I am thinking about starting a KIP to report "REAL" broker/consumer
> >> fetch latency. Before that, I would like to collect any ideas or
> >> suggestions. I created https://issues.apache.org/jira/browse/KAFKA-12713.
> >>  The fetch latency is an important metric to monitor for cluster
> >> performance. With acks=all, the produce latency is affected primarily by
> >> broker fetch latency. However, the currently reported fetch latency
> >> does not reflect the true fetch latency, because a fetch sometimes has to
> >> stay in purgatory and wait for replica.fetch.wait.max.ms when data is not
> >> available. This greatly affects the real P50, P99, etc.
> >>
> >> I would like to propose a KIP to track the real fetch latency for both
> >> broker followers and consumers.
> >>
> >> Ming
> >>
> >
> 
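
A rough sketch of the adjustment proposed in this thread, written in Python only for brevity. The class, field, and function names are hypothetical, and clamping the wait time so the adjusted values never go negative is an assumption; this is not Kafka's actual request-metrics code.

```python
from dataclasses import dataclass


@dataclass
class FetchTimings:
    """Per-request timings in milliseconds (hypothetical field names)."""
    remote_time_ms: float       # includes any time parked in purgatory
    total_time_ms: float        # end-to-end time for the request
    wait_time_ms: float = 0.0   # proposed field: time spent waiting in purgatory


def adjusted_fetch_timings(t: FetchTimings) -> FetchTimings:
    """Subtract the purgatory wait from RemoteTime and TotalTime."""
    # Assumption: never subtract more than was actually recorded.
    wait = max(0.0, min(t.wait_time_ms, t.remote_time_ms, t.total_time_ms))
    return FetchTimings(
        remote_time_ms=t.remote_time_ms - wait,
        total_time_ms=t.total_time_ms - wait,
        wait_time_ms=wait,
    )
```

With these assumptions, a fetch that sat in purgatory for 500 ms out of a 505 ms total would be reported as a 5 ms fetch.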


Re: Request Permission to Contribute to Apache Kafka

2021-07-27 Thread Matthias J. Sax
There are three wiki accounts with the name "Kai Huang" -- I added the
one with id `kaihuang`.

Let us know if it was the wrong one.


-Matthias


On 7/27/21 12:15 PM, Kai Huang wrote:
> To whom it may concern,
> 
> Hi, I'd like to request permission to contribute to Apache Kafka by
> following instructions here:
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> 
> My wiki ID:  Kai Huang
> My Jira ID: kaihuang
> 
> Can someone help grant the permission for me? Your help is greatly
> appreciated!
> 
> Kai
> 


Re: [DISCUSS] Apache Kafka 3.0.0 release plan with new updated dates

2021-07-27 Thread Konstantine Karantasis
Thanks Ryan and Ron for reporting these new issues.

KAFKA-13142 is approved as an issue with no good workaround (and keeping in
mind that we have a few open blockers still) and
KAFKA-13137 is approved as a blocker because it's a regression.

Konstantine




On Tue, Jul 27, 2021 at 5:22 PM Ron Dagostino  wrote:

> Hi Konstantine. I've opened KAFKA-13137 as a potential blocker.  An
> approved PR is available at
> https://github.com/apache/kafka/pull/11131.  The kafka.controller
> metrics that the KRaft controllers expose have the wrong MBean names.
>
> Ron
>
> > On Jul 27, 2021, at 8:13 PM, Ryan Dielhenn wrote:
> >
> > Hello,
> >
> > I would like to report a bug in KRaft.
> >
> > Dynamic broker configs are not being validated on the brokers before being
> > forwarded to the controller and persisted in the metadata quorum. This is a
> > blocker because a core requirement of KRaft mode in 3.0 is that it should
> > support upgrades from 3.0. If invalid dynamic configs are persisted to
> > metadata then they will still be there after an upgrade.
> >
> > I believe that this should be considered a blocker for 3.0. Here is the
> > JIRA: https://issues.apache.org/jira/browse/KAFKA-13142. I will be working
> > on a PR this week.
> >
> > Best Regards,
> > Ryan
> >
> >> On Wed, May 26, 2021 at 2:48 PM Konstantine Karantasis
> >>  wrote:
> >>
> >> Hi all,
> >>
> >> Please find below the updated release plan for the Apache Kafka 3.0.0
> >> release.
> >>
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=177046466
> >>
> >> New suggested dates for the release are as follows:
> >>
> >> KIP Freeze is 09 June 2021 (same date as in the initial plan)
> >> Feature Freeze is 30 June 2021 (new date, extended by two weeks)
> >> Code Freeze is 14 July 2021 (new date, extended by two weeks)
> >>
> >> At least two weeks of stabilization will follow Code Freeze.
> >>
> >> The release plan is up to date and currently includes all the approved KIPs
> >> that are targeting 3.0.0.
> >>
> >> Please let me know if you have any objections with the recent extension of
> >> Feature Freeze and Code Freeze or any other concerns.
> >>
> >> Regards,
> >> Konstantine
> >>
>


Re: [DISCUSS] Apache Kafka 3.0.0 release plan with new updated dates

2021-07-27 Thread Ron Dagostino
Hi Konstantine. I've opened KAFKA-13137 as a potential blocker.  An
approved PR is available at
https://github.com/apache/kafka/pull/11131.  The kafka.controller
metrics that the KRaft controllers expose have the wrong MBean names.

Ron

> On Jul 27, 2021, at 8:13 PM, Ryan Dielhenn wrote:
>
> Hello,
>
> I would like to report a bug in KRaft.
>
> Dynamic broker configs are not being validated on the brokers before being
> forwarded to the controller and persisted in the metadata quorum. This is a
> blocker because a core requirement of KRaft mode in 3.0 is that it should
> support upgrades from 3.0. If invalid dynamic configs are persisted to
> metadata then they will still be there after an upgrade.
>
> I believe that this should be considered a blocker for 3.0. Here is the
> JIRA: https://issues.apache.org/jira/browse/KAFKA-13142. I will be working
> on a PR this week.
>
> Best Regards,
> Ryan
>
>> On Wed, May 26, 2021 at 2:48 PM Konstantine Karantasis
>>  wrote:
>>
>> Hi all,
>>
>> Please find below the updated release plan for the Apache Kafka 3.0.0
>> release.
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=177046466
>>
>> New suggested dates for the release are as follows:
>>
>> KIP Freeze is 09 June 2021 (same date as in the initial plan)
>> Feature Freeze is 30 June 2021 (new date, extended by two weeks)
>> Code Freeze is 14 July 2021 (new date, extended by two weeks)
>>
>> At least two weeks of stabilization will follow Code Freeze.
>>
>> The release plan is up to date and currently includes all the approved KIPs
>> that are targeting 3.0.0.
>>
>> Please let me know if you have any objections with the recent extension of
>> Feature Freeze and Code Freeze or any other concerns.
>>
>> Regards,
>> Konstantine
>>


Re: [DISCUSS] Apache Kafka 3.0.0 release plan with new updated dates

2021-07-27 Thread Ryan Dielhenn
Hello,

I would like to report a bug in KRaft.

Dynamic broker configs are not being validated on the brokers before being
forwarded to the controller and persisted in the metadata quorum. This is a
blocker because a core requirement of KRaft mode in 3.0 is that it should
support upgrades from 3.0. If invalid dynamic configs are persisted to
metadata then they will still be there after an upgrade.

I believe that this should be considered a blocker for 3.0. Here is the
JIRA: https://issues.apache.org/jira/browse/KAFKA-13142. I will be working
on a PR this week.

Best Regards,
Ryan

On Wed, May 26, 2021 at 2:48 PM Konstantine Karantasis
 wrote:

> Hi all,
>
> Please find below the updated release plan for the Apache Kafka 3.0.0
> release.
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=177046466
>
> New suggested dates for the release are as follows:
>
> KIP Freeze is 09 June 2021 (same date as in the initial plan)
> Feature Freeze is 30 June 2021 (new date, extended by two weeks)
> Code Freeze is 14 July 2021 (new date, extended by two weeks)
>
> At least two weeks of stabilization will follow Code Freeze.
>
> The release plan is up to date and currently includes all the approved KIPs
> that are targeting 3.0.0.
>
> Please let me know if you have any objections with the recent extension of
> Feature Freeze and Code Freeze or any other concerns.
>
> Regards,
> Konstantine
>


Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.0 #66

2021-07-27 Thread Apache Jenkins Server
See 




[jira] [Resolved] (KAFKA-13139) Empty response after requesting to restart a connector without the tasks results in NPE

2021-07-27 Thread Konstantine Karantasis (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-13139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantine Karantasis resolved KAFKA-13139.

Resolution: Fixed

> Empty response after requesting to restart a connector without the tasks 
> results in NPE
> ---
>
>              Key: KAFKA-13139
>              URL: https://issues.apache.org/jira/browse/KAFKA-13139
>          Project: Kafka
>       Issue Type: Bug
>       Components: KafkaConnect
> Affects Versions: 3.0.0
>         Reporter: Konstantine Karantasis
>         Assignee: Konstantine Karantasis
>         Priority: Blocker
>          Fix For: 3.0.0
>
>
> After https://issues.apache.org/jira/browse/KAFKA-4793 a response to restart 
> only the connector (without any tasks) returns OK with an empty body. 
> As system test runs revealed, this causes an NPE in 
> [https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/rest/RestClient.java#L135]
> We should return 204 (NO_CONTENT) instead. 
> This is a regression from previous behavior, therefore the ticket is marked 
> as a blocker candidate for 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13143) Disable Metadata endpoint for KRaft controller

2021-07-27 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-13143:
---

         Summary: Disable Metadata endpoint for KRaft controller
             Key: KAFKA-13143
             URL: https://issues.apache.org/jira/browse/KAFKA-13143
         Project: Kafka
      Issue Type: Improvement
        Reporter: Jason Gustafson
        Assignee: Jose Armando Garcia Sancio
         Fix For: 3.0.0


The controller currently implements Metadata incompletely. Specifically, it 
does not return the metadata for any topics in the cluster. This may
confuse users. For example, if someone used the controller endpoint
by mistake in `kafka-topics.sh --list`, then they would see no topics in the 
cluster, which would be surprising. It would be better for 3.0 to disable 
Metadata on the controller since we currently expect clients to connect through 
brokers anyway.





[jira] [Created] (KAFKA-13142) KRaft controller does not validate dynamic configs

2021-07-27 Thread Ryan Dielhenn (Jira)
Ryan Dielhenn created KAFKA-13142:
-

         Summary: KRaft controller does not validate dynamic configs
             Key: KAFKA-13142
             URL: https://issues.apache.org/jira/browse/KAFKA-13142
         Project: Kafka
      Issue Type: Task
      Components: kraft
Affects Versions: 3.0.0
        Reporter: Ryan Dielhenn
        Assignee: Ryan Dielhenn


The KRaft controller is not currently validating dynamic configs. To ensure 
that KRaft clusters are easily upgradable it would be a good idea to validate 
dynamic configs in the first release of KRaft so that invalid dynamic configs 
are never stored.
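
A minimal sketch of the validate-before-persist idea described above, in Python for brevity. The validator table and function names are hypothetical, and the two checks are deliberately simplified assumptions; Kafka's real config validation is considerably richer.

```python
class InvalidConfigError(ValueError):
    """Raised when a proposed dynamic config value fails validation."""


def int_at_least(minimum):
    """Build a check that accepts strings parsing to an int >= minimum."""
    def check(value):
        try:
            return int(value) >= minimum
        except (TypeError, ValueError):
            return False
    return check


# Hypothetical, simplified rules for two dynamic broker configs.
VALIDATORS = {
    "num.io.threads": int_at_least(1),
    "log.cleaner.threads": int_at_least(1),
}


def apply_dynamic_configs(persisted, changes):
    """Validate every change first; persist only if all of them pass."""
    for name, value in changes.items():
        check = VALIDATORS.get(name)
        if check is None or not check(value):
            raise InvalidConfigError(f"rejected {name}={value!r}")
    # Reached only when every change is valid, so nothing invalid is stored.
    persisted.update(changes)
```

Validating the whole batch before touching the store means a single bad value leaves the persisted configs unchanged.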





Request Permission to Contribute to Apache Kafka

2021-07-27 Thread Kai Huang
To whom it may concern,

Hi, I'd like to request permission to contribute to Apache Kafka by
following instructions here:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

My wiki ID:  Kai Huang
My Jira ID: kaihuang

Can someone help grant the permission for me? Your help is greatly
appreciated!

Kai


[jira] [Created] (KAFKA-13141) Leader should not update follower fetch offset if diverging epoch is present

2021-07-27 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-13141:
---

         Summary: Leader should not update follower fetch offset if diverging epoch is present
             Key: KAFKA-13141
             URL: https://issues.apache.org/jira/browse/KAFKA-13141
         Project: Kafka
      Issue Type: Bug
Affects Versions: 2.7.1, 2.8.0
        Reporter: Jason Gustafson
        Assignee: Jason Gustafson
         Fix For: 3.0.0, 2.7.2, 2.8.1


In 2.7, we began doing fetcher truncation piggybacked on the Fetch protocol 
instead of using the old OffsetsForLeaderEpoch API. When truncation is 
detected, we return a `divergingEpoch` field in the Fetch response, but we do 
not set an error code. The sender is expected to check if the diverging epoch 
is present and truncate accordingly.

All of this works correctly in the fetcher implementation, but the problem is 
that the logic to update the follower fetch position on the leader does not 
take into account the diverging epoch present in the response. This means the 
fetch offsets can be updated incorrectly, which can lead to either log 
divergence or the loss of committed data.

For example, we hit the following case with 3 replicas. Leader 1 is elected in
epoch 1 with an end offset of 100. The followers are at offset 101.

Broker 1: (Leader) Epoch 1 from offset 100
Broker 2: (Follower) Epoch 1 from offset 101
Broker 3: (Follower) Epoch 1 from offset 101

Broker 1 receives fetches from 2 and 3 at offset 101. The leader detects the 
divergence and returns a diverging epoch in the fetch state. Nevertheless, the 
fetch positions for both followers are updated to 101 and the high watermark is 
advanced.

After brokers 2 and 3 had truncated to offset 100, broker 1 experienced a 
network partition of some kind and was kicked from the ISR. This caused broker 
2 to get elected, which resulted in the following state at the start of epoch 2.

Broker 1: (Follower) Epoch 2 from offset 101
Broker 2: (Leader) Epoch 2 from offset 100
Broker 3: (Follower) Epoch 2 from offset 100

Broker 2 was then able to write a new entry at offset 100 and the old record 
which may have been exposed to consumers was deleted by broker 1.
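
A simplified sketch of the guard this ticket asks for, in Python for brevity. The class and method names are hypothetical, and divergence is modeled only as "the follower fetches beyond the leader's log end offset"; the real check is based on leader epochs.

```python
from typing import Dict, Optional


class LeaderPartitionState:
    """Leader-side view of follower fetch positions (illustrative only)."""

    def __init__(self, log_end_offset: int, follower_ids):
        self.log_end_offset = log_end_offset
        self.follower_fetch_offsets: Dict[int, int] = {f: 0 for f in follower_ids}

    def handle_follower_fetch(self, follower_id: int, fetch_offset: int) -> Optional[int]:
        """Return the offset to truncate to if the follower diverged, else None."""
        if fetch_offset > self.log_end_offset:
            # Divergence: tell the follower to truncate and, crucially, do NOT
            # record its fetch position, so the high watermark cannot advance
            # based on data the leader does not have.
            return self.log_end_offset
        self.follower_fetch_offsets[follower_id] = fetch_offset
        return None
```

In the scenario above, a fetch from broker 2 at offset 101 against a leader whose end offset is 100 would return 100 and leave the recorded position untouched; the position is updated only once broker 2 fetches from offset 100 after truncating.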






Re: [VOTE] KIP-761: Add Total Blocked Time Metric to Streams

2021-07-27 Thread Guozhang Wang
Hello Rohan,

Thanks for the KIP. As Bruno mentioned in the other thread, could you update
the "New Metrics" section so that 1) it has sub-titles for the streams,
producer, and consumer metrics, just for clarification, and 2) "producer-id"
etc. are changed to "client-id", to be consistent with the existing metrics?

Otherwise, I'm +1


Guozhang


On Mon, Jul 26, 2021 at 12:49 PM Leah Thomas 
wrote:

> Hey Rohan,
>
> Thanks for pushing this KIP through. I'm +1, non-binding.
>
> Leah
>
> On Wed, Jul 21, 2021 at 7:09 PM Rohan Desai 
> wrote:
>
> > Now that the discussion thread's been open for a few days, I'm calling for
> > a vote on
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
> >
>


-- 
-- Guozhang


Re: [VOTE] KIP-763: Range queries with open endpoints

2021-07-27 Thread Guozhang Wang
Thank you, Patrick!

The KIP looks good to me, and I also agree with the pragmatic manner. +1
(binding)

Guozhang


On Thu, Jul 22, 2021 at 11:09 AM John Roesler  wrote:

> Thank you, Patrick,
>
> +1 (binding) from me as well!
>
> Thanks,
> -John
>
> On Thu, 2021-07-22 at 10:40 +0200, Bruno Cadonna wrote:
> > Hi Patrick,
> >
> > Thank you for the KIP!
> >
> > +1 (binding)
> >
> > Best,
> > Bruno
> >
> > On 22.07.21 03:47, Luke Chen wrote:
> > > Hi Patrick,
> > > I like this KIP!
> > >
> > > +1 (non-binding)
> > >
> > > Luke
> > >
> > > On Thu, Jul 22, 2021 at 7:04 AM Matthias J. Sax 
> wrote:
> > >
> > > > Thanks for the KIP.
> > > >
> > > > +1 (binding)
> > > >
> > > >
> > > > -Matthias
> > > >
> > > > On 7/21/21 1:18 PM, Patrick Stuedi wrote:
> > > > > Hi all,
> > > > >
> > > > > > Thanks for the feedback on the KIP, I have updated the KIP and would like
> > > > > > to start the voting.
> > > > > >
> > > > > > The KIP can be found here:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-763%3A+Range+queries+with+open+endpoints
> > > > >
> > > > > Please vote in this thread.
> > > > >
> > > > > Thanks!
> > > > > -Patrick
> > > > >
> > > >
> > >
>
>
>

-- 
-- Guozhang


[jira] [Created] (KAFKA-13140) KRaft brokers do not expose kafka.controller metrics, breaking backwards compatibility

2021-07-27 Thread Ron Dagostino (Jira)
Ron Dagostino created KAFKA-13140:
-

         Summary: KRaft brokers do not expose kafka.controller metrics, breaking backwards compatibility
             Key: KAFKA-13140
             URL: https://issues.apache.org/jira/browse/KAFKA-13140
         Project: Kafka
      Issue Type: Bug
      Components: kraft
Affects Versions: 2.8.0, 3.0.0
        Reporter: Ron Dagostino
        Assignee: Ron Dagostino
         Fix For: 3.1.0


The following controller metrics are exposed on every broker in a 
ZooKeeper-based (i.e. non-KRaft) cluster regardless of whether the broker is 
the active controller or not, but these metrics are not exposed on KRaft nodes 
that have process.roles=broker (i.e. KRaft nodes that do not implement the 
controller role). For backwards compatibility, KRaft nodes that are just
brokers should expose these metrics with values all equal to 0, just like
ZooKeeper-based brokers do when they are not the active controller.

kafka.controller:type=KafkaController,name=ActiveControllerCount
kafka.controller:type=KafkaController,name=GlobalTopicCount
kafka.controller:type=KafkaController,name=GlobalPartitionCount
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount







Re: [VOTE] KIP-745: Connect API to restart connector and tasks

2021-07-27 Thread Randall Hauch
FYI, we found a minor error in KIP-745 [1] where it described the old
behavior of the connector restart API as returning a "200 OK" response,
instead of the "204 NO CONTENT" response actually returned by this API in
AK 2.8 and earlier. See KAFKA-13139 [2] for the issue where this was
discovered and [3] for the corresponding code change. The KIP specifies
that the restart API should return the same response when the new query
parameters are `includeTasks=false` and `failedOnly=false`, which
corresponds to the REST API that was available in AK 2.8 and earlier.

I've corrected the KIP to reflect this older actual behavior of returning
"204 NO CONTENT".

We have *not* changed the KIP or the behavior of returning "202 ACCEPTED"
when either or both of these query parameters are "true". That is the newer
behavior added in KIP-745.

Best regards,

Randall


[1]
https://cwiki.apache.org/confluence/display/KAFKA/KIP-745%3A+Connect+API+to+restart+connector+and+tasks
[2] https://issues.apache.org/jira/browse/KAFKA-13139
[3] https://github.com/apache/kafka/pull/11132
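
The status-code rule described in this message can be summarized in a small Python sketch. The function name is hypothetical, and the parameter names simply mirror the query parameters as written above; this is an illustration of the rule, not the Connect implementation.

```python
from http import HTTPStatus


def restart_response_status(include_tasks: bool = False,
                            failed_only: bool = False) -> HTTPStatus:
    """Pick the HTTP status for a connector restart request."""
    if include_tasks or failed_only:
        # Newer behavior added in KIP-745: the restart is accepted
        # and carried out asynchronously.
        return HTTPStatus.ACCEPTED
    # Both parameters false: same response as AK 2.8 and earlier.
    return HTTPStatus.NO_CONTENT
```

Defaulting both parameters to false keeps the pre-KIP-745 call shape mapped to "204 NO CONTENT".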

On Thu, Jun 10, 2021 at 12:36 PM Randall Hauch  wrote:

> The vote passes with three binding +1s (Konstantine, Tom, me), three
> non-binding +1s (Ryanne, Kalpesh, Dongjin), and no -1 votes.
>
> Thanks all for taking the time to review and vote!
>
> Best regards,
>
> Randall
>
>
> On Thu, Jun 10, 2021 at 9:53 AM Tom Bentley  wrote:
>
>> Hi Randall,
>>
>> Thanks for the KIP, +1 (binding).
>>
>> Kind regards,
>>
>> Tom
>>
>> On Thu, Jun 10, 2021 at 4:09 AM Dongjin Lee  wrote:
>>
>> > +1 (non-binding).
>> >
>> > As of present:
>> >
>> > - binding: +2 (Randall, Konstantine)
>> > - non-binding: +3 (Ryanne, Kalpesh, Dongjin)
>> >
>> > We need one more +1 binding.
>> >
>> > Thanks,
>> > Dongjin
>> >
> >> > On Tue, Jun 8, 2021 at 6:31 AM Kalpesh Patel wrote:
>> >
>> > > +1 (non-binding)
>> > >
>> > > Regards
>> > > -Kalpesh
>> > >
>> > > On Mon, Jun 7, 2021 at 3:10 PM Ryanne Dolan 
>> > wrote:
>> > >
>> > > > +1 (non-binding)
>> > > >
>> > > > Ryanne
>> > > >
>> > > > On Mon, Jun 7, 2021, 3:03 PM Konstantine Karantasis
>> > > >  wrote:
>> > > >
>> > > > > Thanks Randall.
>> > > > >
>> > > > > +1 (binding)
>> > > > >
>> > > > > Konstantine
>> > > > >
>> > > > > On Mon, Jun 7, 2021 at 12:47 PM Randall Hauch 
>> > > wrote:
>> > > > >
>> > > > > > Hello all,
>> > > > > >
>> > > > > > I would like to start a vote on KIP-745:
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-745%3A+Connect+API+to+restart+connector+and+tasks
>> > > > > >
>> > > > > > +1 (binding) from myself.
>> > > > > >
>> > > > > > Thanks, and best regards!
>> > > > > >
>> > > > > > Randall
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> > --
>> > *Dongjin Lee*
>> >
>> > *A hitchhiker in the mathematical world.*
>> >
>> >
>> >
>> > *github:  github.com/dongjinleekr
>> > keybase:
>> https://keybase.io/dongjinleekr
>> > linkedin:
>> kr.linkedin.com/in/dongjinleekr
>> > speakerdeck:
>> > speakerdeck.com/dongjin
>> > *
>> >
>>
>


Re: [VOTE] KIP-690: Add additional configuration to control MirrorMaker 2 internal topics naming convention

2021-07-27 Thread Josep Prat
Thanks for the KIP!
+1 (non-binding) from my side!

On Tue, Jul 27, 2021 at 2:22 PM lobo xu  wrote:

> +1, I think that's a good idea. When we use MirrorMaker2, we do not want
> subject name conversion to occur. Hopefully there is a switch that can be
> configured.
>
> > On Jul 27, 2021, at 06:38, Omnia Ibrahim wrote:
> >
> > Bumping up this voting thread.
>
>

-- 

Josep Prat

*Aiven Deutschland GmbH*

Immanuelkirchstraße 26, 10405 Berlin

Amtsgericht Charlottenburg, HRB 209739 B

Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen

*m:* +491715557497

*w:* aiven.io

*e:* josep.p...@aiven.io


Re: [VOTE] KIP-690: Add additional configuration to control MirrorMaker 2 internal topics naming convention

2021-07-27 Thread lobo xu
+1, I think that's a good idea. When we use MirrorMaker2, we do not want subject
name conversion to occur. Hopefully there is a switch that can be configured.

> On Jul 27, 2021, at 06:38, Omnia Ibrahim wrote:
> 
> Bumping up this voting thread.



Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

2021-07-27 Thread Sophie Blee-Goldman
Thanks for the clarifications, that all makes sense.

I'm ready to vote on the KIP, but can you just update the KIP first to
address Bruno's feedback? I.e., just fix the tags and fill in the missing
fields.

For example it sounds like these would be thread-level metrics. You should
be able to figure out what the values should be from the KIP-444 doc,
there's a chart with the type and tags for all thread-level metrics.

Not 100% sure what Bruno meant by "group" but my guess would be whether
it's INFO/DEBUG/TRACE. This is probably one of the most important
things to include in a KIP that's introducing new metrics: how big is the
potential performance impact of recording these metrics? How big is the
intended audience, would these be useful to almost everyone or are they
more "niche"?

It sounds like maybe DEBUG would be most appropriate here -- WDYT?

-Sophie

On Thu, Jul 22, 2021 at 9:01 AM Bruno Cadonna  wrote:

> Hi Rohan,
>
> Thank you for the KIP!
>
> I agree that the KIP is well-motivated.
>
> What is not very clear is the metadata like type, group, and tags of the
> metrics. For example, there is no application-id tag in Streams and
> there is also no producer-id tag. The clients, i.e., producer, admin,
> consumer, and also Streams have a client-id tag that corresponds to the
> producer-id, consumer-id, etc. you use in the KIP.
>
> For examples of metadata used in Streams you can look at the following
> KIPs:
>
> -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
> -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams
> -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB
> -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
>
>
> Best,
> Bruno
>
> On 22.07.21 09:42, Rohan Desai wrote:
> > re sophie:
> >
> > The intent here was to include all blocked time (not just `RUNNING`). The
> > caller can window the total blocked time themselves, and that can be
> > compared with a timeseries of the state to understand the ratio in
> > different states. I'll update the KIP to include `committed`. The admin API
> > calls should be accounted for by the admin client iotime/iowaittime
> > metrics.
> >
> > On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai 
> > wrote:
> >
> >>> I remember now that we moved the round-trip PID's txn completion logic into
> >>> init-transaction and commit/abort-transaction. So I think we'd count time
> >>> as in StreamsProducer#initTransaction as well (admittedly it is in most
> >>> cases a one-time thing).
> >>
> >> Makes sense - I'll update the KIP
> >>
> >> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai 
> >> wrote:
> >>
> >>>
> >>>> I had a question - it seems like from the descriptions of
> >>>> `txn-commit-time-total` and `offset-commit-time-total` that they measure
> >>>> similar processes for ALOS and EOS, but only `txn-commit-time-total` is
> >>>> included in `blocked-time-total`. Why isn't `offset-commit-time-total` also
> >>>> included?
> >>>
> >>> I've updated the KIP to include it.
> >>>
> >>>> Aside from `flush-time-total`, `txn-commit-time-total` and
> >>>> `offset-commit-time-total`, which will be producer/consumer client
> >>>> metrics, the rest of the metrics will be streams metrics that will be
> >>>> thread level, is that right?
> >>>
> >>> Based on the feedback from Guozhang, I've updated the KIP to reflect that
> >>> the lower-level metrics are all client metrics that are then summed to
> >>> compute the blocked time metric, which is a Streams metric.
> >>>
> >>> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai 
> >>> wrote:
> >>>
> > Similarly, I think "txn-commit-time-total" and
>  "offset-commit-time-total" may better be inside producer and consumer
>  clients respectively.
> 
>  I agree for offset-commit-time-total. For txn-commit-time-total I'm
>  proposing we measure `StreamsProducer.commitTransaction`, which wraps
>  multiple producer calls (sendOffsets, commitTransaction)
> 
> >> For "txn-commit-time-total" specifically, besides
>  producer.commitTxn.
>  other txn-related calls may also be blocking, including
>  producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>  later in the doc, but did not include it as a separate metric, and
>  similarly, should we have a `txn-abort-time-total` as well? If yes,
>  could
>  you update the KIP page accordingly.
> 
>  `beginTransaction` is not blocking - I meant to remove that from that
>  doc. I'll add something for abort.
> 
>  On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai  >
>  wrote:
> 
> > Thanks for the review Guozhang! responding to your feedback inline:
> >
> >> 1) I agree that the current ratio metrics is just "snapshot in
> > point", and
> > more