[jira] [Created] (KAFKA-15594) Add 3.6.0 to streams upgrade/compatibility tests

2023-10-11 Thread Satish Duggana (Jira)
Satish Duggana created KAFKA-15594:
--

 Summary: Add 3.6.0 to streams upgrade/compatibility tests
 Key: KAFKA-15594
 URL: https://issues.apache.org/jira/browse/KAFKA-15594
 Project: Kafka
  Issue Type: Sub-task
Reporter: Satish Duggana






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15593) Add 3.6.0 to broker/client upgrade/compatibility tests

2023-10-11 Thread Satish Duggana (Jira)
Satish Duggana created KAFKA-15593:
--

 Summary: Add 3.6.0 to broker/client upgrade/compatibility tests
 Key: KAFKA-15593
 URL: https://issues.apache.org/jira/browse/KAFKA-15593
 Project: Kafka
  Issue Type: Sub-task
Reporter: Satish Duggana






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-985 Add reverseRange and reverseAll query over kv-store in IQv2

2023-10-11 Thread Matthias J. Sax

Quick addendum.

Some DSL operators use `range` and seem to rely on ascending order, 
too... Of course, if we say `range()` has no ordering guarantee, we 
would add `forwardRange()` and let the DSL use `forwardRange`, too.


The discussion of course also applies to `all()` and `reverseAll()`, 
and I assume also to `prefixScan()` (even if there is no "reverse" 
version for it).



Re: [DISCUSS] KIP-985 Add reverseRange and reverseAll query over kv-store in IQv2

2023-10-11 Thread Matthias J. Sax

Thanks for raising this question Hanyu. Great find!

My interpretation is as follows (it's actually a warning signal that the 
API contract is not well defined, and we should fix this by extending 
the JavaDocs and the docs on the web page).


We have existing `range()` and `reverseRange()` methods on 
`ReadOnlyKeyValueStore` -- the interface itself is not typed (ie, just 
generics), and we state that we don't guarantee "logical order" because 
underlying stores are based on the `byte[]` type. So far so... well.
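
For reference, the methods in question, simplified from the current 
`ReadOnlyKeyValueStore` interface (`prefixScan()` and friends omitted):

public interface ReadOnlyKeyValueStore<K, V> {
    V get(K key);
    // the ordering caveat under discussion lives in the JavaDocs
    KeyValueIterator<K, V> range(K from, K to);
    KeyValueIterator<K, V> reverseRange(K from, K to);
    KeyValueIterator<K, V> all();
    KeyValueIterator<K, V> reverseAll();
}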


However, to make matters worse, we are also not explicit about whether 
the underlying store implementation *must* return keys in 
byte[]-lexicographical order or not...


For `range()`, I would be kinda willing to accept that there is no 
ordering guarantee at all -- for example, if the underlying byte[]-store 
is hash-based and implements a full scan to answer a `range()`, it might 
not be efficient, but also not incorrect if keys are returned in some 
"random" (byte[]-)order. In isolation, I don't see an API contract 
violation.


However, `reverseRange` implicitly states with its name that some 
"descending order" (based on keys) is expected. Given the JavaDoc comment 
about "logical" vs "byte[]" order, the contract (at least to me) is 
clear: return records in descending byte[]-lexicographical order. -- 
Any other interpretation seems to be off? Curious to hear if you agree 
or disagree with this interpretation.




If this is correct, it means we are actually lacking an API contract for 
an ascending byte[]-lexicographical range scan. Furthermore, a hash-based 
byte[]-store would need to explicitly sort its results for 
`reverseRange` to not violate the contract.


To me, this raises the question of whether `range()` actually has a 
(non-explicit) contract about returning data in byte[]-lexicographical 
order? It seems a lot of people rely on this, and our default stores 
actually implement it this way. So if we don't look at `range()` in 
isolation, but look at the `ReadOnlyKeyValueStore` interface 
holistically, I would also buy the argument that `range()` implies 
"ascending byte[]-lexicographical order". Thoughts?


To be frank: to me, it's pretty clear that the original idea to add 
`range()` was to return data in ascending order.



Question 1:
 - Do we believe that the range() contract is ascending 
byte[]-lexicographical order right now?


   If yes, I would propose to make it explicit in the JavaDocs.

    If no, I would also propose to make it explicit in the JavaDocs. In 
addition, it raises the question of whether a method `forwardRange()` (for 
lack of a better name right now) is actually missing to provide such a 
contract?



Of course, we always depend on the serialization format for order, and 
if users need "logical order" they need to ensure they use a serialization 
format that aligns byte[]-lexicographical order with logical order. But for 
the scope of this work, I would not even try to open this can of worms...





Looking into `RangeQuery`, the JavaDocs don't say anything about order. 
Thus, `RangeQuery#range()` could actually also be implemented by calling 
`reverseRange()` without violating the contract, it seems. A hash-based 
store could also implement it, without the need to explicitly sort...


Which brings me back to my original thought about having three types of 
results for `Range`:

 - no ordering guarantee
 - ascending (we would only give byte[]-lexicographical order)
 - descending (we would only give byte[]-lexicographical order)
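
To make the idea concrete, a rough sketch of what such a three-way 
contract could look like on the query side (all names here are 
hypothetical, not the actual API):

public enum ResultOrder {
    ANY,         // no ordering guarantee
    ASCENDING,   // ascending byte[]-lexicographical order
    DESCENDING   // descending byte[]-lexicographical order
}

// hypothetical usage: RangeQuery.withRange(lower, upper).withOrder(ResultOrder.DESCENDING)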

Again, I actually believe that the original intent of RangeQuery was to 
inherit the ascending order of `ReadOnlyKeyValueStore#range()`... Please 
keep me honest about it. On the other hand, both APIs seem to be 
independent enough to not couple them... -- this could actually be a 
step in the right direction and would follow the underlying idea of 
IQv2 to begin with: decouple semantics for the store interfaces from the 
query types and semantics...



OR: we actually say that `RangeQuery#range` did implicitly inherit the 
(non-explicit) "ascending byte[]-lexicographical" order of the 
underlying `ReadOnlyKeyValueStore`, and we just need to update the 
(Java)Docs to make it explicit. -- But it might go against the idea of 
IQv2 as stated above.



Furthermore, the consequence would be that a potential custom 
hash-based store would need to do extra work in `range()` to do the 
sorting (or of course might reject the query as "not supported"). -- Of 
course, a hash-based store would still need to do extra work to 
implement `ReadOnlyKeyValueStore#(reverse)Range()` correctly (or also 
throw an `UnsupportedOperationException`...). -- However, if we keep the 
store interface and query interface independent as intended by IQv2, we 
would allow a hash-based store to implement `RangeQuery#range()` even if 
the store does not support `ReadOnlyKeyValueStore#range()` (or only with 
additional sorting cost); such a decoupling sounds like an

Re: [DISCUSS] KIP-969: Support range interactive queries for versioned state stores

2023-10-11 Thread Matthias J. Sax

Thanks for the update.



To retrieve the latest value(s), the user must call just the asOf method 
with the MAX value (asOf(MAX)). The same applies to KIP-968. Do you think 
it is clumsy, Matthias?



Well, in KIP-968 calling `asOf` and passing in a timestamp is optional, 
and the default is "latest", right? So while `asOf(MAX)` does the same 
thing, practically users would never call `asOf` for a "latest" query?


In this KIP, we enforce that users give us a key range (we have the 4 
static entry point methods to define a query for this), and we default 
to "no bounds" for the time range.


The existing `RangeQuery` allows querying a range of keys for existing 
stores. It seems to be a common pattern to query a key range on "latest". 
-- In the current proposal, users would need to do:


MultiVersionedRangeQuery.withKeyRange(startKey, endKey).asOf(MAX);

Would like to hear from others if we think that's a good user experience? 
If we agree to accept this, I think we should explain how to do it in 
the JavaDocs (and also the regular docs) -- otherwise, I can already 
anticipate user questions on all question-asking channels about how to do 
a "normal key range query". IMHO, the problem is not that the code itself 
is too clumsy, but that it's totally not obvious to users how to express 
it without actually explaining it to them. It basically violates the API 
design rule "make it easy to use / simple things should be easy".


Btw: We could also re-use `RangeQuery` and add an implementation to 
`VersionedStateStore` to just accept this query type, with "key range 
over latest" semantics. -- The issue is of course that users need to 
know that the query would return `ValueAndTimestamp` and not plain `V` 
(or we add a translation step to unwrap the value, but we would lose the 
"validFrom" timestamp -- validTo would be `null`). Because type safety 
is a general issue in IQv2, it would not make it worse (in the strict 
sense), but I am also not sure if we want to dig an even deeper hole...



-Matthias


On 10/10/23 11:55 AM, Alieh Saeedi wrote:

Thanks, Matthias and Bruno, for the feedback on KIP-969. Here is a summary
of the updates I made to the KIP:

1. I liked the idea of renaming methods as Matthias suggested.
2. I removed the allVersions() method as I did in KIP-968. To retrieve
the latest value(s), the user must call just the asOf method with the MAX
value (asOf(MAX)). The same applies to KIP-968. Do you think it is clumsy,
Matthias?
3. I added a method to the *VersionedKeyValueStore* interface, as I did
for KIP-968.
4. Matthias: I do not get what you mean by your second comment. Isn't
the KIP already explicit about that?

> I assume, results are returned by timestamp for each key. The KIP
should be explicit about it.


Cheers,
Alieh



On Tue, Oct 3, 2023 at 6:07 AM Matthias J. Sax  wrote:


Thanks for updating the KIP.

Not sure if I agree or not with Bruno's idea to split the query types
further? In the end, we split them only because there are three different
return types: single value, value-iterator, key-value-iterator.

What do we gain by splitting out single-ts-range-key? In the end, for
range-ts-range-key the proposed class is necessary and is a superset
(one can set both timestamps to the same value, for single-ts lookup).

The mentioned simplification might apply to "single-ts-range-key" but I
don't see a simplification for the proposed (and necessary) query type?

On the other hand, I see an advantage of a single-ts-range-key for
querying over the "latest version" with a range of keys. For a
single-ts-range-key query, it would be the default (similar to
VersionedKeyQuery with no asOf-timestamp defined).

In the current version of the KIP (if we agree that the default should
actually return "all versions", not "latest" -- this default was
suggested by Bruno on KIP-968 and makes sense to me, so we would need to
have the same default here to stay consistent), users would need to pass
in `from(Long.MAX).to(Long.MAX)` (if I got this right) to query the
latest point in time only, which seems clumsy? Or we could add a
`latestKeyOnly` option to `MultiVersionedRangeQuery`, but that does seem
a little clumsy, too.




The overall order of the returned records is by Key


I assume, results are returned by timestamp for each key. The KIP should
be explicit about it.



To be very explicit, should we rename the methods to specify the key bound?

   - withRange -> withKeyRange
   - withLowerBound -> withLowerKeyBound
   - withUpperBound -> withUpperKeyBound
   - withNoBounds -> allKeys (or withNoKeyBounds, but we use
`allVersions` and not `noTimeBound` and should align the naming?)



-Matthias


On 9/6/23 5:25 AM, Bruno Cadonna wrote:

Hi Alieh,

Thanks for the KIP!

One high level comment/question:

I assume you separated single key queries into two classes because
versioned key queries return a single value and multi version key
queries 

Re: [DISCUSS] KIP-988 Streams Standby Task Update Listener

2023-10-11 Thread Colt McNealy
Sophie—

Thank you very much for such a detailed review of the KIP. It might
actually be longer than the original KIP in the first place!

1. Ack'ed and fixed.

2. Correct, this is a confusing passage and requires context:

One thing on our list of TODO's regarding reliability is to determine how
to configure `session.timeout.ms`. In our Kubernetes Environment, an
instance of our Streams App can be terminated, restarted, and get back into
the "RUNNING" Streams state in about 20 seconds. We have two options here:
a) set session.timeout.ms to 30 seconds or so, and deal with 20 seconds of
unavailability for affected partitions, but avoid shuffling Tasks; or b)
set session.timeout.ms to a low value, such as 6 seconds (
heartbeat.interval.ms of 2000), and reduce the unavailability window during
a rolling bounce but incur an "extra" rebalance. There are several
different costs to a rebalance, including the shuffling of standby tasks.
JMX metrics are not fine-grained enough to give us an accurate picture of
what's going on with the whole Standby Task Shuffle Dance. I hypothesize
that the Standby Update Listener might help us clarify just how the
shuffling actually (not theoretically) works, which will help us make a
more informed decision about the session timeout config.

If you think this is worth putting in the KIP, I'll polish it and do so;
else, I'll remove the current half-baked explanation.

3. Overall, I agree with this. In our app, each Task has only one Store to
reduce the number of changelog partitions, so I sometimes forget the
distinction between the two concepts, as reflected in the KIP (:

3a. I don't like the word "Restore" here, since Restoration refers to an
Active Task getting caught up in preparation to resume processing.
`StandbyUpdateListener` is fine by me; I have updated the KIP. I am a
native Python speaker so I do prefer shorter names anyways (:

3b1. +1 to removing the word 'Task'.

3b2. I like `onUpdateStart()`, but with your permission I'd prefer
`onStandbyUpdateStart()` which matches the name of the Interface
"StandbyUpdateListener". (the python part of me hates this, however)

3b3. Going back to question 2), `earliestOffset` was intended to allow us
to more easily calculate the amount of state _already loaded_ in the store
by subtracting (startingOffset - earliestOffset). This would help us see
how much inefficiency is introduced in a rolling restart—if we end up going
from a situation with an up-to-date standby before the restart, and then
after the whole restart, the Task is shuffled onto an instance where there
is no previous state, then that is expensive. However, if the final
shuffling results in the Task back on an instance with a lot of pre-built
state, it's not expensive.

If a call over the network is required to determine the earliestOffset,
then this is a "hard no-go" for me, and we will remove it (I'll have to
check with Eduwer as he is close to having a working implementation). I
think we can probably determine what we wanted to see in a different
way, but it will take more thinking... If `earliestOffset` is confusing,
perhaps rename it to `earliestChangelogOffset`?

`startingOffset` is easy to remove as it can be determined from the first
call to `onBatch{Restored/Updated/Processed/Loaded}()`.

Anyways, I've updated the JavaDoc in the interface; hopefully it's more
clear. Awaiting further instructions here.

3c. Good point; after thinking, my preference is `onBatchLoaded()`  ->
`onBatchUpdated()` -> `onBatchProcessed()` -> `onBatchRestored()`. I am
less fond of "processed" because when I was first learning Streams I
mistakenly thought that standby tasks actually processed the input topic
rather than loaded from the changelog. I'll defer to you here.

3d. +1 to `onUpdateSuspended()`, or better yet
`onStandbyUpdateSuspended()`. Will check about the implementation of
keeping track of the number of records loaded.
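
To keep the moving pieces straight, here is a rough sketch of where the
thread seems to be converging (method names, parameters, and the enum are
all still under discussion, so treat everything here as hypothetical):

public interface StandbyUpdateListener {

    enum SuspendReason {
        MIGRATED,
        PROMOTED   // or RECYCLED; see 4b below
    }

    // earliestOffset: earliest changelog offset, intended to let us
    // estimate state already loaded as (startingOffset - earliestOffset)
    void onStandbyUpdateStart(TopicPartition topicPartition,
                              String storeName,
                              long earliestOffset);

    void onBatchLoaded(TopicPartition topicPartition,
                       String storeName,
                       long batchEndOffset,
                       long numRecordsLoaded);

    void onStandbyUpdateSuspended(TopicPartition topicPartition,
                                  String storeName,
                                  long storeEndOffset,
                                  SuspendReason reason);
}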

4a. I think this might be best in a separate KIP, especially given that
this is my and Eduwer's first time contributing to Kafka (so we want to
minimize the blast radius).

4b. I might respectfully (and timidly) push back here: RECYCLED for an
Active Task is a bit confusing to me. DEMOTED and MIGRATED make sense from
the standpoint of an Active Task; recycling to me sounds like throwing
stuff away, such that the resources (i.e. disk space) can be used by a
separate Task. As an alternative, rather than trying to reuse the same enum,
maybe rename it to `StandbySuspendReason` to avoid naming conflicts with
`ActiveSuspendReason`? However, I could be convinced to rename PROMOTED ->
RECYCLED, especially if Eduwer agrees.

TLDR:

T1. Agreed, will remove the word "Task" as it's incorrect.
T2. Will update to `onStandbyUpdateStart()`
T3. Awaiting further instructions on earliestOffset and startingOffset.
T4. I don't like `onBatchProcessed()` too much, perhaps `onBatchLoaded()`?
T5. Will update to `onStandbyUpdateSuspended()`
T6. Thoughts on renaming SuspendReason to 

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2276

2023-10-11 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-10-11 Thread Matthias J. Sax

Thanks for the update!



Some thoughts about changes you made and open questions you raised:


10) About asOf vs until: I was just looking into the `WindowKeyQuery`, 
`WindowRangeQuery`, and `ReadOnlyWindowStore` interfaces. For those, 
we use "timeFrom" and "timeTo", so it seems best to actually use 
`to(Instant toTime)` to keep the naming consistent across the board?


If yes, we should also do `from(Instant fromTime)` and use getters 
`fromTime()` and `toTime()` -- given that they are range bounds, it seems 
acceptable to me to diverge a little bit from KIP-960's `asOfTimestamp()` 
-- but we could also rename it to `asOfTime()`? -- Given that we 
strongly type with `Instant`, I am not worried about semantic ambiguity.




20) About throwing an NPE when time bounds are `null` -- why? (For the 
key it makes sense, as it is mandatory to have a key.) Could we not 
interpret `null` as "no bound"? We did KIP-941 to add `null` for 
open-ended `RangeQueries`, so I am wondering if we should just stick to 
the same semantics?


Cf 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-941%3A+Range+queries+to+accept+null+lower+and+upper+bounds
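
For reference, the KIP-941 semantics referred to here, sketched with the 
`RangeQuery.withRange` factory (the concrete types in the example are 
illustrative):

// null bounds are open-ended since KIP-941
RangeQuery<String, Long> fullScan   = RangeQuery.withRange(null, null);
RangeQuery<String, Long> upperBound = RangeQuery.withRange(null, "k42");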




30) About the class naming. That's always tricky, and I am not married 
to my proposal. I agree with Bruno that the other suggested names are 
not really better. -- The underlying idea was to get some "consistent" 
naming across the board.


Existing `KeyQuery`
New `VersionedKeyQuery` (KIP-960; we add a prefix)
New `MultiVersionKeyQuery` (this KIP; extend the prefix with a pre-prefix)

Existing `RangeQuery`
New `MultiVersionRangeQuery` (KIP-969; add same prefix as above)



40) I am fine with not adding `range(from, to)` -- it was just an idea.





Some more follow up question:

50) You propose to add a new constructor and getter to `VersionedRecord` 
-- I am wondering if this implies that `validTo` is optional, because the 
existing constructor is not deprecated? -- Also, what happens if 
`validTo` is not set and `validTo()` is called? Or do we intend to make 
`validTo` mandatory?


Maybe this question can only be answered when working on the code, but I 
am wondering if we should make `validTo` mandatory or not... And what 
the "blast radius" of changing `VersionedRecord` will be in general. Do 
you already have some POC PR that we could look at to get some signals 
about this?
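
For concreteness, the shape in question is roughly this (a sketch; the 
two-argument constructor exists today, the rest is what the KIP seems to 
propose):

public final class VersionedRecord<V> {
    public VersionedRecord(V value, long timestamp);                  // existing
    public VersionedRecord(V value, long timestamp, long validTo);    // proposed
    public long validTo();                                            // proposed getter
    // ...
}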




60) The new query class is defined to return 
`ValueIterator<VersionedRecord<V>>` -- while I like the idea to add 
`ValueIterator` in a generic way on the one hand, I am wondering if 
it might be better to change it, and enforce its usage (ie, return type) 
of `VersionedRecord` to improve type safety (type erasure is often a 
pain, and we could mitigate it this way).


Btw: We actually do a similar thing for `KeyValueIterator`.

Ie,

public interface ValueIterator<V> extends Iterator<VersionedRecord<V>>

and

VersionedRecord<V> peek();

This would imply that the return type of the new query is 
`ValueIterator<V>` on the interface, which seems simpler and more elegant?


If we go with the change, I am also wondering if we need to find a 
better name for the new iterator class? Maybe `VersionIterator` or 
something like this?


Of course, it might limit the use of `ValueIterator` for other value 
types -- not sure if this is a prohibitive limitation? My gut 
feeling is that it should not be too limiting.





70) Do we really need the change in `VersionedKeyValueStore` that adds a 
new method? In the end, the idea of IQv2 is to avoid exactly this... It 
was the main issue of IQv1 that the base interface of the store needed 
an update, and thus all classes implementing the base interface, making 
it very cumbersome to add new query types. -- Of course, we need this 
new method on the actual implementation (as a private method) that can 
be called from the `query()` method, but adding it to the interface seems 
to defeat the purpose of IQv2.


Note, for existing IQv2 query types that go against other stores, the 
public methods already existed when IQv2 was introduced, and thus the 
implementation of these query types just pragmatically re-used existing 
methods -- but it does not imply that new public methods should be added.





-Matthias


On 10/11/23 5:11 AM, Bruno Cadonna wrote:

Thanks for the updates, Alieh!

The example in the KIP uses the allVersions() method which we agreed to 
remove.


Regarding your questions:
1. asOf vs. until: I am fine with both but slightly prefer until.
2. What about KeyMultiVersionsQuery, KeyVersionsQuery (KIP-960 would 
then be KeyVersionQuery). However, I am also fine with 
MultiVersionedKeyQuery since none of the names sounds better or worse to 
me.
3. I agree with you not to introduce the method with the two bounds to 
keep things simple.

4. Forget about fromTime() and asOfTime(); from() and asOf() are fine.
5. The main purpose is to show how to use the API. Maybe make an example 
with just the key to distinguish this query from the single value query 
of KIP-960 and then one with a key and a time range. 

[jira] [Created] (KAFKA-15592) Member does not need to always try to join a group when a groupId is configured

2023-10-11 Thread Philip Nee (Jira)
Philip Nee created KAFKA-15592:
--

 Summary: Member does not need to always try to join a group when a 
groupId is configured
 Key: KAFKA-15592
 URL: https://issues.apache.org/jira/browse/KAFKA-15592
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: Philip Nee
Assignee: Philip Nee


Currently, instantiating a membershipManager means the member will always seek 
to join a group unless it has failed fatally. However, this is not always the 
desired behavior, because the member should be able to join and leave a group at 
any time during its life cycle. Maybe we should include an "inactive" state in 
the state machine, indicating that the member does not want to be in a rebalance 
group.
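
A hypothetical illustration of the idea (these state names are made up for 
the example and are not the actual state machine):

// Hypothetical: an INACTIVE state the manager could park in while the
// member, although configured with a groupId, does not want to join.
enum MemberState {
    INACTIVE,
    JOINING,
    STABLE,
    FENCED,
    FATAL
}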



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Security for Kafka

2023-10-11 Thread Walchester Gaw
Hello.

I am trying to implement Quorum TLS by following the instructions in
https://zookeeper.apache.org/doc/r3.5.7/zookeeperAdmin.html#Quorum+TLS, but
I keep encountering the following errors after doing the second rolling
restart, where sslQuorum is set to true.

   - [2023-10-11 05:46:03,250] WARN Cannot open channel to 3 at election
   address /xxx.xx.xx.xxx: (
   org.apache.zookeeper.server.quorum.QuorumCnxManager)
   javax.net.ssl.SSLHandshakeException: Received fatal alert:
   handshake_failure
   - [2023-10-11 05:47:12,513] WARN Closing connection to /xxx.xx.xx.
   xxx: (org.apache.zookeeper.server.NettyServerCnxn)
   java.io.IOException: ZK down

Our current cluster setup consists of 3 Linux servers (Amazon EC2
instances), each of which contains one ZooKeeper and one broker. I have
tried using the Private IP DNS name and the Public IPv4 DNS as the alias and
distinguished name when generating the self-signed certificate for each of
the servers. For the generation of the CA key and CA certificate, I used the
Private IP DNS name and the Public IPv4 DNS of one of the servers as the
common name, respectively. Do note that I am generating all
keystores/truststores on just one server (this server's IP is indicated in
the CA key and CA cert) and distributing them accordingly.
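
For reference, my zookeeper.properties follows the quorum TLS settings from
the linked admin guide, roughly like this (paths and passwords are
placeholders):

sslQuorum=true
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
ssl.quorum.keyStore.location=/path/to/keystore.jks
ssl.quorum.keyStore.password=changeit
ssl.quorum.trustStore.location=/path/to/truststore.jks
ssl.quorum.trustStore.password=changeit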

I made sure that all ZKs were up and running when I was getting the "ZK
down" issue, and I am getting that error for all three ZKs. I can also
confirm that the file paths indicated in zookeeper.properties, where the
keystore and truststore are located, are correct.

Can someone assist regarding this? What am I missing here?  Let me know if
you need more information.

I am also unsure if there is something like a community page for Kafka
where I can reach out to the community where hopefully someone with a
similar setup can help.

Thanks,
Chester


Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.6 #90

2023-10-11 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 307140 lines...]
> Task :connect:api:testClasses UP-TO-DATE
> Task :connect:json:generateMetadataFileForMavenJavaPublication
> Task :connect:api:testJar
> Task :clients:generateMetadataFileForMavenJavaPublication
> Task :connect:api:testSrcJar
> Task :connect:json:publishMavenJavaPublicationToMavenLocal
> Task :connect:json:publishToMavenLocal
> Task :connect:api:publishMavenJavaPublicationToMavenLocal
> Task :connect:api:publishToMavenLocal
> Task :storage:api:compileTestJava
> Task :storage:api:testClasses
> Task :server-common:compileTestJava
> Task :server-common:testClasses
> Task :raft:compileTestJava
> Task :raft:testClasses
> Task :group-coordinator:compileTestJava
> Task :group-coordinator:testClasses

> Task :clients:javadoc
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.6/clients/src/main/java/org/apache/kafka/common/config/ConfigDef.java:81:
 warning - Tag @link:illegal character: "60" in "#define(String, Type, 
Importance, String, String, int, Width, String, List)"
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.6/clients/src/main/java/org/apache/kafka/common/config/ConfigDef.java:81:
 warning - Tag @link:illegal character: "62" in "#define(String, Type, 
Importance, String, String, int, Width, String, List)"
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.6/clients/src/main/java/org/apache/kafka/common/config/ConfigDef.java:81:
 warning - Tag @link: can't find define(String, Type, Importance, String, 
String, int, Width, String, List) in 
org.apache.kafka.common.config.ConfigDef

> Task :core:compileScala
> Task :metadata:compileTestJava
> Task :metadata:testClasses

> Task :clients:javadoc
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.6/clients/src/main/java/org/apache/kafka/clients/admin/ScramMechanism.java:32:
 warning - Tag @see: missing final '>': "https://cwiki.apache.org/confluence/display/KAFKA/KIP-554%3A+Add+Broker-side+SCRAM+Config+API;>KIP-554:
 Add Broker-side SCRAM Config API

 This code is duplicated in 
org.apache.kafka.common.security.scram.internals.ScramMechanism.
 The type field in both files must match and must not change. The type field
 is used both for passing ScramCredentialUpsertion and for the internal
 UserScramCredentialRecord. Do not change the type field."
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.6/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/secured/package-info.java:21:
 warning - Tag @link: reference not found: 
org.apache.kafka.common.security.oauthbearer
5 warnings

> Task :clients:javadocJar
> Task :clients:srcJar
> Task :clients:testJar
> Task :clients:testSrcJar
> Task :clients:publishMavenJavaPublicationToMavenLocal
> Task :clients:publishToMavenLocal
> Task :core:classes
> Task :core:compileTestJava NO-SOURCE
> Task :core:compileTestScala
> Task :core:testClasses
> Task :streams:compileTestJava
> Task :streams:testClasses
> Task :streams:testJar
> Task :streams:testSrcJar
> Task :streams:publishMavenJavaPublicationToMavenLocal
> Task :streams:publishToMavenLocal

Deprecated Gradle features were used in this build, making it incompatible with 
Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings 
and determine if they come from your own scripts or plugins.

For more on this, please refer to 
https://docs.gradle.org/8.2.1/userguide/command_line_interface.html#sec:command_line_warnings
 in the Gradle documentation.

BUILD SUCCESSFUL in 5m 32s
94 actionable tasks: 41 executed, 53 up-to-date

Publishing build scan...
https://ge.apache.org/s/numsezsf527sa

[Pipeline] sh
+ grep ^version= gradle.properties
+ cut -d= -f 2
[Pipeline] dir
Running in 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.6/streams/quickstart
[Pipeline] {
[Pipeline] sh
+ mvn clean install -Dgpg.skip
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Kafka Streams :: Quickstart[pom]
[INFO] streams-quickstart-java[maven-archetype]
[INFO] 
[INFO] < org.apache.kafka:streams-quickstart >-
[INFO] Building Kafka Streams :: Quickstart 3.6.1-SNAPSHOT[1/2]
[INFO]   from pom.xml
[INFO] [ pom ]-
[INFO] 
[INFO] --- clean:3.0.0:clean (default-clean) @ streams-quickstart ---
[INFO] 
[INFO] --- remote-resources:1.5:process (process-resource-bundles) @ 
streams-quickstart ---
[INFO] 
[INFO] --- site:3.5.1:attach-descriptor (attach-descriptor) @ 
streams-quickstart ---
[INFO] 
[INFO] --- gpg:1.6:sign (sign-artifacts) @ streams-quickstart ---
[INFO] 
[INFO] --- install:2.5.2:install (default-install) @ streams-quickstart ---
[INFO] Installing 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-11 Thread Matthias J. Sax

I can answer 130 and 131.

130) We cannot guarantee that all clients are already initialized, due to 
race conditions. We plan to not allow calling 
`KafkaStreams#clientsInstanceIds()` when the state is not RUNNING (or 
REBALANCING) though -- guess this slipped out of the KIP and should be 
added? But because StreamThreads can be added dynamically (and producers 
might be created dynamically at runtime; cf. below), we still cannot 
guarantee that all clients are already initialized when the method is 
called. Of course, we assume that all clients are most likely initialized 
on the happy path, and blocking calls to `client.clientInstanceId()` 
should be rare.


To address the worst case, we won't do a naive implementation and just 
loop over all clients, but fan out the call to the different 
StreamThreads (and the GlobalStreamThread if it exists), and use Futures 
to gather the results.
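
A minimal sketch of that fan-out idea (the helper and all names are 
hypothetical, and the per-future timeout is naive; a real implementation 
would track the remaining deadline):

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class InstanceIdGatherer {
    // Fan out potentially blocking clientInstanceId() lookups so the worst
    // case is roughly the slowest client, not the sum over all clients.
    static Map<String, String> gather(Map<String, Supplier<String>> lookups,
                                      Duration timeout) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, lookups.size()));
        try {
            Map<String, CompletableFuture<String>> futures = new HashMap<>();
            lookups.forEach((key, lookup) ->
                futures.put(key, CompletableFuture.supplyAsync(lookup, pool)));
            Map<String, String> ids = new HashMap<>();
            for (Map.Entry<String, CompletableFuture<String>> e : futures.entrySet()) {
                ids.put(e.getKey(), e.getValue().get(timeout.toMillis(), TimeUnit.MILLISECONDS));
            }
            return ids;
        } finally {
            pool.shutdownNow();
        }
    }
}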


Currently, a `StreamThread` has 3 clients (if ALOS or EOSv2 is used), so 
we might do 3 blocking calls in the worst case (for EOSv1 we get a 
producer per task, and we might end up doing more blocking calls if the 
producers are not initialized yet). Note that EOSv1 is already 
deprecated, and we are also working on thread refactoring that will 
reduce the number of clients on a StreamThread to 2 -- and we have more 
refactoring planned to reduce the number of clients even further.


Inside `KafkaStreams#clientsInstanceIds()` we might only do a single 
blocking call for the admin client (ie, `admin.clientInstanceId()`).


I agree that we need to do some clever timeout management, but it seems 
to be more of an implementation detail?


Do you have any particular concerns, or does the proposed implementation 
as sketched above address your question?



131) If the Topology does not have a global state store, there won't be 
a GlobalThread and thus no global consumer. Thus, we return an Optional.




Now three related questions for Andrew.

(1) Why is the method called `clientInstanceId()` and not just plain 
`instanceId()`?


(2) Why do we return a `String` and not a UUID type? The added 
protocol request/response classes use UUIDs.


(3) Would it make sense to have an overloaded `clientInstanceId()` 
method that does not take any parameter but uses the `default.api.timeout.ms` 
config (this config does not exist on the producer though, so we could 
only have it for the consumer and admin client at this point)? We could of 
course also add overloads like this later if users request them (and/or add 
`default.api.timeout.ms` to the producer, too).


Btw: For KafkaStreams, I think `clientsInstanceIds` still makes sense as 
a method name though, as `KafkaStreams` itself does not have an 
`instanceId` -- we also cannot have a timeout-less overload, because 
`KafkaStreams` does not have a `default.api.timeout.ms` config either 
(and I don't think it makes sense to add one).




-Matthias

On 10/11/23 5:07 PM, Jun Rao wrote:

Hi, Andrew,

Thanks for the updated KIP. Just a few more minor comments.

130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for all
consumer/producer/adminClient instances to be initialized? Are all those
instances created during KafkaStreams initialization?

131. Why does globalConsumerInstanceId() return Optional while
other consumer instances don't return Optional?

132. ClientMetricsSubscriptionRequestCount: Do we need this since we have a
set of generic metrics
(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
report Request rate for every request type?

Thanks,

Jun

On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax  wrote:


Thanks!

On 10/10/23 11:31 PM, Andrew Schofield wrote:

Matthias,
Yes, I think that’s a sensible way forward and the interface you propose

looks good. I’ll update the KIP accordingly.


Thanks,
Andrew


On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:

Andrew,

yes I would like to get this change into KIP-714 right away. Seems to be

important, as we don't know if/when a follow-up KIP for Kafka Streams would
land.


I was also thinking (and discussed with a few others) how to expose it,

and we would propose the following:


We add a new method to `KafkaStreams` class:

 public ClientsInstanceIds clientsInstanceIds(Duration timeout);

The returned object is like below:

   public class ClientsInstanceIds {
 // we only have a single admin client per KS instance
 String adminInstanceId();

 // we only have a single global consumer per KS instance (if any)
 // Optional<> because we might not have global-thread
  Optional<String> globalConsumerInstanceId();

  // return a <threadKey, ClientInstanceId> mapping
 // for the underlying (restore-)consumers/producers
  Map<String, String> mainConsumerInstanceIds();
  Map<String, String> restoreConsumerInstanceIds();
  Map<String, String> producerInstanceIds();
}

For the `threadKey`, we would use some pattern like this:

   [Stream|StateUpdater]Thread-


Would this work from your POV?



-Matthias


On 10/9/23 2:15 AM, Andrew Schofield wrote:

Hi Matthias,
Good 

Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2275

2023-10-11 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 315649 lines...]
Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToSuspendShouldRemoveTaskFromPendingUpdateActions() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldOnlyKeepLastUpdateAction() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldOnlyKeepLastUpdateAction() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToRecycle() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToRecycle() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToUpdateInputPartitionsShouldRemoveTaskFromPendingUpdateActions()
 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToUpdateInputPartitionsShouldRemoveTaskFromPendingUpdateActions()
 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldVerifyIfPendingTaskToRecycleExist() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldVerifyIfPendingTaskToRecycleExist() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToUpdateInputPartitions() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToUpdateInputPartitions() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToCloseDirtyShouldRemoveTaskFromPendingUpdateActions() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToCloseDirtyShouldRemoveTaskFromPendingUpdateActions() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToSuspend() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToSuspend() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldVerifyIfPendingTaskToInitExist() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldVerifyIfPendingTaskToInitExist() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToCloseCleanShouldRemoveTaskFromPendingUpdateActions() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToCloseCleanShouldRemoveTaskFromPendingUpdateActions() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldDrainPendingTasksToCreate() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldDrainPendingTasksToCreate() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldKeepAddedTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldKeepAddedTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> More than one query result throws IllegalArgumentException STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> More than one query result throws IllegalArgumentException PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> Zero query results shouldn't error STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> Zero query results shouldn't error PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> Valid query results still works STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> Valid query results still works PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@647e5999, 

[jira] [Created] (KAFKA-15591) Trogdor produce workload reports errors in KRaft mode

2023-10-11 Thread Xi Yang (Jira)
Xi Yang created KAFKA-15591:
---

 Summary: Trogdor produce workload reports errors in KRaft mode
 Key: KAFKA-15591
 URL: https://issues.apache.org/jira/browse/KAFKA-15591
 Project: Kafka
  Issue Type: Bug
 Environment: Linux
Reporter: Xi Yang


The Kafka benchmark in the DaCapo Benchmark Suite uses Trogdor's exec mode 
(https://github.com/dacapobench/dacapobench/pull/224) to test the Kafka 
broker.

 

I am trying to update the benchmark to use the KRaft protocol. We use a single 
Kafka instance that plays both the controller and broker roles, following the 
guide in the Kafka README.md 
(https://github.com/apache/kafka#running-a-kafka-broker-in-kraft-mode).

 

However, the Trogdor produce workload (tests/spec/simple_produce_bench.json) 
reports the NOT_LEADER_OR_FOLLOWER error. The errors are gone after many 
retries. Is this caused by the fact that, with the KRaft protocol, Kafka does 
not elect leaders immediately after a new topic is created but rather does so 
on demand after receiving the first message on the topic? If this is the root 
cause, is there a way to ask Kafka to elect the leader after creating the topic?
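
One workaround I am considering is to wait until every partition reports a 
leader before starting the workload, along these lines (a sketch using the 
standard Admin API; untested, and the topic name is taken from the workload 
spec):
{code:java}
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class WaitForLeaders {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.of(
                AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
            while (true) {
                // Poll topic metadata until all partitions have a leader.
                TopicDescription desc = admin.describeTopics(List.of("foo1"))
                        .topicNameValues().get("foo1").get();
                boolean allHaveLeaders = desc.partitions().stream()
                        .allMatch(p -> p.leader() != null);
                if (allHaveLeaders) {
                    break;
                }
                Thread.sleep(100);
            }
        }
    }
}
{code}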
{code:java}
// code placeholder
./bin/trogdor.sh agent -n node0 -c ./config/trogdor.conf --exec 
./tests/spec/simple_produce_bench.json
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/xyang/code/kafka/tools/build/dependant-libs-2.13.12/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/xyang/code/kafka/trogdor/build/dependant-libs-2.13.12/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
Oct 12, 2023 12:30:50 AM org.glassfish.jersey.server.wadl.WadlFeature configure
WARNING: JAXBContext implementation could not be found. WADL feature is 
disabled.
Oct 12, 2023 12:30:50 AM org.glassfish.jersey.internal.inject.Providers 
checkProviderRuntime
WARNING: A provider org.apache.kafka.trogdor.agent.AgentRestResource registered 
in SERVER runtime does not implement any provider interfaces applicable in the 
SERVER runtime. Due to constraint configuration problems the provider 
org.apache.kafka.trogdor.agent.AgentRestResource will be ignored.
Waiting for completion of task:{
  "class" : "org.apache.kafka.trogdor.workload.ProduceBenchSpec",
  "startMs" : 1697070650540,
  "durationMs" : 1000,
  "producerNode" : "node0",
  "bootstrapServers" : "localhost:9092",
  "targetMessagesPerSec" : 1,
  "maxMessages" : 5,
  "keyGenerator" : {
    "type" : "sequential",
    "size" : 4,
    "startOffset" : 0
  },
  "valueGenerator" : {
    "type" : "constant",
    "size" : 512,
    "value" : 
"AAA="
  },
  "activeTopics" : {
    "foo[1-3]" : {
      "numPartitions" : 10,
      "replicationFactor" : 1
    }
  },
  "inactiveTopics" : {
    "foo[4-5]" : {
      "numPartitions" : 10,
      "replicationFactor" : 1
    }
  },
  "useConfiguredPartitioner" : false,
  "skipFlush" : false
}
[2023-10-12 00:30:50,862] WARN [Producer clientId=producer-1] Got error produce 
response with correlation id 6 on topic-partition foo1-8, retrying (2147483646 
attempts left). Error: NOT_LEADER_OR_FOLLOWER 
(org.apache.kafka.clients.producer.internals.Sender)
[2023-10-12 00:30:50,862] WARN [Producer clientId=producer-1] Received invalid 
metadata error in produce request on partition foo1-8 due to 
org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests 
intended only for the leader, this error indicates that the broker is not the 
current leader. For requests intended for any replica, this error indicates 
that the broker is not a replica of the topic partition.. Going to request 
metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2023-10-12 00:30:50,870] WARN [Producer clientId=producer-1] Got error produce 
response with correlation id 8 on topic-partition foo2-8, retrying (2147483646 
attempts left). Error: NOT_LEADER_OR_FOLLOWER 
(org.apache.kafka.clients.producer.internals.Sender)
[2023-10-12 00:30:50,870] WARN [Producer clientId=producer-1] 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-11 Thread Jun Rao
Hi, Andrew,

Thanks for the updated KIP. Just a few more minor comments.

130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for all
consumer/producer/adminClient instances to be initialized? Are all those
instances created during KafkaStreams initialization?

131. Why does globalConsumerInstanceId() return Optional while
other consumer instances don't return Optional?

132. ClientMetricsSubscriptionRequestCount: Do we need this since we have a
set of generic metrics
(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
report Request rate for every request type?

Thanks,

Jun

On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax  wrote:

> Thanks!
>
> On 10/10/23 11:31 PM, Andrew Schofield wrote:
> > Matthias,
> > Yes, I think that’s a sensible way forward and the interface you propose
> looks good. I’ll update the KIP accordingly.
> >
> > Thanks,
> > Andrew
> >
> >> On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:
> >>
> >> Andrew,
> >>
> >> yes I would like to get this change into KIP-714 right away. Seems to be
> important, as we don't know if/when a follow-up KIP for Kafka Streams would
> land.
> >>
> >> I was also thinking (and discussed with a few others) how to expose it,
> and we would propose the following:
> >>
> >> We add a new method to `KafkaStreams` class:
> >>
> >> public ClientsInstanceIds clientsInstanceIds(Duration timeout);
> >>
> >> The returned object is like below:
> >>
> >>   public class ClientsInstanceIds {
> >> // we only have a single admin client per KS instance
> >> String adminInstanceId();
> >>
> >> // we only have a single global consumer per KS instance (if any)
> >> // Optional<> because we might not have global-thread
> >> Optional<String> globalConsumerInstanceId();
> >>
> >> // return a <threadKey, ClientInstanceId> mapping
> >> // for the underlying (restore-)consumers/producers
> >> Map<String, String> mainConsumerInstanceIds();
> >> Map<String, String> restoreConsumerInstanceIds();
> >> Map<String, String> producerInstanceIds();
> >> }
> >>
> >> For the `threadKey`, we would use some pattern like this:
> >>
> >>   [Stream|StateUpdater]Thread-
> >>
> >>
> >> Would this work from your POV?
> >>
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 10/9/23 2:15 AM, Andrew Schofield wrote:
> >>> Hi Matthias,
> >>> Good point. Makes sense to me.
> >>> Is this something that can also be included in the proposed Kafka
> Streams follow-on KIP, or would you prefer that I add it to KIP-714?
> >>> I have a slight preference for the former to put all of the KS
> enhancements into a separate KIP.
> >>> Thanks,
> >>> Andrew
>  On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:
> 
>  Thanks Andrew. SGTM.
> 
>  One point you did not address is the idea to add a method to
> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be
> added to consumer/producer/admin clients.
> 
>  Without addressing this, Kafka Streams users won't have a way to get
> the assigned `instanceId` of the internally created clients, and thus it
> would be very difficult for them to know which metrics that the broker
> receives belong to a Kafka Streams app. It seems they would only find the
> `instanceIds` in the log4j output if they enable client logging?
> 
>  Of course, because there are multiple clients inside Kafka Streams,
> the return type cannot be a single "String", but must be some complex
> data structure -- we could either add a new class, or return a
> Map using a client key that maps to the `instanceId`.
> 
>  For example we could use the following key:
> 
> [Global]StreamThread[-][-restore][consumer|producer]
> 
>  (Of course, only the valid combination.)
> 
>  Or maybe even better, we might want to return a `Future` because
> collecting all the `instanceId`s might be a blocking call on each client? I
> already have a few ideas how it could be implemented but I don't think it
> must be discussed in the KIP, as it's an implementation detail.
> 
>  Thoughts?
> 
> 
>  -Matthias
> 
>  On 10/6/23 4:21 AM, Andrew Schofield wrote:
> > Hi Matthias,
> > Thanks for your comments. I agree that a follow-up KIP for Kafka
> Streams makes sense. This KIP currently has made a bit
> > of an effort to embrace KS, but it’s not enough by a long way.
> > I have removed `application.id `. This
> should be done properly in the follow-up KIP. I don’t believe there’s a
> downside to
> > removing it from this KIP.
> > I have reworded the statement about temporarily. In practice, the
> implementation of this KIP that’s going on while the voting
> > progresses happens to use delta temporality, but that’s an
> implementation detail. Supporting clients must support both
> > temporalities.
> > I thought about exposing the client instance ID as a metric, but
> non-numeric metrics are not usual practice and tools
> > do not universally support them. I 

Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2023-10-11 Thread Calvin Liu
Hi David,
Thanks for the comment.
Yes, we can separate the ELR enablement from the metadata version. It is
also helpful to avoid blocking the following MV releases if the user is not
ready for ELR.
One thing to correct: unclean recovery is controlled by
unclean.recovery.manager.enabled, a separate config
from unclean.recovery.strategy. It determines whether unclean recovery will
be used in an unclean leader election.
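
For reference, a sketch of how I think of the two configs (the value names
here are illustrative; the KIP is the authority):

# whether the unclean recovery manager is used for unclean leader elections
unclean.recovery.manager.enabled=true
# when an unclean leader election may happen at all, e.g. none/balanced/aggressive
unclean.recovery.strategy=balanced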
Thanks

On Wed, Oct 11, 2023 at 4:11 PM David Arthur  wrote:

> One thing we should consider is a static config to totally enable/disable
> the ELR feature. If I understand the KIP correctly, we can effectively
> disable the unclean recovery by setting the recovery strategy config to
> "none".
>
> This would make development and rollout of this feature a bit smoother.
> Consider the case that we find bugs in ELR after a cluster has updated to
> its MetadataVersion. It's simpler to disable the feature through config
> rather than going through a MetadataVersion downgrade (once that's
> supported).
>
> Does that make sense?
>
> -David
>
> On Wed, Oct 11, 2023 at 1:40 PM Calvin Liu 
> wrote:
>
> > Hi Jun
> > -Good catch, yes, we don't need the -1 in the DescribeTopicRequest.
> > -No new value is added. The LeaderRecoveryState will still be set to 1 if
> > we have an unclean leader election. The unclean leader election includes
> > the old random way and the unclean recovery. During the unclean recovery,
> > the LeaderRecoveryState will not change until the controller decides to
> > update the records with the new leader.
> > Thanks
> >
> > On Wed, Oct 11, 2023 at 9:02 AM Jun Rao 
> wrote:
> >
> > > Hi, Calvin,
> > >
> > > Another thing. Currently, when there is an unclean leader election, we
> > set
> > > the LeaderRecoveryState in PartitionRecord and PartitionChangeRecord to
> > 1.
> > > With the KIP, will there be new values for LeaderRecoveryState? If not,
> > > when will LeaderRecoveryState be set to 1?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Oct 10, 2023 at 4:24 PM Jun Rao  wrote:
> > >
> > > > Hi, Calvin,
> > > >
> > > > One more comment.
> > > >
> > > > "The first partition to fetch details for. -1 means to fetch all
> > > > partitions." It seems that FirstPartitionId of 0 naturally means
> > fetching
> > > > all partitions?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Oct 10, 2023 at 12:40 PM Calvin Liu
>  > >
> > > > wrote:
> > > >
> > > >> Hi Jun,
> > > >> Yeah, with the current Metadata request handling, we only return
> > errors
> > > on
> > > >> the Topic level, like topic not found. It seems that querying a
> > specific
> > > >> partition is not a valid use case. Will update.
> > > >> Thanks
> > > >>
> > > >> On Tue, Oct 10, 2023 at 11:55 AM Jun Rao 
> > > >> wrote:
> > > >>
> > > >> > Hi, Calvin,
> > > >> >
> > > >> > 60.  If the range query has errors for some of the partitions, do
> we
> > > >> expect
> > > >> > different responses when querying particular partitions?
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jun
> > > >> >
> > > >> > On Tue, Oct 10, 2023 at 10:50 AM Calvin Liu
> > >  > > >> >
> > > >> > wrote:
> > > >> >
> > > >> > > Hi Jun
> > > >> > > 60. Yes, it is a good question. I was thinking the API could be
> > > >> flexible
> > > >> > to
> > > >> > > query the particular partitions if the range query has errors
> for
> > > >> some of
> > > >> > > the partitions. Not sure whether it is a valid assumption, what
> do
> > > you
> > > >> > > think?
> > > >> > >
> > > >> > > 61. Good point, I will update them to partition level with the
> > same
> > > >> > limit.
> > > >> > >
> > > >> > > 62. Sure, will do.
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > > On Tue, Oct 10, 2023 at 10:12 AM Jun Rao
>  > >
> > > >> > wrote:
> > > >> > >
> > > >> > > > Hi, Calvin,
> > > >> > > >
> > > >> > > > A few more minor comments on your latest update.
> > > >> > > >
> > > >> > > > 60. DescribeTopicRequest: When will the Partitions field be
> > used?
> > > It
> > > >> > > seems
> > > >> > > > that the FirstPartitionId field is enough for AdminClient
> usage.
> > > >> > > >
> > > >> > > > 61. Could we make the limit for DescribeTopicRequest,
> > > >> > > ElectLeadersRequest,
> > > >> > > > GetReplicaLogInfo consistent? Currently, ElectLeadersRequest's
> > > >> limit is
> > > >> > > at
> > > >> > > > topic level and GetReplicaLogInfo has a different partition
> > level
> > > >> limit
> > > >> > > > from DescribeTopicRequest.
> > > >> > > >
> > > >> > > > 62. Should ElectLeadersRequest.DesiredLeaders be at the same
> > level
> > > >> as
> > > >> > > > ElectLeadersRequest.TopicPartitions.Partitions? In the KIP, it
> > > looks
> > > >> > like
> > > >> > > > it's at the same level as ElectLeadersRequest.TopicPartitions.
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Jun
> > > >> > > >
> > > >> > > > On Wed, Oct 4, 2023 at 3:55 PM Calvin Liu
> > > >> 
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > 

Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2023-10-11 Thread David Arthur
One thing we should consider is a static config to totally enable/disable
the ELR feature. If I understand the KIP correctly, we can effectively
disable the unclean recovery by setting the recovery strategy config to
"none".

This would make development and rollout of this feature a bit smoother.
Consider the case that we find bugs in ELR after a cluster has updated to
its MetadataVersion. It's simpler to disable the feature through config
rather than going through a MetadataVersion downgrade (once that's
supported).

Does that make sense?

-David

On Wed, Oct 11, 2023 at 1:40 PM Calvin Liu 
wrote:

> Hi Jun
> -Good catch, yes, we don't need the -1 in the DescribeTopicRequest.
> -No new value is added. The LeaderRecoveryState will still be set to 1 if
> we have an unclean leader election. The unclean leader election includes
> the old random way and the unclean recovery. During the unclean recovery,
> the LeaderRecoveryState will not change until the controller decides to
> update the records with the new leader.
> Thanks
>
> On Wed, Oct 11, 2023 at 9:02 AM Jun Rao  wrote:
>
> > Hi, Calvin,
> >
> > Another thing. Currently, when there is an unclean leader election, we
> set
> > the LeaderRecoveryState in PartitionRecord and PartitionChangeRecord to
> 1.
> > With the KIP, will there be new values for LeaderRecoveryState? If not,
> > when will LeaderRecoveryState be set to 1?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Oct 10, 2023 at 4:24 PM Jun Rao  wrote:
> >
> > > Hi, Calvin,
> > >
> > > One more comment.
> > >
> > > "The first partition to fetch details for. -1 means to fetch all
> > > partitions." It seems that FirstPartitionId of 0 naturally means
> fetching
> > > all partitions?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Oct 10, 2023 at 12:40 PM Calvin Liu  >
> > > wrote:
> > >
> > >> Hi Jun,
> > >> Yeah, with the current Metadata request handling, we only return
> errors
> > on
> > >> the Topic level, like topic not found. It seems that querying a
> specific
> > >> partition is not a valid use case. Will update.
> > >> Thanks
> > >>
> > >> On Tue, Oct 10, 2023 at 11:55 AM Jun Rao 
> > >> wrote:
> > >>
> > >> > Hi, Calvin,
> > >> >
> > >> > 60.  If the range query has errors for some of the partitions, do we
> > >> expect
> > >> > different responses when querying particular partitions?
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jun
> > >> >
> > >> > On Tue, Oct 10, 2023 at 10:50 AM Calvin Liu
> >  > >> >
> > >> > wrote:
> > >> >
> > >> > > Hi Jun
> > >> > > 60. Yes, it is a good question. I was thinking the API could be
> > >> flexible
> > >> > to
> > >> > > query the particular partitions if the range query has errors for
> > >> some of
> > >> > > the partitions. Not sure whether it is a valid assumption, what do
> > you
> > >> > > think?
> > >> > >
> > >> > > 61. Good point, I will update them to partition level with the
> same
> > >> > limit.
> > >> > >
> > >> > > 62. Sure, will do.
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > On Tue, Oct 10, 2023 at 10:12 AM Jun Rao  >
> > >> > wrote:
> > >> > >
> > >> > > > Hi, Calvin,
> > >> > > >
> > >> > > > A few more minor comments on your latest update.
> > >> > > >
> > >> > > > 60. DescribeTopicRequest: When will the Partitions field be
> used?
> > It
> > >> > > seems
> > >> > > > that the FirstPartitionId field is enough for AdminClient usage.
> > >> > > >
> > >> > > > 61. Could we make the limit for DescribeTopicRequest,
> > >> > > ElectLeadersRequest,
> > >> > > > GetReplicaLogInfo consistent? Currently, ElectLeadersRequest's
> > >> limit is
> > >> > > at
> > >> > > > topic level and GetReplicaLogInfo has a different partition
> level
> > >> limit
> > >> > > > from DescribeTopicRequest.
> > >> > > >
> > >> > > > 62. Should ElectLeadersRequest.DesiredLeaders be at the same
> level
> > >> as
> > >> > > > ElectLeadersRequest.TopicPartitions.Partitions? In the KIP, it
> > looks
> > >> > like
> > >> > > > it's at the same level as ElectLeadersRequest.TopicPartitions.
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Jun
> > >> > > >
> > >> > > > On Wed, Oct 4, 2023 at 3:55 PM Calvin Liu
> > >> 
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi David,
> > >> > > > > Thanks for the comments.
> > >> > > > > 
> > >> > > > > I thought that a new snapshot with the downgraded MV is
> created
> > in
> > >> > this
> > >> > > > > case. Isn’t it the case?
> > >> > > > > Yes, you are right, a metadata delta will be generated after
> the
> > >> MV
> > >> > > > > downgrade. Then the user can start the software downgrade.
> > >> > > > > -
> > >> > > > > Could you also elaborate a bit more on the reasoning behind
> > adding
> > >> > the
> > >> > > > > limits to the admin RPCs? This is a new pattern in Kafka so it
> > >> would
> > >> > be
> > >> > > > > good to clear on the motivation.
> > >> > > > > Thanks to Colin for bringing it up. The current
> MetadataRequest
> > >> does
> > >> > > not
> > >> > > > > have a limit on the number 

[jira] [Resolved] (KAFKA-15571) StateRestoreListener#onRestoreSuspended is never called because wrapper DelegatingStateRestoreListener doesn't implement onRestoreSuspended

2023-10-11 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-15571.

Resolution: Fixed

> StateRestoreListener#onRestoreSuspended is never called because wrapper 
> DelegatingStateRestoreListener doesn't implement onRestoreSuspended
> ---
>
> Key: KAFKA-15571
> URL: https://issues.apache.org/jira/browse/KAFKA-15571
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.5.0, 3.6.0, 3.5.1
>Reporter: Levani Kokhreidze
>Assignee: Levani Kokhreidze
>Priority: Major
>
> With https://issues.apache.org/jira/browse/KAFKA-10575 
> `StateRestoreListener#onRestoreSuspended` was added. But local tests show 
> that it is never called because `DelegatingStateRestoreListener` was not 
> updated to call the new method.
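For context, the fix is essentially to forward the new callback in the wrapper.
A minimal sketch (the field name is hypothetical; the callback signature is the
one added by KAFKA-10575):

  @Override
  public void onRestoreSuspended(final TopicPartition topicPartition,
                                 final String storeName,
                                 final long totalRestored) {
      // Previously missing: without this override the interface's no-op
      // default was used, so the user listener was never notified.
      if (userRestoreListener != null) {
          userRestoreListener.onRestoreSuspended(topicPartition, storeName, totalRestored);
      }
  }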



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-714: Client metrics and observability

2023-10-11 Thread Sophie Blee-Goldman
This looks great! +1 (binding)

Sophie

On Wed, Oct 11, 2023 at 1:46 PM Matthias J. Sax  wrote:

> +1 (binding)
>
> On 9/13/23 5:48 PM, Jason Gustafson wrote:
> > Hey Andrew,
> >
> > +1 on the KIP. For many users of Kafka, it may not be fully understood
> how
> > much of a challenge client monitoring is. With tens of clients in a
> > cluster, it is already difficult to coordinate metrics collection. When
> > there are thousands of clients, and when the cluster operator has no
> > control over them, it is essentially impossible. For the fat clients that
> > we have, the lack of useful telemetry is a huge operational gap.
> > Consistency between clients has also been a major challenge. I think the
> > effort toward standardization in this KIP will have some positive impact
> > even in deployments which have effective client-side monitoring.
> Overall, I
> > think this proposal will provide a lot of value across the board.
> >
> > Best,
> > Jason
> >
> > On Wed, Sep 13, 2023 at 9:50 AM Philip Nee  wrote:
> >
> >> Hey Andrew -
> >>
> >> Thank you for taking the time to reply to my questions. I'm just adding
> >> some notes to this discussion.
> >>
> >> 1. epoch: It can be helpful to know the delta of the client side and the
> >> actual leader epoch.  It is helpful to understand why sometimes commit
> >> fails/client not making progress.
> >> 2. Client connection: If the client selects the "wrong" connection to
> push
> >> out the data, I assume the request would timeout; which should lead to
> >> disconnecting from the node and reselecting another node as you
> mentioned,
> >> via the least loaded node.
> >>
> >> Cheers,
> >> P
> >>
> >>
> >> On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
> >> andrew_schofield_j...@outlook.com> wrote:
> >>
> >>> Hi Philip,
> >>> Thanks for your vote and interest in the KIP.
> >>>
> >>> KIP-714 does not introduce any new client metrics, and that’s
> >> intentional.
> >>> It does
> >>> tell how all of the client metrics can have their names
> transformed
> >>> into
> >>> equivalent "telemetry metric names”, and then potentially used in
> metrics
> >>> subscriptions.
> >>>
> >>> I am interested in the idea of client’s leader epoch in this context,
> but
> >>> I don’t have
> >>> an immediate plan for how best to do this, and it would take another
> KIP
> >>> to enhance
> >>> existing metrics or introduce some new ones. Those would then naturally
> >> be
> >>> applicable to the metrics push introduced in KIP-714.
> >>>
> >>> In a similar vein, there are no existing client metrics specifically
> for
> >>> auto-commit.
> >>> We could add them to Kafka, but I really think this is just an example
> of
> >>> asynchronous
> >>> commit in which the application has decided not to specify when the
> >> commit
> >>> should
> >>> begin.
> >>>
> >>> It is possible to increase the cadence of pushing by modifying the
> >>> interval.ms
> >>> configuration property of the CLIENT_METRICS resource.
> >>>
> >>> There is an “assigned-partitions” metric for each consumer, but not one
> >> for
> >>> active partitions. We could add one, again as a follow-on KIP.
> >>>
> >>> I take your point about holding on to a connection in a channel which
> >> might
> >>> experience congestion. Do you have a suggestion for how to improve on
> >> this?
> >>> For example, the client does have the concept of a least-loaded node.
> >> Maybe
> >>> this is something we should investigate in the implementation and
> decide
> >>> on the
> >>> best approach. In general, I think sticking with the same node for
> >>> consecutive
> >>> pushes is best, but if you choose the “wrong” node to start with, it’s
> >> not
> >>> ideal.
> >>>
> >>> Thanks,
> >>> Andrew
> >>>
>  On 8 Sep 2023, at 19:29, Philip Nee  wrote:
> 
>  Hey Andrew -
> 
>  +1 but I don't have a binding vote!
> 
>  It took me a while to go through the KIP. Here are some of my notes
> >>> during
>  the reading:
> 
>  *Metrics*
>  - Should we care about the client's leader epoch? There is a case
> where
> >>> the
>  user recreates the topic, but the consumer thinks it is still the same
>  topic and therefore, attempts to start from an offset that doesn't
> >> exist.
>  KIP-848 addresses this issue, but I can still see some potential
> >> benefits
>  from knowing the client's epoch information.
>  - I assume poll idle is similar to poll interval: I needed to read the
>  description a few times.
>  - I don't have a clear use case in mind for the commit latency, but I
> >> do
>  think sometimes people lack clarity about how much progress was
> tracked
> >>> by
>  the auto-commit.  Would tracking auto-commit-related metrics be
> >> useful? I
>  was thinking: the last offset committed or the actual cadence in ms.
>  - Are there cases when we need to increase the cadence of telemetry
> >> data
>  push? i.e. variable interval.
>  - Thanks for implementing the randomized 

[jira] [Resolved] (KAFKA-15589) Flaky kafka.server.FetchRequestTest

2023-10-11 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-15589.

Resolution: Duplicate

Duplicate of https://issues.apache.org/jira/browse/KAFKA-15566

> Flaky  kafka.server.FetchRequestTest
> 
>
> Key: KAFKA-15589
> URL: https://issues.apache.org/jira/browse/KAFKA-15589
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Priority: Major
> Attachments: image-2023-10-11-13-19-37-012.png
>
>
> I've been seeing a lot of test failures recently for  
> kafka.server.FetchRequestTest
> Specifically: !image-2023-10-11-13-19-37-012.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15590) Replica.updateFetchState should also fence updates with stale leader epoch

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15590:
--

 Summary: Replica.updateFetchState should also fence updates with 
stale leader epoch
 Key: KAFKA-15590
 URL: https://issues.apache.org/jira/browse/KAFKA-15590
 Project: Kafka
  Issue Type: Bug
Reporter: Calvin Liu


This is a follow-up ticket for KAFKA-15221.

There is another type of race in which a fetch request with a stale leader epoch
can update the fetch state.
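A minimal illustration of the fencing idea (hypothetical method shape, not the
actual Replica code):

  // In addition to the broker-epoch fencing from KAFKA-15221, reject
  // fetch-state updates that carry a stale leader epoch.
  public synchronized void updateFetchState(ReplicaState newState, int requestLeaderEpoch) {
      if (requestLeaderEpoch < currentLeaderEpoch) {
          return; // stale request: drop it instead of overwriting newer state
      }
      state = newState;
  }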



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2274

2023-10-11 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 316383 lines...]
Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldVerifyIfPendingTaskToRecycleExist() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldVerifyIfPendingTaskToRecycleExist() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldAddAndRemovePendingTaskToUpdateInputPartitions() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldAddAndRemovePendingTaskToUpdateInputPartitions() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
onlyRemovePendingTaskToCloseDirtyShouldRemoveTaskFromPendingUpdateActions() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
onlyRemovePendingTaskToCloseDirtyShouldRemoveTaskFromPendingUpdateActions() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldAddAndRemovePendingTaskToSuspend() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldAddAndRemovePendingTaskToSuspend() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldVerifyIfPendingTaskToInitExist() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldVerifyIfPendingTaskToInitExist() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
onlyRemovePendingTaskToCloseCleanShouldRemoveTaskFromPendingUpdateActions() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
onlyRemovePendingTaskToCloseCleanShouldRemoveTaskFromPendingUpdateActions() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldDrainPendingTasksToCreate() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldDrainPendingTasksToCreate() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldKeepAddedTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > TasksTest > 
shouldKeepAddedTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > StateQueryResultTest 
> More than one query result throws IllegalArgumentException STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > StateQueryResultTest 
> More than one query result throws IllegalArgumentException PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > StateQueryResultTest 
> Zero query results shouldn't error STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > StateQueryResultTest 
> Zero query results shouldn't error PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > StateQueryResultTest 
> Valid query results still works STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > StateQueryResultTest 
> Valid query results still works PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@3f09d7f6, 
org.apache.kafka.test.MockInternalProcessorContext@354eca53 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@3f09d7f6, 
org.apache.kafka.test.MockInternalProcessorContext@354eca53 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 77 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@778157d1, 
org.apache.kafka.test.MockInternalProcessorContext@53761d74 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 77 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@778157d1, 

Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.6 #89

2023-10-11 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 204709 lines...]

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToCloseCleanShouldRemoveTaskFromPendingUpdateActions() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToCloseCleanShouldRemoveTaskFromPendingUpdateActions() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldDrainPendingTasksToCreate() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldDrainPendingTasksToCreate() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldKeepAddedTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldKeepAddedTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldAssignTasksThatCanBeSystemTimePunctuated() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldAssignTasksThatCanBeSystemTimePunctuated() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotUnassignNotOwnedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotUnassignNotOwnedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotSetUncaughtExceptionsTwice() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotSetUncaughtExceptionsTwice() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > 
shouldNotAssignTasksForPunctuationIfPunctuationDisabled() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > 
shouldNotAssignTasksForPunctuationIfPunctuationDisabled() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldAddTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldAddTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotAssignAnyLockedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotAssignAnyLockedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldRemoveTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldRemoveTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotRemoveAssignedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotRemoveAssignedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldAssignTaskThatCanBeProcessed() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldAssignTaskThatCanBeProcessed() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotRemoveUnlockedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldNotRemoveUnlockedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldReturnAndClearExceptionsOnDrainExceptions() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldReturnAndClearExceptionsOnDrainExceptions() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldUnassignTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > shouldUnassignTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > 
shouldNotAssignTasksForProcessingIfProcessingDisabled() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskManagerTest > 

Re: [VOTE] KIP-714: Client metrics and observability

2023-10-11 Thread Matthias J. Sax

+1 (binding)

On 9/13/23 5:48 PM, Jason Gustafson wrote:

Hey Andrew,

+1 on the KIP. For many users of Kafka, it may not be fully understood how
much of a challenge client monitoring is. With tens of clients in a
cluster, it is already difficult to coordinate metrics collection. When
there are thousands of clients, and when the cluster operator has no
control over them, it is essentially impossible. For the fat clients that
we have, the lack of useful telemetry is a huge operational gap.
Consistency between clients has also been a major challenge. I think the
effort toward standardization in this KIP will have some positive impact
even in deployments which have effective client-side monitoring. Overall, I
think this proposal will provide a lot of value across the board.

Best,
Jason

On Wed, Sep 13, 2023 at 9:50 AM Philip Nee  wrote:


Hey Andrew -

Thank you for taking the time to reply to my questions. I'm just adding
some notes to this discussion.

1. epoch: It can be helpful to know the delta of the client side and the
actual leader epoch.  It is helpful to understand why sometimes commit
fails/client not making progress.
2. Client connection: If the client selects the "wrong" connection to push
out the data, I assume the request would timeout; which should lead to
disconnecting from the node and reselecting another node as you mentioned,
via the least loaded node.

Cheers,
P


On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:


Hi Philip,
Thanks for your vote and interest in the KIP.

KIP-714 does not introduce any new client metrics, and that’s

intentional.

It does
tell how all of the client metrics can have their names transformed
into
equivalent "telemetry metric names”, and then potentially used in metrics
subscriptions.

I am interested in the idea of client’s leader epoch in this context, but
I don’t have
an immediate plan for how best to do this, and it would take another KIP
to enhance
existing metrics or introduce some new ones. Those would then naturally

be

applicable to the metrics push introduced in KIP-714.

In a similar vein, there are no existing client metrics specifically for
auto-commit.
We could add them to Kafka, but I really think this is just an example of
asynchronous
commit in which the application has decided not to specify when the

commit

should
begin.

It is possible to increase the cadence of pushing by modifying the
interval.ms
configuration property of the CLIENT_METRICS resource.
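For illustration, this could be done with the admin client roughly as follows.
A sketch only: the CLIENT_METRICS resource type and the subscription name
"my-subscription" assume the KIP's proposal, and adminProps is a placeholder
for the caller's admin configuration:

  ConfigResource resource =
      new ConfigResource(ConfigResource.Type.CLIENT_METRICS, "my-subscription");
  AlterConfigOp setInterval = new AlterConfigOp(
      new ConfigEntry("interval.ms", "10000"), AlterConfigOp.OpType.SET);
  try (Admin admin = Admin.create(adminProps)) {
      // Push metrics every 10s instead of the default cadence.
      admin.incrementalAlterConfigs(Map.of(resource, List.of(setInterval))).all().get();
  }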

There is an “assigned-partitions” metric for each consumer, but not one

for

active partitions. We could add one, again as a follow-on KIP.

I take your point about holding on to a connection in a channel which

might

experience congestion. Do you have a suggestion for how to improve on

this?

For example, the client does have the concept of a least-loaded node.

Maybe

this is something we should investigate in the implementation and decide
on the
best approach. In general, I think sticking with the same node for
consecutive
pushes is best, but if you choose the “wrong” node to start with, it’s

not

ideal.

Thanks,
Andrew


On 8 Sep 2023, at 19:29, Philip Nee  wrote:

Hey Andrew -

+1 but I don't have a binding vote!

It took me a while to go through the KIP. Here are some of my notes

during

the reading:

*Metrics*
- Should we care about the client's leader epoch? There is a case where

the

user recreates the topic, but the consumer thinks it is still the same
topic and therefore, attempts to start from an offset that doesn't

exist.

KIP-848 addresses this issue, but I can still see some potential

benefits

from knowing the client's epoch information.
- I assume poll idle is similar to poll interval: I needed to read the
description a few times.
- I don't have a clear use case in mind for the commit latency, but I

do

think sometimes people lack clarity about how much progress was tracked

by

the auto-commit.  Would tracking auto-commit-related metrics be

useful? I

was thinking: the last offset committed or the actual cadence in ms.
- Are there cases when we need to increase the cadence of telemetry

data

push? i.e. variable interval.
- Thanks for implementing the randomized initial metric push; I think

it

is

really important.
- Is there a potential use case for tracking the number of active
partitions? The consumer can pause partitions via API, during

revocation,

or during offset reset for the stream.

*Connections*:
- The KIP stated that it will keep the same connection until the

connection

is disconnected. I wonder if that could potentially cause congestion if

it

is already a busy channel, which leads to connection timeout and
subsequently disconnection.

Thanks,
P

On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:


Bumping the voting thread for KIP-714.

So far, we have:
Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)

Thanks,
Andrew


On 4 Aug 2023, at 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-11 Thread Matthias J. Sax

Thanks!

On 10/10/23 11:31 PM, Andrew Schofield wrote:

Matthias,
Yes, I think that’s a sensible way forward and the interface you propose looks 
good. I’ll update the KIP accordingly.

Thanks,
Andrew


On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:

Andrew,

yes I would like to get this change into KIP-714 right way. Seems to be 
important, as we don't know if/when a follow-up KIP for Kafka Streams would 
land.

I was also thinking (and discussed with a few others) how to expose it, and we 
would propose the following:

We add a new method to `KafkaStreams` class:

public ClientsInstanceIds clientsInstanceIds(Duration timeout);

The returned object is like below:

  public class ClientsInstanceIds {
    // we only have a single admin client per KS instance
    String adminInstanceId();

    // we only have a single global consumer per KS instance (if any)
    // Optional<> because we might not have a global thread
    Optional<String> globalConsumerInstanceId();

    // return a <threadKey, clientInstanceId> mapping
    // for the underlying (restore-)consumers/producers
    Map<String, String> mainConsumerInstanceIds();
    Map<String, String> restoreConsumerInstanceIds();
    Map<String, String> producerInstanceIds();
  }

For the `threadKey`, we would use some pattern like this:

  [Stream|StateUpdater]Thread-<index>
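For illustration, a caller could then do something like this (a sketch only,
assuming String-typed ids as in the interface above; the timeout value is
arbitrary):

  KafkaStreams streams = new KafkaStreams(topology, props);
  streams.start();
  ClientsInstanceIds ids = streams.clientsInstanceIds(Duration.ofSeconds(30));
  // Log the instance ids so broker-side metrics can be matched to this app.
  System.out.println("admin: " + ids.adminInstanceId());
  ids.mainConsumerInstanceIds().forEach((threadKey, instanceId) ->
      System.out.println(threadKey + " -> " + instanceId));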


Would this work from your POV?



-Matthias


On 10/9/23 2:15 AM, Andrew Schofield wrote:

Hi Matthias,
Good point. Makes sense to me.
Is this something that can also be included in the proposed Kafka Streams 
follow-on KIP, or would you prefer that I add it to KIP-714?
I have a slight preference for the former to put all of the KS enhancements 
into a separate KIP.
Thanks,
Andrew

On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:

Thanks Andrew. SGTM.

One point you did not address is the idea to add a method to `KafkaStreams` 
similar to the proposed `clientInstanceId()` that will be added to 
consumer/producer/admin clients.

Without addressing this, Kafka Streams users won't have a way to get the 
assigned `instanceId` of the internally created clients, and thus it would be 
very difficult for them to know which of the metrics the broker receives belong 
to a Kafka Streams app. It seems they would only find the `instanceIds` in the 
log4j output if they enable client logging?

Of course, because there are multiple clients inside Kafka Streams, the return type cannot be a 
single "String", but must be some more complex data structure -- we could either add 
a new class, or return a Map using a client key that maps to the 
`instanceId`.

For example we could use the following key:

   [Global]StreamThread[-<index>][-restore][consumer|producer]

(Of course, only the valid combination.)

Or maybe even better, we might want to return a `Future` because collecting all 
the `instanceId`s might be a blocking call on each client? I already have a few 
ideas about how it could be implemented, but I don't think it must be discussed on the 
KIP, as it's an implementation detail.

Thoughts?


-Matthias

On 10/6/23 4:21 AM, Andrew Schofield wrote:

Hi Matthias,
Thanks for your comments. I agree that a follow-up KIP for Kafka Streams makes 
sense. This KIP currently has made a bit
of an effort to embrace KS, but it’s not enough by a long way.
I have removed `application.id`. This should be done 
properly in the follow-up KIP. I don’t believe there’s a downside to
removing it from this KIP.
I have reworded the statement about temporarily. In practice, the 
implementation of this KIP that’s going on while the voting
progresses happens to use delta temporality, but that’s an implementation 
detail. Supporting clients must support both
temporalities.
I thought about exposing the client instance ID as a metric, but non-numeric 
metrics are not usual practice and tools
do not universally support them. I don’t think the KIP is improved by adding 
one now.
I have also added constants for the various Config classes for 
ENABLE_METRICS_PUSH_CONFIG, including to
StreamsConfig. It’s best to be explicit about this.
Thanks,
Andrew

On 2 Oct 2023, at 23:47, Matthias J. Sax  wrote:

Hi,

I did not pay attention to this KIP in the past; seems it was on-hold for a 
while.

Overall it sounds very useful, and I think we should extend this with a follow 
up KIP for Kafka Streams. What is unclear to me at this point is the statement:


Kafka Streams applications have an application.id configured and this 
identifier should be included as the application_id metrics label.


The `application.id` is currently only used as the (main) consumer's `group.id` 
(and is part of an auto-generated `client.id` if the user does not set one).

This comment related to:


The following labels should be added by the client as appropriate before 
metrics are pushed.


Given that Kafka Streams uses the consumer/producer/admin client as "black 
boxes", a client does at this point not know that it's part of a Kafka Streams 
application, and thus, it won't be able to attach any such label to the metrics 

[jira] [Created] (KAFKA-15589) Flaky kafka.server.FetchRequestTest

2023-10-11 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-15589:
--

 Summary: Flaky  kafka.server.FetchRequestTest
 Key: KAFKA-15589
 URL: https://issues.apache.org/jira/browse/KAFKA-15589
 Project: Kafka
  Issue Type: Task
Reporter: Justine Olshan
 Attachments: image-2023-10-11-13-19-37-012.png

I've been seeing a lot of test failures recently for  
kafka.server.FetchRequestTest

Specifically: !image-2023-10-11-13-19-37-012.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15588) Purge the unsent offset commits/fetches when the member is fenced

2023-10-11 Thread Philip Nee (Jira)
Philip Nee created KAFKA-15588:
--

 Summary: Purge the unsent offset commits/fetches when the member 
is fenced
 Key: KAFKA-15588
 URL: https://issues.apache.org/jira/browse/KAFKA-15588
 Project: Kafka
  Issue Type: Bug
Reporter: Philip Nee
Assignee: Philip Nee


When the member is fenced/failed, we should purge the in-flight offset commits 
and fetches. HeartbeatRequestManager should be able to handle this.
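A rough sketch of the intended behavior (the names are hypothetical, not the
actual request-manager code):

  // On fencing, fail all unsent offset commits/fetches so callers don't
  // block on requests that can never succeed under the old member epoch.
  void onMemberFenced() {
      for (UnsentRequest request : unsentOffsetRequests) {
          request.future().completeExceptionally(
              new KafkaException("member was fenced"));
      }
      unsentOffsetRequests.clear();
  }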



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.6 #88

2023-10-11 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 307216 lines...]
streams-9: SMOKE-TEST-CLIENT-CLOSED
streams-7: SMOKE-TEST-CLIENT-CLOSED
streams-2: SMOKE-TEST-CLIENT-CLOSED
streams-4: SMOKE-TEST-CLIENT-CLOSED
streams-0: SMOKE-TEST-CLIENT-CLOSED
streams-5: SMOKE-TEST-CLIENT-CLOSED
streams-6: SMOKE-TEST-CLIENT-CLOSED
streams-8: SMOKE-TEST-CLIENT-CLOSED
streams-7: SMOKE-TEST-CLIENT-CLOSED
streams-10: SMOKE-TEST-CLIENT-CLOSED
streams-5: SMOKE-TEST-CLIENT-CLOSED
> Task :core:compileTestScala

Deprecated Gradle features were used in this build, making it incompatible with 
Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings 
and determine if they come from your own scripts or plugins.

For more on this, please refer to 
https://docs.gradle.org/8.2.1/userguide/command_line_interface.html#sec:command_line_warnings
 in the Gradle documentation.

BUILD SUCCESSFUL in 2h 54m 7s
296 actionable tasks: 109 executed, 187 up-to-date

Publishing build scan...
https://ge.apache.org/s/ijbvxa37sbwji


See the profiling report at: 
file:///home/jenkins/workspace/Kafka_kafka_3.6/build/reports/profile/profile-2023-10-11-15-26-49.html
A fine-grained performance profile is available: use the --scan option.
[Pipeline] junit
Recording test results
[Checks API] No suitable checks publisher found.
[Pipeline] echo
Skipping Kafka Streams archetype test for Java 20
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // timestamps
[Pipeline] }
[Pipeline] // timeout
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
> Task :core:testClasses
> Task :streams:compileTestJava
> Task :streams:testClasses
> Task :streams:testJar
> Task :streams:testSrcJar
> Task :streams:publishMavenJavaPublicationToMavenLocal
> Task :streams:publishToMavenLocal

Deprecated Gradle features were used in this build, making it incompatible with 
Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings 
and determine if they come from your own scripts or plugins.

For more on this, please refer to 
https://docs.gradle.org/8.2.1/userguide/command_line_interface.html#sec:command_line_warnings
 in the Gradle documentation.

BUILD SUCCESSFUL in 5m 40s
94 actionable tasks: 41 executed, 53 up-to-date

Publishing build scan...
https://ge.apache.org/s/l5xwfuuso3h44

[Pipeline] sh
+ grep ^version= gradle.properties
+ cut -d= -f 2
[Pipeline] dir
Running in 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.6/streams/quickstart
[Pipeline] {
[Pipeline] sh
+ mvn clean install -Dgpg.skip
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Kafka Streams :: Quickstart[pom]
[INFO] streams-quickstart-java[maven-archetype]
[INFO] 
[INFO] < org.apache.kafka:streams-quickstart >-
[INFO] Building Kafka Streams :: Quickstart 3.6.1-SNAPSHOT[1/2]
[INFO]   from pom.xml
[INFO] [ pom ]-
[INFO] 
[INFO] --- clean:3.0.0:clean (default-clean) @ streams-quickstart ---
[INFO] 
[INFO] --- remote-resources:1.5:process (process-resource-bundles) @ 
streams-quickstart ---
[INFO] 
[INFO] --- site:3.5.1:attach-descriptor (attach-descriptor) @ 
streams-quickstart ---
[INFO] 
[INFO] --- gpg:1.6:sign (sign-artifacts) @ streams-quickstart ---
[INFO] 
[INFO] --- install:2.5.2:install (default-install) @ streams-quickstart ---
[INFO] Installing 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.6/streams/quickstart/pom.xml
 to 
/home/jenkins/.m2/repository/org/apache/kafka/streams-quickstart/3.6.1-SNAPSHOT/streams-quickstart-3.6.1-SNAPSHOT.pom
[INFO] 
[INFO] --< org.apache.kafka:streams-quickstart-java >--
[INFO] Building streams-quickstart-java 3.6.1-SNAPSHOT[2/2]
[INFO]   from java/pom.xml
[INFO] --[ maven-archetype ]---
[INFO] 
[INFO] --- clean:3.0.0:clean (default-clean) @ streams-quickstart-java ---
[INFO] 
[INFO] --- remote-resources:1.5:process (process-resource-bundles) @ 
streams-quickstart-java ---
[INFO] 
[INFO] --- resources:2.7:resources (default-resources) @ 
streams-quickstart-java ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 6 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- resources:2.7:testResources (default-testResources) @ 
streams-quickstart-java ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 2 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- archetype:2.2:jar (default-jar) @ streams-quickstart-java ---
[INFO] Building archetype jar: 

[jira] [Created] (KAFKA-15587) Limit the number of records per batch in GroupCoordinatorShard#cleanupGroupMetadata

2023-10-11 Thread Jeff Kim (Jira)
Jeff Kim created KAFKA-15587:


 Summary: Limit the number of records per batch in 
GroupCoordinatorShard#cleanupGroupMetadata
 Key: KAFKA-15587
 URL: https://issues.apache.org/jira/browse/KAFKA-15587
 Project: Kafka
  Issue Type: Task
Reporter: Jeff Kim


The existing group/offset expiration may generate an unbounded number of 
records. This also applies to the new coordinator 
(https://github.com/apache/kafka/pull/14467).

 

We should limit the number of records per batch. One caveat is that other API 
results assume their records are appended atomically, so we need to keep this 
in mind if we implement the change in CoordinatorPartitionWriter#append. 
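A generic sketch of the batching idea (not the actual coordinator code):

  // Split the expiration tombstones into bounded batches instead of
  // appending them all at once.
  static <T> List<List<T>> chunk(final List<T> records, final int maxRecordsPerBatch) {
      final List<List<T>> batches = new ArrayList<>();
      for (int i = 0; i < records.size(); i += maxRecordsPerBatch) {
          batches.add(records.subList(i, Math.min(i + maxRecordsPerBatch, records.size())));
      }
      return batches;
  }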



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15586) Clean shutdown detection, server side

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15586:
--

 Summary: Clean shutdown detection, server side
 Key: KAFKA-15586
 URL: https://issues.apache.org/jira/browse/KAFKA-15586
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


Upon broker registration, if the broker had an unclean shutdown, it should 
be removed from all ELRs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15585) DescribeTopic API

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15585:
--

 Summary: DescribeTopic API
 Key: KAFKA-15585
 URL: https://issues.apache.org/jira/browse/KAFKA-15585
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


Adding the new DescribeTopic API + the admin client and server-side handling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-976: Cluster-wide dynamic log adjustment for Kafka Connect

2023-10-11 Thread Chris Egerton
Hi all,

Thanks for the votes! I'll cast a final +1 myself and close the vote out.

This KIP passes with the following +1 votes (and no +0 or -1 votes):

• Greg Harris (binding)
• Yash Mayya (binding)
• Federico Valeri
• Mickael Maison (binding)
• hudeqi
• Hector Geraldino
• Chris Egerton (binding, author)

I've begun implementing the feature and plan on publishing the first PR
sometime this week or the next.

Cheers,

Chris

On Mon, Oct 9, 2023 at 2:32 PM Hector Geraldino (BLOOMBERG/ 919 3RD A) <
hgerald...@bloomberg.net> wrote:

> Good stuff, +1 (non-binding) from me as well
>
> From: dev@kafka.apache.org  At: 10/09/23 05:16:06 UTC-4:00  To:
> dev@kafka.apache.org
> Subject: Re: [VOTE] KIP-976: Cluster-wide dynamic log adjustment for Kafka
> Connect
>
> Hi Chris,
>
> +1 (non binding)
>
> Thanks
> Fede
>
> On Sun, Oct 8, 2023 at 10:11 AM Yash Mayya  wrote:
> >
> > Hi Chris,
> >
> > Thanks for the KIP!
> > +1 (binding)
> >
> > Yash
> >
> > On Fri, Oct 6, 2023 at 9:54 PM Greg Harris  >
> > wrote:
> >
> > > Hey Chris,
> > >
> > > Thanks for the KIP!
> > > I think that preserving the ephemeral nature of the logging change is
> > > the right choice here, and using the config topic for intra-cluster
> > > broadcast is better than REST forwarding.
> > >
> > > +1 (binding)
> > >
> > > Thanks,
> > > Greg
> > >
> > > On Fri, Oct 6, 2023 at 9:05 AM Chris Egerton 
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I'd like to call for a vote on KIP-976, which augments the existing
> > > dynamic
> > > > logger adjustment REST API for Kafka Connect to apply changes
> > > cluster-wide
> > > > instead on a per-worker basis.
> > > >
> > > > The KIP:
> > > >
> > >
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-976:+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > > >
> > > > The discussion thread:
> > > > https://lists.apache.org/thread/w3x3f3jmyd1vfjxho06y8xgt6mhhzpl5
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > >
>
>
>


[jira] [Created] (KAFKA-15584) ELR leader election

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15584:
--

 Summary: ELR leader election
 Key: KAFKA-15584
 URL: https://issues.apache.org/jira/browse/KAFKA-15584
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


With the ELR, here are the changes related to leader election (see the sketch 
below):
 * The ISR is allowed to be empty.
 * An ELR member can be elected when the ISR is empty.
 * When the ISR and ELR are both empty, the lastKnownLeader can be uncleanly 
elected.
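The preference order could be sketched as follows (illustrative only, not the
controller's actual code):

  Optional<Integer> electLeader(List<Integer> isr, List<Integer> elr,
                                Optional<Integer> lastKnownLeader,
                                boolean allowUnclean) {
      if (!isr.isEmpty()) return Optional.of(isr.get(0));   // clean election
      if (!elr.isEmpty()) return Optional.of(elr.get(0));   // ELR election
      if (allowUnclean) return lastKnownLeader;             // unclean election
      return Optional.empty();                              // leaderless
  }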



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15583) High watermark can only advance if ISR size is larger than min ISR

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15583:
--

 Summary: High watermark can only advance if ISR size is larger 
than min ISR
 Key: KAFKA-15583
 URL: https://issues.apache.org/jira/browse/KAFKA-15583
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


This is the new high watermark advancement requirement.
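In code terms, the check is roughly the following (a sketch; it assumes the
KIP's semantics of advancing the HWM when the ISR size is at least
min.insync.replicas):

  boolean canAdvanceHighWatermark(final int isrSize, final int minInsyncReplicas) {
      return isrSize >= minInsyncReplicas;
  }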



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15582) Clean shutdown detection, broker side

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15582:
--

 Summary: Clean shutdown detection, broker side
 Key: KAFKA-15582
 URL: https://issues.apache.org/jira/browse/KAFKA-15582
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


The clean shutdown file can now include the broker epoch before shutdown. 
During the broker start process, the broker should extract the broker epoch 
from the clean shutdown file. If successful, it sends the broker epoch in the 
broker registration request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15581) Introduce ELR

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15581:
--

 Summary: Introduce ELR
 Key: KAFKA-15581
 URL: https://issues.apache.org/jira/browse/KAFKA-15581
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


Introduce the PartitionRecord, PartitionChangeRecord, and the basic ELR handling 
in the controller.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15580) KIP-966: Unclean Recovery

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15580:
--

 Summary: KIP-966: Unclean Recovery
 Key: KAFKA-15580
 URL: https://issues.apache.org/jira/browse/KAFKA-15580
 Project: Kafka
  Issue Type: New Feature
Reporter: Calvin Liu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15579) KIP-966: Eligible Leader Replicas

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15579:
--

 Summary: KIP-966: Eligible Leader Replicas
 Key: KAFKA-15579
 URL: https://issues.apache.org/jira/browse/KAFKA-15579
 Project: Kafka
  Issue Type: New Feature
Reporter: Calvin Liu
Assignee: Calvin Liu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-960: Support single-key_single-timestamp interactive queries (IQv2) for versioned state stores

2023-10-11 Thread Walker Carlson
+1 (binding)

Thanks for the kip Alieh!

Walker

On Wed, Oct 11, 2023 at 3:52 AM Bruno Cadonna  wrote:

> Thanks for the KIP, Alieh!
>
> +1 (binding)
>
> Best,
> Bruno
>
> On 10/10/23 1:14 AM, Matthias J. Sax wrote:
> > One more nit: as discussed on the related KIP-698 thread, we should not
> > use `get` as prefix for the getters.
> >
> > So it should be `K key()` and `Optional asOfTimestamp()`.
> >
> >
> > Otherwise the KIP LGTM.
> >
> >
> > +1 (binding)
> >
> >
> > -Matthias
> >
> > On 10/6/23 2:50 AM, Alieh Saeedi wrote:
> >> Hi everyone,
> >>
> >> Since KIP-960 is reduced to the simplest IQ type and all further
> comments
> >> are related to the following-up KIPs, I decided to finalize it at this
> >> point.
> >>
> >>
> >> A huge thank you to everyone who has reviewed this KIP (and also the
> >> following-up ones), and
> >> participated in the discussion thread!
> >>
> >> I'd also like to thank you in advance for taking the time to vote.
> >>
> >> Best,
> >> Alieh
> >>
>


[jira] [Created] (KAFKA-15578) Run System Tests for Old protocol in the New Coordinator

2023-10-11 Thread Ritika Muduganti (Jira)
Ritika Muduganti created KAFKA-15578:


 Summary: Run System Tests for Old protocol in the New Coordinator
 Key: KAFKA-15578
 URL: https://issues.apache.org/jira/browse/KAFKA-15578
 Project: Kafka
  Issue Type: Sub-task
Reporter: Ritika Muduganti
Assignee: Ritika Muduganti


Change existing system tests related to the consumer group protocol and group 
coordinator to test the old protocol running with the new coordinator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE]KIP-966: Eligible Leader Replicas

2023-10-11 Thread Calvin Liu
Hi,
The KIP has received 3 binding votes from Justine Olshan, Jun Rao and Colin
McCabe.
Also thanks to Jeff Kim, Jack Vanlightly, Artem Livshits, David Arthur,
David Jacot for the comments to make the KIP better.
Thanks for the help!

On Mon, Oct 9, 2023 at 10:20 AM Calvin Liu  wrote:

> Hi Colin,
> Thanks for the feedback. I have updated the KIP but with the following
> changes.
> --The request is still grouped by topics. For each topic, the caller can
> specify either partition IDs or a range of partitions to query. If it is a
> range query, the request should specify the "first partition id"; then the
> partitions with an ID larger than or equal to it will be returned.
> --The response is still grouped by topics. When the quota limit is reached:
>  If it is a range request, the "next partition id" will be specified
> within the topic when only a subset of the partitions can be returned. When
> the whole topic can't be returned, the topic will have the
> error REQUEST_LIMIT_REACHED.
>  If it is a partition-specific topic, all the partitions that can't be
> returned will have the error REQUEST_LIMIT_REACHED.
> Note that a request can mix partition-specific and range-request topics.
>
> On Fri, Oct 6, 2023 at 4:30 PM Colin McCabe  wrote:
>
>> Hi Calvin,
>>
>> Thanks for the KIP. I think the config discussion was good and I have no
>> more comments there.
>>
>> I have one last thing I think we should fix up:
>>
>> I think we should improve DescribeTopicRequest. The current mechanism of
>> "you can only list 20 topics" doesn't do a very good job of limiting the
>> results. After all, if those topics only have 1 partition each, this means
>> a pretty small RPC. If they have 10,000 partitions each, then it's a very
>> large RPC.
>>
>> I think a better mechanism would be:
>> 1. Have the request be a list of (topic_name, partition_id) pairs plus a
>> (first_topic_name, first_partition_id) pair.
>> (for the initial request, first_topic_name="" and first_partition_id=-1,
>> of course)
>> (if partition_id = -1 then we should list all partitions for the topic)
>>
>> 2. When returning results, sort everything alphabetically and return the
>> first 1000, plus a (next_topic, next_partition_id) pair. (if there is
>> nothing more to return, next_topic = null.)
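A rough sketch of the cursor-style paging described above (hypothetical types
and field names, not the real API):

  String nextTopic = "";
  int nextPartitionId = -1;
  do {
      // Each page returns at most ~1000 partitions plus a cursor.
      DescribeTopicResponse response = send(new DescribeTopicRequest(
          requestedPartitions, nextTopic, nextPartitionId));
      handle(response.results());
      nextTopic = response.nextTopic();      // null once everything is returned
      nextPartitionId = response.nextPartitionId();
  } while (nextTopic != null);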
>>
>> With those changes I would be +1
>>
>> best,
>> Colin
>>
>>
>> If the response wasn't long enough, the caller can set
>> On Wed, Oct 4, 2023, at 17:44, Jun Rao wrote:
>> > Hi, Calvin,
>> >
>> > Thanks for the KIP. +1 from me too.
>> >
>> > Jun
>> >
>> > On Wed, Sep 20, 2023 at 5:28 PM Justine Olshan
>> 
>> > wrote:
>> >
>> >> Thanks Calvin.
>> >> I think this will be very helpful going forward to minimize data loss.
>> >>
>> >> +1 from me (binding)
>> >>
>> >> Justine
>> >>
>> >> On Wed, Sep 20, 2023 at 3:42 PM Calvin Liu > >
>> >> wrote:
>> >>
>> >> > Hi all,
>> >> > I'd like to call for a vote on KIP-966 which includes a series of
>> >> > enhancements to the current ISR model.
>> >> >
>> >> >- Introduce the new HWM advancement requirement which enables the
>> >> system
>> >> >to have more potentially data-safe replicas.
>> >> >- Introduce Eligible Leader Replicas(ELR) to represent the above
>> >> >data-safe replicas.
>> >> >- Introduce Unclean Recovery process which will deterministically
>> >> choose
>> >> >the best replica during an unclean leader election.
>> >> >
>> >> >
>> >> > KIP:
>> >> >
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
>> >> >
>> >> > Discussion thread:
>> >> > https://lists.apache.org/thread/gpbpx9kpd7c62dm962h6kww0ghgznb38
>> >> >
>> >>
>>
>


[jira] [Resolved] (KAFKA-15221) Potential race condition between requests from rebooted followers

2023-10-11 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-15221.
-
Fix Version/s: (was: 3.5.2)
   Resolution: Fixed

> Potential race condition between requests from rebooted followers
> -
>
> Key: KAFKA-15221
> URL: https://issues.apache.org/jira/browse/KAFKA-15221
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.5.0
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Blocker
> Fix For: 3.7.0
>
>
> When the leader processes the fetch request, it does not acquire locks when 
> updating the replica fetch state. Then there can be a race between the fetch 
> requests from a rebooted follower.
> T0, broker 1 sends a fetch to broker 0(leader). At the moment, broker 1 is 
> not in ISR.
> T1, broker 1 crashes.
> T2 broker 1 is back online and receives a new broker epoch. Also, it sends a 
> new Fetch request.
> T3 broker 0 receives the old fetch requests and decides to expand the ISR.
> T4 Right before broker 0 starts to fill the AlterPartitoin request, the new 
> fetch request comes in and overwrites the fetch state. Then broker 0 uses the 
> new broker epoch on the AlterPartition request.
> In this way, the AlterPartition request can get around KIP-903 and wrongly 
> update the ISR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2273

2023-10-11 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2023-10-11 Thread Calvin Liu
Hi Jun
-Good catch, yes, we don't need the -1 in the DescribeTopicRequest.
-No new value is added. The LeaderRecoveryState will still be set to 1 if
we have an unclean leader election. The unclean leader election includes
the old random way and the unclean recovery. During the unclean recovery,
the LeaderRecoveryState will not change until the controller decides to
update the records with the new leader.
Thanks

On Wed, Oct 11, 2023 at 9:02 AM Jun Rao  wrote:

> Hi, Calvin,
>
> Another thing. Currently, when there is an unclean leader election, we set
> the LeaderRecoveryState in PartitionRecord and PartitionChangeRecord to 1.
> With the KIP, will there be new values for LeaderRecoveryState? If not,
> when will LeaderRecoveryState be set to 1?
>
> Thanks,
>
> Jun
>
> On Tue, Oct 10, 2023 at 4:24 PM Jun Rao  wrote:
>
> > Hi, Calvin,
> >
> > One more comment.
> >
> > "The first partition to fetch details for. -1 means to fetch all
> > partitions." It seems that FirstPartitionId of 0 naturally means fetching
> > all partitions?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Oct 10, 2023 at 12:40 PM Calvin Liu 
> > wrote:
> >
> >> Hi Jun,
> >> Yeah, with the current Metadata request handling, we only return errors
> on
> >> the Topic level, like topic not found. It seems that querying a specific
> >> partition is not a valid use case. Will update.
> >> Thanks
> >>
> >> On Tue, Oct 10, 2023 at 11:55 AM Jun Rao 
> >> wrote:
> >>
> >> > Hi, Calvin,
> >> >
> >> > 60.  If the range query has errors for some of the partitions, do we
> >> expect
> >> > different responses when querying particular partitions?
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > On Tue, Oct 10, 2023 at 10:50 AM Calvin Liu
>  >> >
> >> > wrote:
> >> >
> >> > > Hi Jun
> >> > > 60. Yes, it is a good question. I was thinking the API could be
> >> flexible
> >> > to
> >> > > query the particular partitions if the range query has errors for
> >> some of
> >> > > the partitions. Not sure whether it is a valid assumption, what do
> you
> >> > > think?
> >> > >
> >> > > 61. Good point, I will update them to partition level with the same
> >> > limit.
> >> > >
> >> > > 62. Sure, will do.
> >> > >
> >> > > Thanks
> >> > >
> >> > > On Tue, Oct 10, 2023 at 10:12 AM Jun Rao 
> >> > wrote:
> >> > >
> >> > > > Hi, Calvin,
> >> > > >
> >> > > > A few more minor comments on your latest update.
> >> > > >
> >> > > > 60. DescribeTopicRequest: When will the Partitions field be used?
> It
> >> > > seems
> >> > > > that the FirstPartitionId field is enough for AdminClient usage.
> >> > > >
> >> > > > 61. Could we make the limit for DescribeTopicRequest,
> >> > > ElectLeadersRequest,
> >> > > > GetReplicaLogInfo consistent? Currently, ElectLeadersRequest's
> >> limit is
> >> > > at
> >> > > > topic level and GetReplicaLogInfo has a different partition level
> >> limit
> >> > > > from DescribeTopicRequest.
> >> > > >
> >> > > > 62. Should ElectLeadersRequest.DesiredLeaders be at the same level
> >> as
> >> > > > ElectLeadersRequest.TopicPartitions.Partitions? In the KIP, it
> looks
> >> > like
> >> > > > it's at the same level as ElectLeadersRequest.TopicPartitions.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jun
> >> > > >
> >> > > > On Wed, Oct 4, 2023 at 3:55 PM Calvin Liu
> >> 
> >> > > > wrote:
> >> > > >
> >> > > > > Hi David,
> >> > > > > Thanks for the comments.
> >> > > > > 
> >> > > > > I thought that a new snapshot with the downgraded MV is created
> in
> >> > this
> >> > > > > case. Isn’t it the case?
> >> > > > > Yes, you are right, a metadata delta will be generated after the
> >> MV
> >> > > > > downgrade. Then the user can start the software downgrade.
> >> > > > > -
> >> > > > > Could you also elaborate a bit more on the reasoning behind
> adding
> >> > the
> >> > > > > limits to the admin RPCs? This is a new pattern in Kafka so it
> >> would
> >> > be
> >> > > > > good to be clear on the motivation.
> >> > > > > Thanks to Colin for bringing it up. The current MetadataRequest
> >> does
> >> > > not
> >> > > > > have a limit on the number of topics to query in a single
> request.
> >> > > > Massive
> >> > > > > requests can mess up the JVM. We want to have some sort of
> >> throttle
> >> > on
> >> > > > the
> >> > > > > new APIs.
> >> > > > > -
> >> > > > > Could you also explain how the client is supposed to handle the
> >> > > > > topics/partitions above the limit? I suppose that it will have
> to
> >> > retry
> >> > > > > those, correct?
> >> > > > > Correct. For the official admin clients, it will split the large
> >> > > request
> >> > > > > into proper pieces and query one after another.
> >> > > > > -
> >> > > > > My understanding is that the topics/partitions above the limit
> >> will
> >> > be
> >> > > > > failed with an invalid exception error. I wonder if this choice
> is
> >> > > > > judicious because the invalid request exception is usually
> >> fatal. It
> >> > > may
> >> > > > > be better to use a new and 

Re: [DISCUSS] KIP-975 Docker Image for Apache Kafka

2023-10-11 Thread Vedarth Sharma
Hey folks!

We've incorporated all the points discussed into the KIP. Additionally,
we've included a section detailing the Release Process of the Docker image.
We value your feedback, so please take some time to review it. Once we have
your inputs, we can proceed with the voting process.

Thanks and regards,
Vedarth


On Fri, Sep 8, 2023 at 9:58 AM Krishna Agarwal 
wrote:

>
> > Hi,
> > Apache Kafka does not have an official docker image currently.
> > I want to submit a KIP to publish a docker image for Apache Kafka.
> >
> > KIP-975: Docker Image for Apache Kafka
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka
> > >
> >
> > Regards,
> > Krishna
> >
>

On Thu, Sep 14, 2023 at 1:44 PM Viktor Somogyi-Vass
 wrote:

> Hi Krishna,
>
> I think you should merge this KIP and KIP-974 as there are overlaps as
> Federico pointed out on KIP-974. I think you should keep that one as it
> has well defined goals (improve tests) while I feel this one is too
> generic. Docker is usually just a tool for either testing or Kubernetes, so
> they have very well defined use-cases. In the case of Flink for instance
> the image is used for its kubernetes operator. The use case would determine
> a lot of things and I think a generic image would likely not fit the needs
> of all use-cases.
>
> Best,
> Viktor
>
> On Fri, Sep 8, 2023 at 9:58 AM Krishna Agarwal <
> krishna0608agar...@gmail.com>
> wrote:
>
> > Hi,
> > Apache Kafka does not have an official docker image currently.
> > I want to submit a KIP to publish a docker image for Apache Kafka.
> >
> > KIP-975: Docker Image for Apache Kafka
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka
> > >
> >
> > Regards,
> > Krishna
> >
>


Re: [DISCUSS] Road to Kafka 4.0

2023-10-11 Thread Christopher Shannon
I think JBOD definitely needs to be before 4.0. That has been a blocker
issue this entire time for me and my team and I'm sure others. While Kraft
has been technically "production ready" for a while, I haven't been able to
upgrade because of missing JBOD support.

On Wed, Oct 11, 2023 at 12:15 PM Ismael Juma  wrote:

> Hi Luke,
>
> This is a good discussion. And there is a lot more to it than KRaft.
>
> With regards to KRaft, there are two separate items:
> 1. Bugs
> 2. Missing features when compared to ZK
>
> When it comes to bugs, I don't see why 4.0 is particularly relevant. KRaft
> has been considered production-ready for over a year. If the bug is truly
> critical, we should fix it for 3.6.1 or 3.7.0 (depending on the
> complexity).
>
> When it comes to missing features, it would be preferable to land them
> before 4.0 as well (ideally 3.7). I believe KIP-858 (JBOD) is the obvious
> one in this category, but there are a few more in your list worth
> discussing.
>
> Ismael
>
> On Wed, Oct 11, 2023 at 5:18 AM Luke Chen  wrote:
>
> > Hi all,
> >
> > While Kafka 3.6.0 is released, I’d like to start the discussion for the
> > “road to Kafka 4.0”. Based on the plan in KIP-833
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7
> > >,
> > the next release 3.7 will be the final release before moving to Kafka 4.0
> > to remove the Zookeeper from Kafka. Before making this major change, I'd
> > like to get consensus on the "must-have features/fixes for Kafka 4.0", to
> > avoid some users being surprised when upgrading to Kafka 4.0. The intent
> is
> > to have a clear communication about what to expect in the following
> months.
> > In particular we should be signaling what features and configurations are
> > not supported, or at risk (if no one is able to add support or fix known
> > bugs).
> >
> > Here is the JIRA tickets list
> > 
> I
> > labeled for "4.0-blocker". The criteria I labeled as “4.0-blocker” are:
> > 1. The feature is supported in Zookeeper Mode, but not supported in KRaft
> > mode, yet (ex: KIP-858: JBOD in KRaft)
> > 2. Critical bugs in KRaft, (ex: KAFKA-15489 : split brain in KRaft
> > controller quorum)
> >
> > If you disagree with my current list, welcome to have discussion in the
> > specific JIRA ticket. Or, if you think there are some tickets I missed,
> > welcome to start a discussion in the JIRA ticket and ping me or other
> > people. After we get the consensus, we can label/unlabel it afterwards.
> > Again, the goal is to have an open communication with the community about
> > what will be coming in 4.0.
> >
> > Below is the high level category of the list content:
> >
> > 1. Recovery from disk failure
> > KIP-856
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery
> > >:
> > KRaft Disk Failure Recovery
> >
> > 2. Prevote to support controllers more than 3
> > KIP-650
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics
> > >:
> > Enhance Kafkaesque Raft semantics
> >
> > 3. JBOD support
> > KIP-858
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft
> > >:
> > Handle
> > JBOD broker disk failure in KRaft
> >
> > 4. Scale up/down Controllers
> > KIP-853
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes
> > >:
> > KRaft Controller Membership Changes
> >
> > 5. Modifying dynamic configurations on the KRaft controller
> >
> > 6. Critical bugs in KRaft
> >
> > Does this make sense?
> > Any feedback is welcomed.
> >
> > Thank you.
> > Luke
> >
>


Re: [DISCUSS] KIP-988 Streams Standby Task Update Listener

2023-10-11 Thread Sophie Blee-Goldman
Hey Colt! Thanks for the KIP -- this will be a great addition to Streams, I
can't believe we've gone so long without this.

Overall the proposal makes sense, but I had a handful of fairly minor
questions and suggestions/requests

1. Seems like the last sentence in the 2nd paragraph of the Motivation
section is cut off and incomplete -- "want to be able to know " what
exactly?

2. This isn't that important since the motivation as a whole is clear to me
and convincing enough, but I'm not quite sure I understand the example at
the end of the Motivation section. How are standby tasks (and the ability
to hook into and monitor their status) related to the session.timeout.ms
config?

3. To help both old and new users of Kafka Streams understand this new
restore listener and its purpose/semantics, can we try to name the class and
 callbacks in a way that's more consistent with the active task restore
listener?

3a. StandbyTaskUpdateListener:
The existing restore listener is called StateRestoreListener, so the new
one could be called something like StandbyStateRestoreListener. Although
we typically refer to standby tasks as "processing" rather than "restoring"
records -- ie restoration is a term for active task state specifically. I
actually
like the original suggestion if we just drop the "Task" part of the name,
ie StandbyUpdateListener. I think either that or StandbyRestoreListener
would be fine and probably the two best options.
Also, this probably goes without saying but any change to the name of this
class should of course be reflected in the KafkaStreams#setXXX API as well

3b. #onTaskCreated
 I know the "start" callback feels a bit different for the standby task
updater vs an active task beginning restoration, but I think we should try
to
keep the various callbacks aligned to their active restore listener
counterpart. We can/should just replace the term "restore" with "update"
for the
callback method names the same way we do for the class name, which in this
case would give us #onUpdateStart. Personally I like this better,
but it's ultimately up to you. However, I would push back against anything
that includes the word "Task" (eg #onTaskCreated) as the listener
 is actually not scoped to the task itself but instead to the individual
state store(s). This is the main reason I would prefer calling it something
like #onUpdateStart, which keeps the focus on the store being updated
rather than the task that just happens to own this store
One last thing on this callback -- do we really need both the
`earliestOffset` and `startingOffset`? I feel like this might be more
confusing than it
is helpful (tbh even I'm not completely sure I know what the earliestOffset
is supposed to represent) More importantly, is this all information
that is already available and able to be passed in to the callback by
Streams? I haven't checked on this but it feels like the earliestOffset is
likely to require a remote call, either by the embedded consumer or via the
admin client. If so, the ROI on including this parameter seems
quite low (if not outright negative)

3c. #onBatchRestored
If we opt to use the term "update" in place of "restore" elsewhere, then we
should consider doing so here as well. What do you think about
#onBatchUpdated, or even #onBatchProcessed?
I'm actually not super concerned about this particular API, and honestly I
think we can use restore or update interchangeably here, so if you
 don't like any of the suggested names (and no one can think of anything
better), I would just stick with #onBatchRestored. In this case,
it kind of makes the most sense.

3d. #onTaskSuspended
Along the same lines as 3b above, #onUpdateSuspended or just
#onRestoreSuspended probably makes more sense for this callback. Also,
 I notice the StateRestoreListener passes in the total number of records
restored to its #onRestoreSuspended. Assuming we already track
that information in Streams and have it readily available to pass in at
whatever point we would be invoking this callback, that might be a
useful  parameter for the standby listener to have as well

4. I totally love the SuspendReason thing, just two notes/requests:

4a. Feel free to push back against adding onto the scope of this KIP, but
it would be great to expand the active state restore listener with this
SuspendReason enum as well. It would be really useful for both variants of
restore listener

4b. Assuming we do 4a, let's rename PROMOTED to RECYCLED -- for standby
tasks it means basically the same thing, the point is that active
tasks can also be recycled into standbys through the same mechanism. This
way they can share the SuspendReason enum -- not that it's
necessary for them to share, I just think it would be a good idea to keep
the two restore listeners aligned to the highest degree possible.
I was actually considering proposing a short KIP with a new
RecyclingListener (or something) specifically for this exact kind of thing,
since we
currently have 
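
Putting these naming suggestions together, the listener shape under discussion
would look roughly like the sketch below (a sketch of the in-flight proposal,
not the final Kafka Streams API; the exact signatures are assumptions):

    import org.apache.kafka.common.TopicPartition;

    // Sketch of the standby listener with the names suggested above.
    public interface StandbyUpdateListener {

        enum SuspendReason {
            MIGRATED,  // the standby task moved to another instance
            RECYCLED   // the task was recycled into the opposite type (standby <-> active)
        }

        // Invoked when the updater starts reading records into a standby store.
        void onUpdateStart(TopicPartition topicPartition,
                           String storeName,
                           long startingOffset);

        // Invoked after a batch of records has been written to the standby store.
        void onBatchRestored(TopicPartition topicPartition,
                             String storeName,
                             long batchEndOffset,
                             long numRestored);

        // Invoked when updating of this standby store is suspended.
        void onUpdateSuspended(TopicPartition topicPartition,
                               String storeName,
                               long storeOffset,
                               SuspendReason reason);
    }

Note that the callbacks are scoped to the individual state store rather than
to the task, which is why none of the names mention "Task".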

Re: [DISCUSS] Road to Kafka 4.0

2023-10-11 Thread Ismael Juma
Hi Luke,

This is a good discussion. And there is a lot more to it than KRaft.

With regards to KRaft, there are two separate items:
1. Bugs
2. Missing features when compared to ZK

When it comes to bugs, I don't see why 4.0 is particularly relevant. KRaft
has been considered production-ready for over a year. If the bug is truly
critical, we should fix it for 3.6.1 or 3.7.0 (depending on the complexity).

When it comes to missing features, it would be preferable to land them
before 4.0 as well (ideally 3.7). I believe KIP-858 (JBOD) is the obvious
one in this category, but there are a few more in your list worth
discussing.

Ismael

On Wed, Oct 11, 2023 at 5:18 AM Luke Chen  wrote:

> Hi all,
>
> While Kafka 3.6.0 is released, I’d like to start the discussion for the
> “road to Kafka 4.0”. Based on the plan in KIP-833
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7
> >,
> the next release 3.7 will be the final release before moving to Kafka 4.0
> to remove the Zookeeper from Kafka. Before making this major change, I'd
> like to get consensus on the "must-have features/fixes for Kafka 4.0", to
> avoid some users being surprised when upgrading to Kafka 4.0. The intent is
> to have a clear communication about what to expect in the following months.
> In particular we should be signaling what features and configurations are
> not supported, or at risk (if no one is able to add support or fix known
> bugs).
>
> Here is the JIRA tickets list
>  I
> labeled for "4.0-blocker". The criteria I labeled as “4.0-blocker” are:
> 1. The feature is supported in Zookeeper Mode, but not supported in KRaft
> mode, yet (ex: KIP-858: JBOD in KRaft)
> 2. Critical bugs in KRaft, (ex: KAFKA-15489 : split brain in KRaft
> controller quorum)
>
> If you disagree with my current list, welcome to have discussion in the
> specific JIRA ticket. Or, if you think there are some tickets I missed,
> welcome to start a discussion in the JIRA ticket and ping me or other
> people. After we get the consensus, we can label/unlabel it afterwards.
> Again, the goal is to have an open communication with the community about
> what will be coming in 4.0.
>
> Below is the high level category of the list content:
>
> 1. Recovery from disk failure
> KIP-856
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery
> >:
> KRaft Disk Failure Recovery
>
> 2. Prevote to support controllers more than 3
> KIP-650
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics
> >:
> Enhance Kafkaesque Raft semantics
>
> 3. JBOD support
> KIP-858
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft
> >:
> Handle
> JBOD broker disk failure in KRaft
>
> 4. Scale up/down Controllers
> KIP-853
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes
> >:
> KRaft Controller Membership Changes
>
> 5. Modifying dynamic configurations on the KRaft controller
>
> 6. Critical bugs in KRaft
>
> Does this make sense?
> Any feedback is welcomed.
>
> Thank you.
> Luke
>


Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2023-10-11 Thread Jun Rao
Hi, Calvin,

Another thing. Currently, when there is an unclean leader election, we set
the LeaderRecoveryState in PartitionRecord and PartitionChangeRecord to 1.
With the KIP, will there be new values for LeaderRecoveryState? If not,
when will LeaderRecoveryState be set to 1?

Thanks,

Jun

On Tue, Oct 10, 2023 at 4:24 PM Jun Rao  wrote:

> Hi, Calvin,
>
> One more comment.
>
> "The first partition to fetch details for. -1 means to fetch all
> partitions." It seems that FirstPartitionId of 0 naturally means fetching
> all partitions?
>
> Thanks,
>
> Jun
>
> On Tue, Oct 10, 2023 at 12:40 PM Calvin Liu 
> wrote:
>
>> Hi Jun,
>> Yeah, with the current Metadata request handling, we only return errors on
>> the Topic level, like topic not found. It seems that querying a specific
>> partition is not a valid use case. Will update.
>> Thanks
>>
>> On Tue, Oct 10, 2023 at 11:55 AM Jun Rao 
>> wrote:
>>
>> > Hi, Calvin,
>> >
>> > 60.  If the range query has errors for some of the partitions, do we
>> expect
>> > different responses when querying particular partitions?
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Tue, Oct 10, 2023 at 10:50 AM Calvin Liu > >
>> > wrote:
>> >
>> > > Hi Jun
>> > > 60. Yes, it is a good question. I was thinking the API could be
>> flexible
>> > to
>> > > query the particular partitions if the range query has errors for
>> some of
>> > > the partitions. Not sure whether it is a valid assumption, what do you
>> > > think?
>> > >
>> > > 61. Good point, I will update them to partition level with the same
>> > limit.
>> > >
>> > > 62. Sure, will do.
>> > >
>> > > Thanks
>> > >
>> > > On Tue, Oct 10, 2023 at 10:12 AM Jun Rao 
>> > wrote:
>> > >
>> > > > Hi, Calvin,
>> > > >
>> > > > A few more minor comments on your latest update.
>> > > >
>> > > > 60. DescribeTopicRequest: When will the Partitions field be used? It
>> > > seems
>> > > > that the FirstPartitionId field is enough for AdminClient usage.
>> > > >
>> > > > 61. Could we make the limit for DescribeTopicRequest,
>> > > ElectLeadersRequest,
>> > > > GetReplicaLogInfo consistent? Currently, ElectLeadersRequest's
>> limit is
>> > > at
>> > > > topic level and GetReplicaLogInfo has a different partition level
>> limit
>> > > > from DescribeTopicRequest.
>> > > >
>> > > > 62. Should ElectLeadersRequest.DesiredLeaders be at the same level
>> as
>> > > > ElectLeadersRequest.TopicPartitions.Partitions? In the KIP, it looks
>> > like
>> > > > it's at the same level as ElectLeadersRequest.TopicPartitions.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jun
>> > > >
>> > > > On Wed, Oct 4, 2023 at 3:55 PM Calvin Liu
>> 
>> > > > wrote:
>> > > >
>> > > > > Hi David,
>> > > > > Thanks for the comments.
>> > > > > 
>> > > > > I thought that a new snapshot with the downgraded MV is created in
>> > this
>> > > > > case. Isn’t it the case?
>> > > > > Yes, you are right, a metadata delta will be generated after the
>> MV
>> > > > > downgrade. Then the user can start the software downgrade.
>> > > > > -
>> > > > > Could you also elaborate a bit more on the reasoning behind adding
>> > the
>> > > > > limits to the admin RPCs? This is a new pattern in Kafka so it
>> would
>> > be
>> > > > > good to be clear on the motivation.
>> > > > > Thanks to Colin for bringing it up. The current MetadataRequest
>> does
>> > > not
>> > > > > have a limit on the number of topics to query in a single request.
>> > > > Massive
>> > > > > requests can mess up the JVM. We want to have some sort of
>> throttle
>> > on
>> > > > the
>> > > > > new APIs.
>> > > > > -
>> > > > > Could you also explain how the client is supposed to handle the
>> > > > > topics/partitions above the limit? I suppose that it will have to
>> > retry
>> > > > > those, correct?
>> > > > > Correct. For the official admin clients, it will split the large
>> > > request
>> > > > > into proper pieces and query one after another.
>> > > > > -
>> > > > > My understanding is that the topics/partitions above the limit
>> will
>> > be
>> > > > > failed with an invalid exception error. I wonder if this choice is
>> > > > > judicious because the invalid request exception is usually
>> fatal. It
>> > > may
>> > > > > be better to use a new and explicit error for this case.
>> > > > >
>> > > > > Thanks for bringing this up. How about "REQUEST_LIMIT_REACHED"?
>> > > > > 
>> > > > > It seems that we still need to specify the changes to the admin
>> api
>> > to
>> > > > > accommodate the new or updated apis. Do you plan to add them?
>> > > > > Try to cover the following
>> > > > > 1. The admin client will use the new DescribeTopicRequest to query
>> > the
>> > > > > topics
>> > > > > 2. Mention the API limit and the new retriable error.
>> > > > > 3. Output changes for the admin client when describing a topic
>> (new
>> > > > fields
>> > > > > of ELR...)
>> > > > > 4. Changes to data structures like TopicPartitionInfo to include
>> the
>> > > ELR.
>> > > > > Anything else I missed?
>> > > > >
>> 

Re: [ANNOUNCE] Apache Kafka 3.6.0

2023-10-11 Thread Chris Egerton
Thanks Satish for all your hard work as release manager, and thanks to
everyone else for their contributions!

On Wed, Oct 11, 2023 at 6:07 AM Mickael Maison 
wrote:

> Thanks Satish and to everyone who contributed to this release!
>
> Mickael
>
>
> On Wed, Oct 11, 2023 at 11:09 AM Divij Vaidya 
> wrote:
> >
> > Thank you for all the hard work, Satish.
> >
> > Many years after KIP-405 was written, we have it implemented and
> > finally available for beta testing by users. It's a big milestone in
> > 3.6.0. Kudos again to you for driving it to this milestone. I am looking
> > forward to hearing the feedback from users so that we can fix the paper
> > cuts in 3.7.0.
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Wed, Oct 11, 2023 at 9:32 AM Viktor Somogyi-Vass
> >  wrote:
> >
> > > Thanks for the release Satish! :)
> > >
> > > On Wed, Oct 11, 2023, 09:30 Bruno Cadonna  wrote:
> > >
> > > > Thanks for the release, Satish!
> > > >
> > > > Best,
> > > > Bruno
> > > >
> > > > On 10/11/23 8:29 AM, Luke Chen wrote:
> > > > > Thanks for running the release, Satish!
> > > > >
> > > > > BTW, 3.6.0 should be a major release, not a minor one. :)
> > > > >
> > > > > Luke
> > > > >
> > > > > On Wed, Oct 11, 2023 at 1:39 PM Satish Duggana  >
> > > > wrote:
> > > > >
> > > > >> The Apache Kafka community is pleased to announce the release for
> > > > >> Apache Kafka 3.6.0
> > > > >>
> > > > >> This is a minor release and it includes fixes and improvements
> from
> > > 238
> > > > >> JIRAs.
> > > > >>
> > > > >> All of the changes in this release can be found in the release
> notes:
> > > > >> https://www.apache.org/dist/kafka/3.6.0/RELEASE_NOTES.html
> > > > >>
> > > > >> An overview of the release can be found in our announcement blog
> post:
> > > > >> https://kafka.apache.org/blog
> > > > >>
> > > > >> You can download the source and binary release (Scala 2.12 and
> Scala
> > > > 2.13)
> > > > >> from:
> > > > >> https://kafka.apache.org/downloads#3.6.0
> > > > >>
> > > > >>
> > > > >>
> > > >
> > >
> ---
> > > > >>
> > > > >>
> > > > >> Apache Kafka is a distributed streaming platform with four core
> APIs:
> > > > >>
> > > > >>
> > > > >> ** The Producer API allows an application to publish a stream of
> > > > records to
> > > > >> one or more Kafka topics.
> > > > >>
> > > > >> ** The Consumer API allows an application to subscribe to one or
> more
> > > > >> topics and process the stream of records produced to them.
> > > > >>
> > > > >> ** The Streams API allows an application to act as a stream
> processor,
> > > > >> consuming an input stream from one or more topics and producing an
> > > > >> output stream to one or more output topics, effectively
> transforming
> > > the
> > > > >> input streams to output streams.
> > > > >>
> > > > >> ** The Connector API allows building and running reusable
> producers or
> > > > >> consumers that connect Kafka topics to existing applications or
> data
> > > > >> systems. For example, a connector to a relational database might
> > > > >> capture every change to a table.
> > > > >>
> > > > >>
> > > > >> With these APIs, Kafka can be used for two broad classes of
> > > application:
> > > > >>
> > > > >> ** Building real-time streaming data pipelines that reliably get
> data
> > > > >> between systems or applications.
> > > > >>
> > > > >> ** Building real-time streaming applications that transform or
> react
> > > > >> to the streams of data.
> > > > >>
> > > > >>
> > > > >> Apache Kafka is in use at large and small companies worldwide,
> > > including
> > > > >> Capital One, Goldman Sachs, ING, LinkedIn, Netflix, Pinterest,
> > > Rabobank,
> > > > >> Target, The New York Times, Uber, Yelp, and Zalando, among others.
> > > > >>
> > > > >> A big thank you for the following 139 contributors to this
> release!
> > > > >> (Please report an unintended omission)
> > > > >>
> > > > >> This was a community effort, so thank you to everyone who
> contributed
> > > > >> to this release, including all our users and our 139 contributors:
> > > > >> A. Sophie Blee-Goldman, Aaron Ai, Abhijeet Kumar, aindriu-aiven,
> > > > >> Akhilesh Chaganti, Alexandre Dupriez, Alexandre Garnier, Alok
> > > > >> Thatikunta, Alyssa Huang, Aman Singh, Andras Katona, Andrew
> Schofield,
> > > > >> Andrew Grant, Aneel Kumar, Anton Agestam, Artem Livshits,
> atu-sharm,
> > > > >> bachmanity1, Bill Bejeck, Bo Gao, Bruno Cadonna, Calvin Liu,
> Chaitanya
> > > > >> Mukka, Chase Thomas, Cheryl Simmons, Chia-Ping Tsai, Chris
> Egerton,
> > > > >> Christo Lolov, Clay Johnson, Colin P. McCabe, Colt McNealy,
> d00791190,
> > > > >> Damon Xie, Danica Fine, Daniel Scanteianu, Daniel Urban, David
> Arthur,
> > > > >> David Jacot, David Mao, dengziming, Deqi Hu, Dimitar Dimitrov,
> Divij
> > > > >> Vaidya, DL1231, Dániel Urbán, Erik van Oosten, ezio, Farooq
> Qaiser,
> > > > >> Federico Valeri, flashmouse, Florin 

Re: [DISCUSS] KIP-932: Queues for Kafka

2023-10-11 Thread Andrew Schofield
Hi Jack,
Thanks for your comments.

I have added a new section on Log Retention which describes the behaviour of 
the SPSO as the LSO advances. That makes total sense
and was an omission from the KIP.

I have added the other ideas as potential future work. I do like the idea of 
having the SPSO influence the advancement of the LSO
for topics which are primarily being used with share groups.

I have published an updated version of the KIP.

Thanks,
Andrew

> On 4 Oct 2023, at 10:09, Jack Vanlightly  wrote:
> 
> I would like to see more explicit discussion of topic retention and share 
> groups. There are a few options here from simple to more sophisticated. There 
> are also topic-level and share-group level options.
> 
> The simple thing would be to ensure that the SPSO of each share group is 
> bounded by the Log Start Offset (LSO) of each partition which itself is 
> managed by the retention policy. This is a topic-level control which applies 
> to all share-groups. I would say that this shared retention is the largest 
> drawback of modeling queues on shared logs and this is worth noting.
> 
> More sophisticated approaches can be to allow the LSO to advance not (only) 
> by retention policy but by the advancement of the lowest SPSO. This can keep 
> the amount of data lower by garbage collecting messages that have been 
> acknowledged by all share groups. Some people may like that behaviour on 
> those topics where share groups are the only consumption model and no replay 
> is needed.
> 
> There are per-share-group possibilities such as share-group TTLs where 
> messages can be archived on a per share group basis.
> 
> Thanks
> Jack
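
The simple option described above, bounding each share group's SPSO by the
partition's Log Start Offset, amounts to a one-line clamp applied whenever
retention advances the LSO (illustrative Java, not Kafka code):

    // Illustrative only: when retention advances the log start offset (LSO),
    // every share group's start offset (SPSO) is pulled up with it, so records
    // removed by retention are skipped rather than re-delivered.
    public final class ShareGroupOffsets {
        private ShareGroupOffsets() {}

        public static long boundedSpso(long currentSpso, long logStartOffset) {
            return Math.max(currentSpso, logStartOffset);
        }
    }

The more sophisticated options invert the dependency: the lowest SPSO across
all share groups is allowed to drive the LSO forward, garbage collecting
messages that every share group has acknowledged.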



Re: Re: Re: [DISCUSS] KIP-972: Add the metric of the current running version of kafka

2023-10-11 Thread Doğuşcan Namal
Hello, do we have a metric showing the uptime? We could tag that metric
with version information as well.

I like the idea of adding the version as a tag as well. However, I am not
inclined to tag each metric with a KafkaVersion information. We could
discuss which metrics could be tagged but let's keep that out of scope from
this discussion.

On Wed, 11 Oct 2023 at 07:37, Sophie Blee-Goldman 
wrote:

> Just to chime in here since I recently went through a similar thing, I
> support adding the version
> as a tag instead of introducing an entirely new metric for this. In fact I
> just implemented exactly this
> in a project that uses Kafka, for these reasons:
>
> 1. Adding the version as a tag means that all metrics which are already
> collected will benefit, and lets you easily tell
> at a glance which version a specific client metric corresponds to. This is
> incredibly useful when looking at a dashboard
> covering multiple instances from different sources. For example, imagine a
> graph that plots the performance (eg bytes
> consumed rate) of many individual consumers and which shows several of them
> maxing out much lower than the rest.
> If the metric is tagged with the version already, you can easily check if
> the slow consumers are all using a specific version
> and may be displaying a performance regression. If the version info has to
> be plotted separately as its own metric, this is
> much more of a hassle to check.
> 2. Additional metrics can be expensive, but additional tags are almost
> always free (at least, that is my understanding)
> 3. As you guys already discussed, many systems (like Prometheus) require
> numeric values, and it's pretty much impossible
> to come up with a readable scheme for all the relevant versioning info --
> even if we removed the dots we're left with a rather
> unreadable representation of the version and of course will need to solve
> the "-SNAPSHOT" issue somehow. But beyond that,
> in addition to the raw version we also wanted to emit the specific commit
> id, which really needs to be a string.
>
> I'm pretty sure Kafka client metrics also include the commit id in addition
> to the version. If we add the version to the tags,
> we should consider adding the commit id as well. This is incredibly useful
> for intermediate/SNAPSHOT versions, which
> don't uniquely identify the specific code that is running.
>
> I would personally love to see a KIP start tagging the existing metrics
> with the version info, and it sounds like this would also
> solve your problem in a very natural way
>
> On Tue, Oct 10, 2023 at 5:42 AM Mickael Maison 
> wrote:
>
> > Hi Hudeqi,
> >
> > Rather than creating a gauge with a dummy value, could we add the
> > version (and commitId) as tags to an existing metric.
> > For example, alongside the existing Version and CommitId metrics
> > we have StartTimeMs. Maybe we can have a StartTimeMs metric with the
> > version (and commitId) as tags on it? The existing metric already has
> > the brokerid (id) as tag. WDYT?
> >
> > Thanks,
> > Mickael
> >
> > On Thu, Aug 31, 2023 at 4:59 AM hudeqi <16120...@bjtu.edu.cn> wrote:
> > >
> > > Thank you for your answer, Mickael.
> > > If we set the gauge to a constant value of 1 and add a tag whose key
> is "version" and whose value is the version string,
> does this solve the problem? We can get the version by tag in Prometheus.
> > >
> > > best,
> > > hudeqi
> > >
> > > Mickael Maison <mickael.mai...@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > Prometheus only support numeric values for metrics. This means it's
> > > > not able to handle the kafka.server:type=app-info metric since Kafka
> > > > versions are not valid numbers (3.5.0).
> > > > As a workaround we could create a metric with the version without the
> > > > dots, for example with value 350 for Kafka 3.5.0.
> > > >
> > > > Also in between releases Kafka uses the -SNAPSHOT suffix (for example
> > > > trunk is currently 3.7.0-SNAPSHOT) so we should also consider a way
> to
> > > > handle those.
> > > >
> > > > Thanks,
> > > > Mickael
> > > >
> > > > On Wed, Aug 30, 2023 at 2:51 PM hudeqi <16120...@bjtu.edu.cn> wrote:
> > > > >
> > > > > Hi, Kamal, thanks for the reminder, but I have a question: It seems
> > that I can't get this metric through "jmx_prometheus"? Although I
> observed
> > this metric through other tools.
> > > > >
> > > > > best,
> > > > > hudeqi
> > > > >
> > > > > Kamal Chandraprakash <kamal.chandraprak...@gmail.com> wrote:
> > > > > > Hi Hudeqi,
> > > > > >
> > > > > > Kafka already emits the version metric. Can you check whether the
> > below
> > > > > > metric satisfies your requirement?
> > > > > >
> > > > > > kafka.server:type=app-info,id=0
> > > > > >
> > > > > > --
> > > > > > Kamal
> > > > > >
> > > > > > On Mon, Aug 28, 2023 at 2:29 PM hudeqi <16120...@bjtu.edu.cn>
> > wrote:
> > > > > >
> > > > > > > Hi, all, I want to submit a minor kip to add a metric, which
> > supports to
> > > > > > > get 
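
The tagging approach discussed in this thread can be sketched with Kafka's own
metrics classes; the metric and group names below are illustrative, not a
settled proposal:

    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Gauge;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.utils.AppInfoParser;

    public class VersionTaggedMetric {
        // Registers a numeric gauge whose tags carry the version and commit id,
        // so the value stays Prometheus-friendly while the version and commit
        // id travel along as labels.
        public static void register(Metrics metrics, long startTimeMs) {
            Map<String, String> tags = new LinkedHashMap<>();
            tags.put("version", AppInfoParser.getVersion());
            tags.put("commit-id", AppInfoParser.getCommitId());

            MetricName name = metrics.metricName("start-time-ms", "app-info",
                    "Process start time, tagged with version info", tags);
            metrics.addMetric(name, (Gauge<Long>) (config, now) -> startTimeMs);
        }
    }

This matches Mickael's suggestion of reusing StartTimeMs as the carrier metric
and Sophie's point that tags are far cheaper than additional metrics.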

Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2272

2023-10-11 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 420380 lines...]
Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToCloseCleanShouldRemoveTaskFromPendingUpdateActions() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldDrainPendingTasksToCreate() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldDrainPendingTasksToCreate() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldKeepAddedTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > TasksTest > 
shouldKeepAddedTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> More than one query result throws IllegalArgumentException STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> More than one query result throws IllegalArgumentException PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> Zero query results shouldn't error STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> Zero query results shouldn't error PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> Valid query results still works STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > StateQueryResultTest 
> Valid query results still works PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@1c4c3327, 
org.apache.kafka.test.MockInternalProcessorContext@7e4ede72 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@1c4c3327, 
org.apache.kafka.test.MockInternalProcessorContext@7e4ede72 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@3d4e81fb, 
org.apache.kafka.test.MockInternalProcessorContext@2553486a STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@3d4e81fb, 
org.apache.kafka.test.MockInternalProcessorContext@2553486a PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@21eaea32, 
org.apache.kafka.test.MockInternalProcessorContext@21a24257 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@21eaea32, 
org.apache.kafka.test.MockInternalProcessorContext@21a24257 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@28be2758, 
org.apache.kafka.test.MockInternalProcessorContext@bfc20f2 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@28be2758, 
org.apache.kafka.test.MockInternalProcessorContext@bfc20f2 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCacheCapacity(RocksDBStore, StateStoreContext) > [1] 

[DISCUSS] Road to Kafka 4.0

2023-10-11 Thread Luke Chen
Hi all,

While Kafka 3.6.0 is released, I’d like to start the discussion for the
“road to Kafka 4.0”. Based on the plan in KIP-833
,
the next release 3.7 will be the final release before moving to Kafka 4.0
to remove the Zookeeper from Kafka. Before making this major change, I'd
like to get consensus on the "must-have features/fixes for Kafka 4.0", to
avoid some users being surprised when upgrading to Kafka 4.0. The intent is
to have a clear communication about what to expect in the following months.
In particular we should be signaling what features and configurations are
not supported, or at risk (if no one is able to add support or fix known
bugs).

Here is the JIRA tickets list
 I
labeled for "4.0-blocker". The criteria I labeled as “4.0-blocker” are:
1. The feature is supported in Zookeeper Mode, but not supported in KRaft
mode, yet (ex: KIP-858: JBOD in KRaft)
2. Critical bugs in KRaft, (ex: KAFKA-15489 : split brain in KRaft
controller quorum)

If you disagree with my current list, welcome to have discussion in the
specific JIRA ticket. Or, if you think there are some tickets I missed,
welcome to start a discussion in the JIRA ticket and ping me or other
people. After we get the consensus, we can label/unlabel it afterwards.
Again, the goal is to have an open communication with the community about
what will be coming in 4.0.

Below is the high level category of the list content:

1. Recovery from disk failure
KIP-856
:
KRaft Disk Failure Recovery

2. Prevote to support controllers more than 3
KIP-650
:
Enhance Kafkaesque Raft semantics

3. JBOD support
KIP-858
:
Handle
JBOD broker disk failure in KRaft

4. Scale up/down Controllers
KIP-853
:
KRaft Controller Membership Changes

5. Modifying dynamic configurations on the KRaft controller

6. Critical bugs in KRaft

Does this make sense?
Any feedback is welcomed.

Thank you.
Luke


Re: [DISCUSS] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-10-11 Thread Bruno Cadonna

Thanks for the updates, Alieh!

The example in the KIP uses the allVersions() method which we agreed to 
remove.


Regarding your questions:
1. asOf vs. until: I am fine with both but slightly prefer until.
2. What about KeyMultiVersionsQuery, KeyVersionsQuery (KIP-960 would 
then be KeyVersionQuery). However, I am also fine with 
MultiVersionedKeyQuery since none of the names sounds better or worse to 
me.
3. I agree with you not to introduce the method with the two bounds to 
keep things simple.

4. Forget about fromTime() and asOfTime(), from() and asOf() is fine.
5. The main purpose is to show how to use the API. Maybe make an example 
with just the key to distinguish this query from the single value query 
of KIP-960 and then one with a key and a time range. When you iterate 
over the results you could also call validTo(). Maybe add some actual 
records in the comments to show what the result might look like.


Regarding the test plan, I hope you also plan to add unit tests in all 
of your KIPs. Maybe you could also explain why system tests are not 
needed here.


Best,
Bruno

On 10/10/23 5:36 PM, Alieh Saeedi wrote:

Thank you all for the very exact and constructive comments. I really
enjoyed reading your ideas and all the important points you made me aware
of. I updated KIP-968 as follows:

1. If the key or time bounds are null, the method throws an NPE.
2. The "valid" word: I removed the sentence "all the records that are
valid..." and replaced it with an exact explanation. More over, I explained
it with an example in the KIP but not in the javadocs. Do I need to add the
example to the javadocs as well?
3. Since I followed Bruno's suggestion and removed the allVersions()
method, the problem of meaningless combinations is solved, and I do not
need any IllegalArgumentException or something like that. Therefore, the
change is that if no time bound is specified, the query returns the records
with the specified key for all timestamps (all versions).
4. As Victoria suggested, adding a method to the *VersionedKeyValueStore*
interface is essential. So I did that. I had this method only in the
RocksDBVersionedStore class, which was not enough.
5. I added the *validTo* field to the VersionedRecord class to be able
to represent the tombstones. As you suggested, we postpone solving the
problem of retrieving consecutive tombstones for later.
6. I added the "Test Plan" section to all KIPs. I hope what I wrote is
convincing.
7. I added the *withAscendingTimestamp()* method to provide more
code readability
for the user.
8. I removed the evil word "get" from all getter methods.

There have also been some more suggestions which I am still not convinced
or clear about them:

1. Regarding asOf vs until: reading all comments, my conclusion was that
I keep it as "asOf" (following Walker's idea as the native speaker as well
as Bruno's suggestion to be consistent with single-key_single_ts queries).
But I do not have a personal preference. If you insist on "until", I change
it.
2. Bruno suggested renaming the class "MultiVersionedKeyQuery" to sth
else. We already had a long discussion about the name with Matthias. I am
open to renaming it to something else, but do you have any ideas?
3. Matthias suggested having a method with two input parameters that
enables the user to specify both time bounds in the same method. Isn't it
introducing redundancy? It is somehow disrespectful to the idea of having
composable methods.
4. Bruno suggested renaming the methods "asOf" and "from" to "asOfTime"
and "fromTime". If I do that, then it is not consistent with KIP-960.
Moreover, the input parameter is clearly a timestamp, which explains
enough. What do you think about that?
5. I was asked to add more examples to the example section. My question
is, what is the main purpose of that? If I know it clearly, then I can add
what you mean.



Cheers,
Alieh

On Tue, Oct 10, 2023 at 1:13 AM Matthias J. Sax  wrote:


Bruno and I had some background conversation about the `get` prefix
question including a few other committers.

The official policy was never changed, and we should not add the
`get`-prefix. It's a slip on our side in previous KIPs to add the
`get`-prefix and we should actually clean it up doing a follow up KIP.


-Matthias


On 10/5/23 5:26 AM, Bruno Cadonna wrote:

Hi Matthias,

Given all the IQv2 KIPs that use getX and given recent PRs (internal
interfaces mainly) that got merged, I was under the impression that we
moved away from the strict no-getX policy.

I do not think it was an accident using getX in the IQv2 KIPs since
somebody would have brought it up, otherwise.

I am fine with both types of getters.

If we think, we need to discuss this in a broader context, let's start a
separate thread.


Best,
Bruno





On 10/5/23 7:44 AM, Matthias J. Sax wrote:

I agree to 
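
For concreteness, the composable query discussed in this thread would be built
roughly as follows (a sketch against the in-flight KIP-968 proposal, not a
released API; from()/asOf()/withAscendingTimestamp() follow the thread, and
withKey() is assumed from the companion KIP-960 API):

    import java.time.Instant;

    Instant t1 = Instant.parse("2023-01-01T00:00:00Z");
    Instant t2 = Instant.parse("2023-06-01T00:00:00Z");

    // All versions of the record for key "k" that were valid within [t1, t2],
    // returned with ascending timestamps.
    MultiVersionedKeyQuery<String, Integer> query =
        MultiVersionedKeyQuery.<String, Integer>withKey("k")
            .from(t1)
            .asOf(t2)
            .withAscendingTimestamp();

If no time bound is specified, the query returns the records for the key
across all timestamps, per point 3 of Alieh's list above.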

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2271

2023-10-11 Thread Apache Jenkins Server
See 




Re: [REVIEW REQUEST] ReassignPartitionCommand scala -> java

2023-10-11 Thread Николай Ижиков
Hello, Team.

Just a friendly reminder.

A large amount of work is waiting to be finished with this PR.
Please join the review.

> On Oct 5, 2023, at 16:49, Николай Ижиков wrote:
> 
> Hello.
> 
> PR got some reviews and approvals from non-committers.
> CI is OK.
> Dear committers, please join the review -
> https://github.com/apache/kafka/pull/13247
> 
>> On Oct 3, 2023, at 00:22, Николай Ижиков wrote:
>> 
>> Hello.
>> 
>> Thanks everyone for the help.
>> I think we have come to the stage where the command itself can be reviewed.
>> 
>> All tests and dependencies of ReassignPartitionCommand were translated from Scala
>> to Java in previous PRs.
>> 
>> Please, join the review - https://github.com/apache/kafka/pull/13247
>> 
> 



Re: [ANNOUNCE] Apache Kafka 3.6.0

2023-10-11 Thread Mickael Maison
Thanks Satish and to everyone who contributed to this release!

Mickael


On Wed, Oct 11, 2023 at 11:09 AM Divij Vaidya  wrote:
>
> Thank you for all the hard work, Satish.
>
> Many years after KIP-405 was written, we have it implemented and
> finally available for beta testing by users. It's a big milestone in
> 3.6.0. Kudos again to you for driving it to this milestone. I am looking
> forward to hearing the feedback from users so that we can fix the paper
> cuts in 3.7.0.
>
> --
> Divij Vaidya
>
>
>
> On Wed, Oct 11, 2023 at 9:32 AM Viktor Somogyi-Vass
>  wrote:
>
> > Thanks for the release Satish! :)
> >
> > On Wed, Oct 11, 2023, 09:30 Bruno Cadonna  wrote:
> >
> > > Thanks for the release, Satish!
> > >
> > > Best,
> > > Bruno
> > >
> > > On 10/11/23 8:29 AM, Luke Chen wrote:
> > > > Thanks for running the release, Satish!
> > > >
> > > > BTW, 3.6.0 should be a major release, not a minor one. :)
> > > >
> > > > Luke
> > > >
> > > > On Wed, Oct 11, 2023 at 1:39 PM Satish Duggana 
> > > wrote:
> > > >
> > > >> The Apache Kafka community is pleased to announce the release for
> > > >> Apache Kafka 3.6.0
> > > >>
> > > >> This is a minor release and it includes fixes and improvements from
> > 238
> > > >> JIRAs.
> > > >>
> > > >> All of the changes in this release can be found in the release notes:
> > > >> https://www.apache.org/dist/kafka/3.6.0/RELEASE_NOTES.html
> > > >>
> > > >> An overview of the release can be found in our announcement blog post:
> > > >> https://kafka.apache.org/blog
> > > >>
> > > >> You can download the source and binary release (Scala 2.12 and Scala
> > > 2.13)
> > > >> from:
> > > >> https://kafka.apache.org/downloads#3.6.0
> > > >>
> > > >>
> > > >>
> > >
> > ---
> > > >>
> > > >>
> > > >> Apache Kafka is a distributed streaming platform with four core APIs:
> > > >>
> > > >>
> > > >> ** The Producer API allows an application to publish a stream of
> > > records to
> > > >> one or more Kafka topics.
> > > >>
> > > >> ** The Consumer API allows an application to subscribe to one or more
> > > >> topics and process the stream of records produced to them.
> > > >>
> > > >> ** The Streams API allows an application to act as a stream processor,
> > > >> consuming an input stream from one or more topics and producing an
> > > >> output stream to one or more output topics, effectively transforming
> > the
> > > >> input streams to output streams.
> > > >>
> > > >> ** The Connector API allows building and running reusable producers or
> > > >> consumers that connect Kafka topics to existing applications or data
> > > >> systems. For example, a connector to a relational database might
> > > >> capture every change to a table.
> > > >>
> > > >>
> > > >> With these APIs, Kafka can be used for two broad classes of
> > application:
> > > >>
> > > >> ** Building real-time streaming data pipelines that reliably get data
> > > >> between systems or applications.
> > > >>
> > > >> ** Building real-time streaming applications that transform or react
> > > >> to the streams of data.
> > > >>
> > > >>
> > > >> Apache Kafka is in use at large and small companies worldwide,
> > including
> > > >> Capital One, Goldman Sachs, ING, LinkedIn, Netflix, Pinterest,
> > Rabobank,
> > > >> Target, The New York Times, Uber, Yelp, and Zalando, among others.
> > > >>
> > > >> A big thank you for the following 139 contributors to this release!
> > > >> (Please report an unintended omission)
> > > >>
> > > >> This was a community effort, so thank you to everyone who contributed
> > > >> to this release, including all our users and our 139 contributors:
> > > >> A. Sophie Blee-Goldman, Aaron Ai, Abhijeet Kumar, aindriu-aiven,
> > > >> Akhilesh Chaganti, Alexandre Dupriez, Alexandre Garnier, Alok
> > > >> Thatikunta, Alyssa Huang, Aman Singh, Andras Katona, Andrew Schofield,
> > > >> Andrew Grant, Aneel Kumar, Anton Agestam, Artem Livshits, atu-sharm,
> > > >> bachmanity1, Bill Bejeck, Bo Gao, Bruno Cadonna, Calvin Liu, Chaitanya
> > > >> Mukka, Chase Thomas, Cheryl Simmons, Chia-Ping Tsai, Chris Egerton,
> > > >> Christo Lolov, Clay Johnson, Colin P. McCabe, Colt McNealy, d00791190,
> > > >> Damon Xie, Danica Fine, Daniel Scanteianu, Daniel Urban, David Arthur,
> > > >> David Jacot, David Mao, dengziming, Deqi Hu, Dimitar Dimitrov, Divij
> > > >> Vaidya, DL1231, Dániel Urbán, Erik van Oosten, ezio, Farooq Qaiser,
> > > >> Federico Valeri, flashmouse, Florin Akermann, Gabriel Oliveira,
> > > >> Gantigmaa Selenge, Gaurav Narula, GeunJae Jeon, Greg Harris, Guozhang
> > > >> Wang, Hailey Ni, Hao Li, Hector Geraldino, hudeqi, hzh0425, Iblis Lin,
> > > >> iit2009060, Ismael Juma, Ivan Yurchenko, James Shaw, Jason Gustafson,
> > > >> Jeff Kim, Jim Galasyn, John Roesler, Joobi S B, Jorge Esteban Quilcate
> > > >> Otoya, Josep Prat, Joseph (Ting-Chou) Lin, José Armando García Sancio,
> > > >> Jun Rao, Justine Olshan, 

[jira] [Resolved] (KAFKA-15577) Reload4j | CVE-2022-45868

2023-10-11 Thread Bruno Cadonna (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Cadonna resolved KAFKA-15577.
---
Resolution: Not A Problem

> Reload4j | CVE-2022-45868
> -
>
> Key: KAFKA-15577
> URL: https://issues.apache.org/jira/browse/KAFKA-15577
> Project: Kafka
>  Issue Type: Bug
>Reporter: masood
>Priority: Critical
>
> Maven indicates 
> [CVE-2022-45868|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-45868]
>  in Reload4j.jar.
> [https://mvnrepository.com/artifact/ch.qos.reload4j/reload4j/1.2.19]
> Could you please verify if this vulnerability affects Kafka?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-960: Support single-key_single-timestamp interactive queries (IQv2) for versioned state stores

2023-10-11 Thread Bruno Cadonna

Thanks for the KIP, Alieh!

+1 (binding)

Best,
Bruno

On 10/10/23 1:14 AM, Matthias J. Sax wrote:
One more nit: as discussed on the related KIP-968 thread, we should not 
use `get` as prefix for the getters.


So it should be `K key()` and `Optional<Instant> asOfTimestamp()`.


Otherwise the KIP LGTM.


+1 (binding)


-Matthias

On 10/6/23 2:50 AM, Alieh Saeedi wrote:

Hi everyone,

Since KIP-960 is reduced to the simplest IQ type and all further comments
are related to the following-up KIPs, I decided to finalize it at this
point.


A huge thank you to everyone who has reviewed this KIP (and also the
following-up ones), and
participated in the discussion thread!

I'd also like to thank you in advance for taking the time to vote.

Best,
Alieh
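
The naming convention being voted on makes the query's accessors look like the
following sketch (illustrative only; the asOf bound is assumed to be an
Instant, consistent with the discussion thread):

    import java.time.Instant;
    import java.util.Optional;

    public final class VersionedKeyQuery<K, V> {
        private final K key;
        private final Optional<Instant> asOfTimestamp;

        private VersionedKeyQuery(K key, Optional<Instant> asOfTimestamp) {
            this.key = key;
            this.asOfTimestamp = asOfTimestamp;
        }

        // Accessors without the `get` prefix, per the convention above.
        public K key() { return key; }
        public Optional<Instant> asOfTimestamp() { return asOfTimestamp; }
    }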



[jira] [Created] (KAFKA-15577) Reload4j | CVE-2022-45868

2023-10-11 Thread masood (Jira)
masood created KAFKA-15577:
--

 Summary: Reload4j | CVE-2022-45868
 Key: KAFKA-15577
 URL: https://issues.apache.org/jira/browse/KAFKA-15577
 Project: Kafka
  Issue Type: Bug
Reporter: masood


Maven indicates 
[CVE-2022-45868|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-45868] 
in Reload4j.jar.

[https://mvnrepository.com/artifact/ch.qos.reload4j/reload4j/1.2.19]

Could you please verify if this vulnerability affects Kafka?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Re: Re: [DISCUSS] KIP-972: Add the metric of the current running version of kafka

2023-10-11 Thread Sophie Blee-Goldman
Just to chime in here since I recently went through a similar thing, I
support adding the version
as a tag instead of introducing an entirely new metric for this. In fact I
just implemented exactly this
in a project that uses Kafka, for these reasons:

1. Adding the version as a tag means that all metrics which are already
collected will benefit, and lets you easily tell
at a glance which version a specific client metric corresponds to. This is
incredibly useful when looking at a dashboard
covering multiple instances from different sources. For example, imagine a
graph that plots the performance (eg bytes
consumed rate) of many individual consumers and which shows several of them
maxing out much lower than the rest.
If the metric is tagged with the version already, you can easily check if
the slow consumers are all using a specific version
and may be displaying a performance regression. If the version info has to
be plotted separately as its own metric, this is
much more of a hassle to check.
2. Additional metrics can be expensive, but additional tags are almost
always free (at least, that is my understanding)
3. As you guys already discussed, many systems (like Prometheus) require
numeric values, and it's pretty much impossible
to come up with a readable scheme for all the relevant versioning info --
even if we removed the dots we're left with a rather
unreadable representation of the version and of course will need to solve
the "-SNAPSHOT" issue somehow. But beyond that,
in addition to the raw version we also wanted to emit the specific commit
id, which really needs to be a string.

I'm pretty sure Kafka client metrics also include the commit id in addition
to the version. If we add the version to the tags,
we should consider adding the commit id as well. This is incredibly useful
for intermediate/SNAPSHOT versions, which
don't uniquely identify the specific code that is running.

I would personally love to see a KIP start tagging the existing metrics
with the version info, and it sounds like this would also
solve your problem in a very natural way

On Tue, Oct 10, 2023 at 5:42 AM Mickael Maison 
wrote:

> Hi Hudeqi,
>
> Rather than creating a gauge with a dummy value, could we add the
> version (and commitId) as tags to an existing metric.
> For example, alongside the existing Version and CommitId metrics
> we have StartTimeMs. Maybe we can have a StartTimeMs metric with the
> version (and commitId) as tags on it? The existing metric already has
> the brokerid (id) as tag. WDYT?
>
> Thanks,
> Mickael
>
> On Thu, Aug 31, 2023 at 4:59 AM hudeqi <16120...@bjtu.edu.cn> wrote:
> >
> > Thank you for your answer, Mickael.
> > If we set the gauge to a constant value of 1 and add a tag whose key
> is "version" and whose value is the version string,
> does this solve the problem? We can get the version by tag in Prometheus.
> >
> > best,
> > hudeqi
> >
> > Mickael Maison <mickael.mai...@gmail.com> wrote:
> > > Hi,
> > >
> > > Prometheus only support numeric values for metrics. This means it's
> > > not able to handle the kafka.server:type=app-info metric since Kafka
> > > versions are not valid numbers (3.5.0).
> > > As a workaround we could create a metric with the version without the
> > > dots, for example with value 350 for Kafka 3.5.0.
> > >
> > > Also in between releases Kafka uses the -SNAPSHOT suffix (for example
> > > trunk is currently 3.7.0-SNAPSHOT) so we should also consider a way to
> > > handle those.
> > >
> > > Thanks,
> > > Mickael
> > >
> > > On Wed, Aug 30, 2023 at 2:51 PM hudeqi <16120...@bjtu.edu.cn> wrote:
> > > >
> > > > Hi, Kamal, thanks for the reminder, but I have a question: It seems
> that I can't get this metric through "jmx_prometheus"? Although I observed
> this metric through other tools.
> > > >
> > > > best,
> > > > hudeqi
> > > >
> > > > Kamal Chandraprakash kamal.chandraprak...@gmail.com
> 写道:
> > > > > Hi Hudeqi,
> > > > >
> > > > > Kafka already emits the version metric. Can you check whether the
> below
> > > > > metric satisfies your requirement?
> > > > >
> > > > > kafka.server:type=app-info,id=0
> > > > >
> > > > > --
> > > > > Kamal
> > > > >
> > > > > On Mon, Aug 28, 2023 at 2:29 PM hudeqi <16120...@bjtu.edu.cn>
> wrote:
> > > > >
> > > > > > Hi, all, I want to submit a minor KIP to add a metric that
> > > > > > exposes the running Kafka server version; the wiki url is here
> > > > > >
> > > > > > Motivation
> > > > > >
> > > > > > At present, there is no way to tell from metrics which Kafka
> > > > > > version a broker is running. If multiple Kafka versions are
> > > > > > deployed in a cluster for various reasons, it is difficult for
> > > > > > us to get an intuitive picture of the version distribution.
> > > > > >
> > > > > > So, I want to add a Kafka version metric indicating the version
> > > > > > of the currently running Kafka server; it can help us to perceive
> > > > > > the mixed
> > > > > > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-11 Thread Andrew Schofield
Matthias,
Yes, I think that’s a sensible way forward and the interface you propose looks 
good. I’ll update the KIP accordingly.
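
For anyone following along, a minimal usage sketch -- assuming the
clientsInstanceIds(Duration) method and the ClientsInstanceIds type land
exactly as proposed below; the surrounding wiring is illustrative only:

    import java.time.Duration;
    import org.apache.kafka.streams.KafkaStreams;

    public class InstanceIdDump {
        // Log the client instance id of every client embedded in a running
        // KafkaStreams instance, keyed by the thread that owns it.
        static void dump(KafkaStreams streams) {
            // Collecting the ids may block while each client fetches its id
            // from the broker, hence the explicit timeout.
            ClientsInstanceIds ids = streams.clientsInstanceIds(Duration.ofSeconds(10));
            System.out.println("admin: " + ids.adminInstanceId());
            ids.globalConsumerInstanceId().ifPresent(
                    id -> System.out.println("global consumer: " + id));
            ids.mainConsumerInstanceIds().forEach(
                    (threadKey, id) -> System.out.println(threadKey + " consumer: " + id));
            ids.restoreConsumerInstanceIds().forEach(
                    (threadKey, id) -> System.out.println(threadKey + " restore consumer: " + id));
            ids.producerInstanceIds().forEach(
                    (threadKey, id) -> System.out.println(threadKey + " producer: " + id));
        }
    }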

Thanks,
Andrew

> On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:
>
> Andrew,
>
> yes I would like to get this change into KIP-714 right away. Seems to be 
> important, as we don't know if/when a follow-up KIP for Kafka Streams would 
> land.
>
> I was also thinking (and discussed with a few others) about how to expose 
> it, and we would propose the following:
>
> We add a new method to the `KafkaStreams` class:
>
>     public ClientsInstanceIds clientsInstanceIds(Duration timeout);
>
> The returned object is like below:
>
>     public class ClientsInstanceIds {
>         // we only have a single admin client per KS instance
>         String adminInstanceId();
>
>         // we only have a single global consumer per KS instance (if any)
>         // Optional<> because we might not have a global thread
>         Optional<String> globalConsumerInstanceId();
>
>         // return a <threadKey, clientInstanceId> mapping
>         // for the underlying (restore-)consumers/producers
>         Map<String, String> mainConsumerInstanceIds();
>         Map<String, String> restoreConsumerInstanceIds();
>         Map<String, String> producerInstanceIds();
>     }
>
> For the `threadKey`, we would use some pattern like this:
>
>     [Stream|StateUpdater]Thread-<index>
>
>
> Would this work from your POV?
>
>
>
> -Matthias
>
>
> On 10/9/23 2:15 AM, Andrew Schofield wrote:
>> Hi Matthias,
>> Good point. Makes sense to me.
>> Is this something that can also be included in the proposed Kafka Streams 
>> follow-on KIP, or would you prefer that I add it to KIP-714?
>> I have a slight preference for the former, to put all of the KS enhancements 
>> into a separate KIP.
>> Thanks,
>> Andrew
>>> On 7 Oct 2023, at 02:12, Matthias J. Sax  wrote:
>>>
>>> Thanks Andrew. SGTM.
>>>
>>> One point you did not address is the idea to add a method to `KafkaStreams` 
>>> similar to the proposed `clientInstanceId()` that will be added to 
>>> consumer/producer/admin clients.
>>>
>>> Without addressing this, Kafka Streams users won't have a way to get the 
>>> assigned `instanceId` of the internally created clients, and thus it would 
>>> be very difficult for them to know which of the metrics the broker receives 
>>> belong to a Kafka Streams app. It seems they would only find the 
>>> `instanceIds` in the log4j output if they enable client logging?
>>>
>>> Of course, because there are multiple clients inside Kafka Streams, the 
>>> return type cannot be a single "String", but must be some more complex 
>>> data structure -- we could either add a new class, or return a 
>>> Map using a client key that maps to the `instanceId`.
>>>
>>> For example we could use the following key:
>>>
>>>   [Global]StreamThread[-<index>][-restore][consumer|producer]
>>>
>>> (Of course, only the valid combinations.)
>>>
>>> Or maybe even better, we might want to return a `Future`, because 
>>> collecting all the `instanceId`s might be a blocking call on each client? 
>>> I already have a few ideas for how it could be implemented, but I don't 
>>> think it must be discussed in the KIP, as it's an implementation detail.
>>>
>>> Thoughts?
>>>
>>>
>>> -Matthias
>>>
>>> On 10/6/23 4:21 AM, Andrew Schofield wrote:
 Hi Matthias,
 Thanks for your comments. I agree that a follow-up KIP for Kafka Streams 
 makes sense. This KIP currently has made a bit
 of an effort to embrace KS, but it’s not enough by a long way.
 I have removed `application.id`. This should be 
 done properly in the follow-up KIP. I don’t believe there’s a downside to
 removing it from this KIP.
 I have reworded the statement about temporality. In practice, the 
 implementation of this KIP that’s going on while the voting
 progresses happens to use delta temporality, but that’s an implementation 
 detail. Supporting clients must support both
 temporalities.
 I thought about exposing the client instance ID as a metric, but 
 non-numeric metrics are not usual practice and tools
 do not universally support them. I don’t think the KIP is improved by 
 adding one now.
 I have also added constants for the various Config classes for 
 ENABLE_METRICS_PUSH_CONFIG, including to
 StreamsConfig. It’s best to be explicit about this.
 Thanks,
 Andrew
> On 2 Oct 2023, at 23:47, Matthias J. Sax  wrote:
>
> Hi,
>
> I did not pay attention to this KIP in the past; seems it was on-hold for 
> a while.
>
> Overall it sounds very useful, and I think we should extend this with a 
> follow up KIP for Kafka Streams. What is unclear to me at this point is 
> the statement:
>
>> Kafka Streams applications have an application.id configured and this 
>> identifier should be included as the application_id metrics label.
>
> The `application.id` is currently only used as the (main) consumer's 
> `group.id` (and is part of an auto-generated `client.id` if the user does 
> not set one).
>

Re: [ANNOUNCE] Apache Kafka 3.6.0

2023-10-11 Thread Luke Chen
Thanks for running the release, Satish!

BTW, 3.6.0 should be a major release, not a minor one. :)

Luke

On Wed, Oct 11, 2023 at 1:39 PM Satish Duggana  wrote:

> The Apache Kafka community is pleased to announce the release for
> Apache Kafka 3.6.0
>
> This is a minor release and it includes fixes and improvements from 238
> JIRAs.
>
> All of the changes in this release can be found in the release notes:
> https://www.apache.org/dist/kafka/3.6.0/RELEASE_NOTES.html
>
> An overview of the release can be found in our announcement blog post:
> https://kafka.apache.org/blog
>
> You can download the source and binary release (Scala 2.12 and Scala 2.13)
> from:
> https://kafka.apache.org/downloads#3.6.0
>
>
> ---
>
>
> Apache Kafka is a distributed streaming platform with four core APIs:
>
>
> ** The Producer API allows an application to publish a stream of records to
> one or more Kafka topics.
>
> ** The Consumer API allows an application to subscribe to one or more
> topics and process the stream of records produced to them.
>
> ** The Streams API allows an application to act as a stream processor,
> consuming an input stream from one or more topics and producing an
> output stream to one or more output topics, effectively transforming the
> input streams to output streams.
>
> ** The Connector API allows building and running reusable producers or
> consumers that connect Kafka topics to existing applications or data
> systems. For example, a connector to a relational database might
> capture every change to a table.
>
>
> With these APIs, Kafka can be used for two broad classes of application:
>
> ** Building real-time streaming data pipelines that reliably get data
> between systems or applications.
>
> ** Building real-time streaming applications that transform or react
> to the streams of data.
>
>
> Apache Kafka is in use at large and small companies worldwide, including
> Capital One, Goldman Sachs, ING, LinkedIn, Netflix, Pinterest, Rabobank,
> Target, The New York Times, Uber, Yelp, and Zalando, among others.
>
> A big thank you to the following 139 contributors to this release!
> (Please report an unintended omission)
>
> This was a community effort, so thank you to everyone who contributed
> to this release, including all our users and our 139 contributors:
> A. Sophie Blee-Goldman, Aaron Ai, Abhijeet Kumar, aindriu-aiven,
> Akhilesh Chaganti, Alexandre Dupriez, Alexandre Garnier, Alok
> Thatikunta, Alyssa Huang, Aman Singh, Andras Katona, Andrew Schofield,
> Andrew Grant, Aneel Kumar, Anton Agestam, Artem Livshits, atu-sharm,
> bachmanity1, Bill Bejeck, Bo Gao, Bruno Cadonna, Calvin Liu, Chaitanya
> Mukka, Chase Thomas, Cheryl Simmons, Chia-Ping Tsai, Chris Egerton,
> Christo Lolov, Clay Johnson, Colin P. McCabe, Colt McNealy, d00791190,
> Damon Xie, Danica Fine, Daniel Scanteianu, Daniel Urban, David Arthur,
> David Jacot, David Mao, dengziming, Deqi Hu, Dimitar Dimitrov, Divij
> Vaidya, DL1231, Dániel Urbán, Erik van Oosten, ezio, Farooq Qaiser,
> Federico Valeri, flashmouse, Florin Akermann, Gabriel Oliveira,
> Gantigmaa Selenge, Gaurav Narula, GeunJae Jeon, Greg Harris, Guozhang
> Wang, Hailey Ni, Hao Li, Hector Geraldino, hudeqi, hzh0425, Iblis Lin,
> iit2009060, Ismael Juma, Ivan Yurchenko, James Shaw, Jason Gustafson,
> Jeff Kim, Jim Galasyn, John Roesler, Joobi S B, Jorge Esteban Quilcate
> Otoya, Josep Prat, Joseph (Ting-Chou) Lin, José Armando García Sancio,
> Jun Rao, Justine Olshan, Kamal Chandraprakash, Keith Wall, Kirk True,
> Lianet Magrans, LinShunKang, Liu Zeyu, lixy, Lucas Bradstreet, Lucas
> Brutschy, Lucent-Wong, Lucia Cerchie, Luke Chen, Manikumar Reddy,
> Manyanda Chitimbo, Maros Orsak, Matthew de Detrich, Matthias J. Sax,
> maulin-vasavada, Max Riedel, Mehari Beyene, Michal Cabak (@miccab),
> Mickael Maison, Milind Mantri, minjian.cai, mojh7, Nikolay, Okada
> Haruki, Omnia G H Ibrahim, Owen Leung, Philip Nee, prasanthV, Proven
> Provenzano, Purshotam Chauhan, Qichao Chu, Rajini Sivaram, Randall
> Hauch, Renaldo Baur Filho, Ritika Reddy, Rittika Adhikari, Rohan, Ron
> Dagostino, Sagar Rao, Said Boudjelda, Sambhav Jain, Satish Duggana,
> sciclon2, Shekhar Rajak, Sungyun Hur, Sushant Mahajan, Tanay
> Karmarkar, tison, Tom Bentley, vamossagar12, Victoria Xia, Vincent
> Jiang, vveicc, Walker Carlson, Yash Mayya, Yi-Sheng Lien, Ziming Deng,
> 蓝士钦
>
> We welcome your help and feedback. For more information on how to
> report problems, and to get involved, visit the project website at
> https://kafka.apache.org/
>
> Thank you!
>
> Regards,
> Satish Duggana
>


Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2270

2023-10-11 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 315908 lines...]

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
onlyRemovePendingTaskToCloseCleanShouldRemoveTaskFromPendingUpdateActions() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
shouldDrainPendingTasksToCreate() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
shouldDrainPendingTasksToCreate() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
onlyRemovePendingTaskToRecycleShouldRemoveTaskFromPendingUpdateActions() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseClean() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
shouldAddAndRemovePendingTaskToCloseDirty() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
shouldKeepAddedTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > TasksTest > 
shouldKeepAddedTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > StateQueryResultTest 
> More than one query result throws IllegalArgumentException STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > StateQueryResultTest 
> More than one query result throws IllegalArgumentException PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > StateQueryResultTest 
> Zero query results shouldn't error STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > StateQueryResultTest 
> Zero query results shouldn't error PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > StateQueryResultTest 
> Valid query results still works STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > StateQueryResultTest 
> Valid query results still works PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@6ebf3b09, 
org.apache.kafka.test.MockInternalProcessorContext@250fe06a STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@6ebf3b09, 
org.apache.kafka.test.MockInternalProcessorContext@250fe06a PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@35ea6baa, 
org.apache.kafka.test.MockInternalProcessorContext@604398c7 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > shouldRecordCorrectBlockCacheUsage(RocksDBStore, 
StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@35ea6baa, 
org.apache.kafka.test.MockInternalProcessorContext@604398c7 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@455cffb, 
org.apache.kafka.test.MockInternalProcessorContext@412ea7db STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [1] 
org.apache.kafka.streams.state.internals.RocksDBStore@455cffb, 
org.apache.kafka.test.MockInternalProcessorContext@412ea7db PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@34fa6341, 
org.apache.kafka.test.MockInternalProcessorContext@9ef5406 STARTED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCachePinnedUsage(RocksDBStore, StateStoreContext) > [2] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedStore@34fa6341, 
org.apache.kafka.test.MockInternalProcessorContext@9ef5406 PASSED

Gradle Test Run :streams:test > Gradle Test Executor 83 > 
RocksDBBlockCacheMetricsTest > 
shouldRecordCorrectBlockCacheCapacity(RocksDBStore, StateStoreContext) > [1]