Jenkins build is unstable: Kafka » Kafka Branch Builder » 3.4 #49

2023-01-20 Thread Apache Jenkins Server
See 




[DISCUSS] Tiered-Storage: Implement RocksdbBasedMetadataCache for TopicBased RLMM

2023-01-20 Thread hzh0425
Background:
In KIP-405 (Kafka Tiered Storage - Apache Kafka - Apache Software Foundation),
Kafka introduced tiered storage, and the RLMM (RemoteLogMetadataManager) is
responsible for storing remote segments' metadata.


For reference, [KAFKA-9555] Topic-based implementation for the RemoteLogMetadataManager -
ASF JIRA (apache.org) implements the default RLMM, the 'TopicBased-RLMM'.




Problem:
The TopicBased RLMM keeps all of the metadata it consumes in memory.

In our practice, we have found that as the metadata grows, it places a heavy
burden on the broker's memory (at the GB level), and at the same time it is
very time-consuming to save a snapshot of the full metadata set to disk.




Solution:

We hope to introduce RocksDB to solve this problem:
- Implement a RocksdbBasedMetadataCache.
- All metadata is stored on disk; only a small amount of RocksDB memory cache
is required.
- There is no need to pay the cost of saving a full metadata snapshot to disk,
since RocksDB persists changes incrementally.

A rough sketch of the idea is shown below.
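(The sketch uses the RocksDB Java API; the class and method names are
illustrative only, not the final implementation.)

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.nio.charset.StandardCharsets;

public class RocksdbBasedMetadataCache implements AutoCloseable {
    private final RocksDB db;

    public RocksdbBasedMetadataCache(String path) throws RocksDBException {
        RocksDB.loadLibrary();
        // Metadata lives on disk; RocksDB keeps only a small block cache in
        // memory and persists writes incrementally (WAL + SST files), so no
        // full snapshot of the cache ever needs to be written.
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    }

    // Store serialized remote segment metadata keyed by remote segment id.
    public void put(String remoteSegmentId, byte[] serializedMetadata) throws RocksDBException {
        db.put(remoteSegmentId.getBytes(StandardCharsets.UTF_8), serializedMetadata);
    }

    // Look up serialized metadata; returns null if the segment is unknown.
    public byte[] get(String remoteSegmentId) throws RocksDBException {
        return db.get(remoteSegmentId.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() {
        db.close();
    }
}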


You are welcome to discuss this improvement by replying to this email!


Thanks,
Hzh0425


hzhkafka
hzhka...@163.com

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #1527

2023-01-20 Thread Apache Jenkins Server
See 




Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #1526

2023-01-20 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 528129 lines...]
[2023-01-21T01:11:52.470Z] 
[2023-01-21T01:11:52.470Z] > Task :clients:javadoc
[2023-01-21T01:11:52.470Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginCallbackHandler.java:151:
 warning - Tag @link: reference not found: 
[2023-01-21T01:11:52.470Z] 
[2023-01-21T01:11:52.470Z] > Task :connect:api:copyDependantLibs UP-TO-DATE
[2023-01-21T01:11:52.470Z] > Task :connect:api:jar UP-TO-DATE
[2023-01-21T01:11:52.470Z] > Task 
:connect:api:generateMetadataFileForMavenJavaPublication
[2023-01-21T01:11:53.416Z] > Task :connect:json:copyDependantLibs UP-TO-DATE
[2023-01-21T01:11:53.416Z] > Task :connect:json:jar UP-TO-DATE
[2023-01-21T01:11:53.416Z] > Task 
:connect:json:generateMetadataFileForMavenJavaPublication
[2023-01-21T01:11:53.416Z] > Task :connect:api:javadocJar
[2023-01-21T01:11:53.416Z] > Task :connect:api:compileTestJava UP-TO-DATE
[2023-01-21T01:11:53.416Z] > Task :connect:api:testClasses UP-TO-DATE
[2023-01-21T01:11:53.416Z] > Task 
:connect:json:publishMavenJavaPublicationToMavenLocal
[2023-01-21T01:11:53.416Z] > Task :connect:json:publishToMavenLocal
[2023-01-21T01:11:53.416Z] > Task :connect:api:testJar
[2023-01-21T01:11:53.416Z] > Task :connect:api:testSrcJar
[2023-01-21T01:11:53.416Z] > Task 
:connect:api:publishMavenJavaPublicationToMavenLocal
[2023-01-21T01:11:53.416Z] > Task :connect:api:publishToMavenLocal
[2023-01-21T01:11:53.416Z] 
[2023-01-21T01:11:53.416Z] > Task :streams:javadoc
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/TopologyConfig.java:62:
 warning - Tag @link: missing '#': "org.apache.kafka.streams.StreamsBuilder()"
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/TopologyConfig.java:62:
 warning - Tag @link: can't find org.apache.kafka.streams.StreamsBuilder() in 
org.apache.kafka.streams.TopologyConfig
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/TopologyDescription.java:38:
 warning - Tag @link: reference not found: ProcessorContext#forward(Object, 
Object) forwards
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/query/Position.java:44:
 warning - Tag @link: can't find query(Query,
[2023-01-21T01:11:53.416Z]  PositionBound, boolean) in 
org.apache.kafka.streams.processor.StateStore
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:109:
 warning - Tag @link: reference not found: this#getResult()
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:116:
 warning - Tag @link: reference not found: this#getFailureReason()
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:116:
 warning - Tag @link: reference not found: this#getFailureMessage()
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:154:
 warning - Tag @link: reference not found: this#isSuccess()
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:154:
 warning - Tag @link: reference not found: this#isFailure()
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:890:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:919:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-21T01:11:53.416Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:939:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-21T01:11:53.416Z] 25 warnings
[2023-01-21T01:11:54.362Z] 
[2023-01-21T01:11:54.362Z] > Task :streams:javadocJar
[2023-01-21T01:11:55.307Z] 
[2023-01-21T01:11:55.307Z] > Task :clients:javadoc
[2023-01-21T01:11:55.308Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk@2/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/secured/package-info.java:21:
 warning - Tag @link: reference not found: 
org.apache.kafka.common.sec

Jenkins build is unstable: Kafka » Kafka Branch Builder » 3.4 #47

2023-01-20 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-20 Thread Artem Livshits
>  looks like we already have code to handle bumping the epoch and
when the epoch is Short.MAX_VALUE, we get a new producer ID.

My understanding is that this logic is currently encapsulated in the broker
and the client doesn't really know at which epoch value the new producer id
is generated.  With the new protocol, the client would need to be aware.
We don't need to change the logic, just document it.  With our
implementation, once the epoch reaches Short.MAX_VALUE it cannot be used
further, but a naïve client implementer may miss this point, and it may be
missed in testing if the tests don't overflow the epoch; then, once they
hit the issue, it's not immediately obvious from the KIP how to handle it.
Explicitly documenting this point in the KIP would help to avoid (or
quickly resolve) such issues.
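For illustration, a client-side guard for this could be as simple as the
following sketch (the method name is an assumption, not the actual producer
internals):

// Decide whether the epoch can still be bumped or a new producer id must
// be requested first.
static boolean mustRequestNewProducerId(short producerEpoch) {
    // Short.MAX_VALUE is the last representable epoch; once it is reached, the
    // producer id has to be replaced (e.g. via InitProducerId) rather than bumped.
    return producerEpoch == Short.MAX_VALUE;
}

Without such a guard, a naive "epoch + 1" silently wraps around to a negative
value.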

-Artem

On Wed, Jan 18, 2023 at 3:01 PM Justine Olshan 
wrote:

> Yeah -- looks like we already have code to handle bumping the epoch and
> when the epoch is Short.MAX_VALUE, we get a new producer ID. Since this is
> already the behavior, do we want to change it further?
>
> Justine
>
> On Wed, Jan 18, 2023 at 1:12 PM Justine Olshan 
> wrote:
>
> > Hey all, just wanted to quickly update and say I've modified the KIP to
> > explicitly mention that AddOffsetCommitsToTxnRequest will be replaced by
> > a coordinator-side (inter-broker) AddPartitionsToTxn implicit request.
> This
> > mirrors the user partitions and will implicitly add offset partitions to
> > transactions when we commit offsets on them. We will deprecate
> AddOffsetCommitsToTxnRequest
> > for new clients.
> >
> > Also to address Artem's comments --
> > I'm a bit unsure if the changes here will change the previous behavior
> for
> > fencing producers. In the case you mention in the first paragraph, are
> you
> > saying we bump the epoch before we try to abort the transaction? I think
> I
> > need to understand the scenarios you mention a bit better.
> >
> > As for the second part -- I think it makes sense to have some sort of
> > "sentinel" epoch to signal epoch is about to overflow (I think we sort of
> > have this value in place in some ways) so we can codify it in the KIP.
> I'll
> > look into that and try to update soon.
> >
> > Thanks,
> > Justine.
> >
> > On Fri, Jan 13, 2023 at 5:01 PM Artem Livshits
> >  wrote:
> >
> >> It's good to know that KIP-588 addressed some of the issues.  Looking at
> >> the code, it still looks like there are some cases that would result in
> >> fatal error, e.g. PRODUCER_FENCED is issued by the transaction
> coordinator
> >> if epoch doesn't match, and the client treats it as a fatal error (code
> in
> >> TransactionManager request handling).  If we consider, for example,
> >> committing a transaction that returns a timeout, but actually succeeds,
> >> trying to abort it or re-commit may result in PRODUCER_FENCED error
> >> (because of epoch bump).
> >>
> >> For failed commits, specifically, we need to know the actual outcome,
> >> because if we return an error the application may think that the
> >> transaction is aborted and redo the work, leading to duplicates.
> >>
> >> Re: overflowing epoch.  We could either do it on the TC and return both
> >> producer id and epoch (e.g. change the protocol), or signal the client
> >> that
> >> it needs to get a new producer id.  Checking for max epoch could be a
> >> reasonable signal, the value to check should probably be present in the
> >> KIP
> >> as this is effectively a part of the contract.  Also, the TC should
> >> probably return an error if the client didn't change producer id after
> >> hitting max epoch.
> >>
> >> -Artem
> >>
> >>
> >> On Thu, Jan 12, 2023 at 10:31 AM Justine Olshan
> >>  wrote:
> >>
> >> > Thanks for the discussion Artem.
> >> >
> >> > With respect to the handling of fenced producers, we have some
> behavior
> >> > already in place. As of KIP-588:
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
> >> > ,
> >> > we handle timeouts more gracefully. The producer can recover.
> >> >
> >> > Produce requests can also recover from epoch fencing by aborting the
> >> > transaction and starting over.
> >> >
> >> > What other cases were you considering that would cause us to have a
> >> fenced
> >> > epoch but we'd want to recover?
> >> >
> >> > The first point about handling epoch overflows is fair. I think there
> is
> >> > some logic we'd need to consider. (ie, if we are one away from the max
> >> > epoch, we need to reset the producer ID.) I'm still wondering if there
> >> is a
> >> > way to direct this from the response, or if everything should be done
> on
> >> > the client side. Let me know if you have any thoughts here.
> >> >
> >> > Thanks,
> >> > Justine
> >> >
> >> > On Tue, Jan 10, 2023 at 4:06 PM Artem Livshits
> >> >  wrote:
> >> >
> >> > > There are some workflows in the client that are implied by protocol
> >> > > changes, e.g.:
> >> > >
> >> > 

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-20 Thread Justine Olshan
That's a fair point about other clients.

I think the abortable error case is interesting because I'm curious how
other clients would handle this. I assume they would need to implement
handling for the error code unless they did something like "any unknown
error codes/any codes that aren't x,y,z are retriable." I would hope that
unknown error codes were fatal, and if the code was implemented it would
abort the transaction. But I will think on this too.

As for InvalidRecord -- you mentioned it was not fatal, but I'm taking a
look through the code. We would see this on handling the produce response.
If I recall correctly, we check if errors are retriable. I think this error
would not be retriable. But I guess the concern here is that it is not
enough for just that batch to fail. I guess I hadn't considered fully
fencing the old producer but there are valid arguments here why we would
want to.

Thanks,
Justine

On Fri, Jan 20, 2023 at 2:35 PM Guozhang Wang 
wrote:

> Thanks Justine for the replies! I agree with most of your thoughts.
>
> Just for 3/7), though I agree for our own AK producer -- since we do
> "nextRequest(boolean hasIncompleteBatches)", we guarantee the end-txn
> would not be sent until we've effectively flushed -- I was referring
> to any future bugs or other buggy clients where the client may get
> into this situation, in which case we should give the client a clear
> msg that "you did something wrong, and hence now you should fatally
> close yourself". What I'm concerned about is that, by seeing an
> "abortable error" or in some rare cases an "invalid record", the
> client could not realize that "something that's really bad happened". So
> it's not about adding a new error; it's mainly about those real buggy
> situations causing such "should never happen" cases, where the errors
> returned would not be informative enough.
>
> Thinking about it another way: if we believe that for most cases such error
> codes would not reach the original clients, since they would be
> disconnected or even gone by that time, and only in some rare cases
> would they still be seen by the sending clients, then why not make
> them more fatal and more specific rather than generic.
>
> Guozhang
>
> On Fri, Jan 20, 2023 at 1:59 PM Justine Olshan
>  wrote:
> >
> > Hey Guozhang. Thanks for taking a look and for the detailed comments!
> I'll
> > do my best to address below.
> >
> > 1. I see what you are saying here, but I think I need to look through the
> > sequence of events you mention. Typically we've seen this issue in a few
> > cases.
> >
> >  One is when we have a producer disconnect when trying to produce.
> > Typically in these cases, we abort the transaction. We've seen that after
> > the markers are written, the disconnection can sometimes cause the
> request
> > to get flushed to the broker. In this case, we don't need client handling
> > because the producer we are responding to is gone. We just needed to make
> > sure we didn't write to the log on the broker side. I'm trying to think
> of
> > a case where we do have the client to return to. I'd think the same
> client
> > couldn't progress to committing the transaction unless the produce
> request
> > returned right? Of course, there is the incorrectly written clients case.
> > I'll think on this a bit more and let you know if I come up with another
> > scenario when we would return to an active client when the transaction is
> > no longer ongoing.
> >
> > I was not aware that we checked the result of a send after we commit
> > though. I'll need to look into that a bit more.
> >
> > 2. There were some questions about this in the discussion. The plan is to
> > handle overflow with the mechanism we currently have in the producer. If
> we
> > try to bump and the epoch will overflow, we actually allocate a new
> > producer ID. I need to confirm the fencing logic on the last epoch (ie,
> we
> > probably shouldn't allow any records to be produced with the final epoch
> > since we can never properly fence that one).
> >
> > 3. I can agree with you that the current error handling is messy. I
> recall
> > taking a look at your KIP a while back, but I think I mostly saw the
> > section about how the errors were wrapped. Maybe I need to take another
> > look. As for abortable error, the idea was that the handling would be
> > simple -- if this error is seen, the transaction should be aborted -- no
> > other logic about previous state or requests necessary. Is your concern
> > simply about adding new errors? We were hoping to have an error that
> would
> > have one meaning and many of the current errors have a history of meaning
> > different things on different client versions. That was the main
> motivation
> > for adding a new error.
> >
> > 4. This is a good point about record timestamp reordering. Timestamps
> don't
> > affect compaction, but they do affect retention deletion. For that, kafka
> > considers the largest timestamp in the segment, so I think a small amount
> > of reorderin

[jira] [Created] (KAFKA-14644) Process should stop after failure in raft IO thread

2023-01-20 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14644:
---

 Summary: Process should stop after failure in raft IO thread
 Key: KAFKA-14644
 URL: https://issues.apache.org/jira/browse/KAFKA-14644
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


We have seen a few cases where an unexpected error in the Raft IO thread causes 
the process to enter a zombie state where it is no longer participating in the 
raft quorum. In this state, a controller can no longer become leader or help in 
elections, and brokers can no longer update metadata. It may be better to stop 
the process in this case since there is no way to recover.
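A rough sketch of the intended behavior (the thread and handler names are
illustrative, not the actual KRaft code): install an uncaught-exception handler
on the Raft IO thread that halts the process instead of leaving it running as a
zombie.

Runnable raftIoLoop = () -> { /* poll the Raft client until shutdown */ };
Thread raftIoThread = new Thread(raftIoLoop, "raft-io-thread");
raftIoThread.setUncaughtExceptionHandler((thread, error) -> {
    System.err.println("Fatal error in " + thread.getName() + ": " + error);
    // Halt immediately: a node that can no longer participate in the quorum
    // cannot recover in place, and staying up only masks the failure.
    Runtime.getRuntime().halt(1);
});
raftIoThread.start();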



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-20 Thread Guozhang Wang
Thanks Justine for the replies! I agree with most of your thoughts.

Just for 3/7), though I agree for our own AK producer -- since we do
"nextRequest(boolean hasIncompleteBatches)", we guarantee the end-txn
would not be sent until we've effectively flushed -- I was referring
to any future bugs or other buggy clients where the client may get
into this situation, in which case we should give the client a clear
msg that "you did something wrong, and hence now you should fatally
close yourself". What I'm concerned about is that, by seeing an
"abortable error" or in some rare cases an "invalid record", the
client could not realize that "something that's really bad happened". So
it's not about adding a new error; it's mainly about those real buggy
situations causing such "should never happen" cases, where the errors
returned would not be informative enough.

Thinking about it another way: if we believe that for most cases such error
codes would not reach the original clients, since they would be
disconnected or even gone by that time, and only in some rare cases
would they still be seen by the sending clients, then why not make
them more fatal and more specific rather than generic.

Guozhang

On Fri, Jan 20, 2023 at 1:59 PM Justine Olshan
 wrote:
>
> Hey Guozhang. Thanks for taking a look and for the detailed comments! I'll
> do my best to address below.
>
> 1. I see what you are saying here, but I think I need to look through the
> sequence of events you mention. Typically we've seen this issue in a few
> cases.
>
>  One is when we have a producer disconnect when trying to produce.
> Typically in these cases, we abort the transaction. We've seen that after
> the markers are written, the disconnection can sometimes cause the request
> to get flushed to the broker. In this case, we don't need client handling
> because the producer we are responding to is gone. We just needed to make
> sure we didn't write to the log on the broker side. I'm trying to think of
> a case where we do have the client to return to. I'd think the same client
> couldn't progress to committing the transaction unless the produce request
> returned right? Of course, there is the incorrectly written clients case.
> I'll think on this a bit more and let you know if I come up with another
> scenario when we would return to an active client when the transaction is
> no longer ongoing.
>
> I was not aware that we checked the result of a send after we commit
> though. I'll need to look into that a bit more.
>
> 2. There were some questions about this in the discussion. The plan is to
> handle overflow with the mechanism we currently have in the producer. If we
> try to bump and the epoch will overflow, we actually allocate a new
> producer ID. I need to confirm the fencing logic on the last epoch (ie, we
> probably shouldn't allow any records to be produced with the final epoch
> since we can never properly fence that one).
>
> 3. I can agree with you that the current error handling is messy. I recall
> taking a look at your KIP a while back, but I think I mostly saw the
> section about how the errors were wrapped. Maybe I need to take another
> look. As for abortable error, the idea was that the handling would be
> simple -- if this error is seen, the transaction should be aborted -- no
> other logic about previous state or requests necessary. Is your concern
> simply about adding new errors? We were hoping to have an error that would
> have one meaning and many of the current errors have a history of meaning
> different things on different client versions. That was the main motivation
> for adding a new error.
>
> 4. This is a good point about record timestamp reordering. Timestamps don't
> affect compaction, but they do affect retention deletion. For that, kafka
> considers the largest timestamp in the segment, so I think a small amount
> of reordering (hopefully on the order of milliseconds or even seconds) will
> be ok. We take timestamps from clients so there is already a possibility
> for some drift and non-monotonically increasing timestamps.
>
> 5. Thanks for catching. The error is there, but it's actually that those
> fields should be 4+! Due to how the message generator works, I actually
> have to redefine those fields inside the `AddPartitionsToTxnTransaction`
> block for it to build correctly. I'll fix it to be correct.
>
> 6. Correct -- we will only add the request to purgatory if the cache has no
> ongoing transaction. I can change the wording to make that clearer that we
> only place the request in purgatory if we need to contact the transaction
> coordinator.
>
> 7. We did take a look at some of the errors and it was hard to come up with
> a good one. I agree that InvalidTxnStateException is ideal except for the
> fact that it hasn't been returned on Produce requests before. The error
> handling for clients is a bit vague (which is why I opened KAFKA-14439
> ), but the decision we
> made here was to on

Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-20 Thread Justine Olshan
Hey Guozhang. Thanks for taking a look and for the detailed comments! I'll
do my best to address below.

1. I see what you are saying here, but I think I need to look through the
sequence of events you mention. Typically we've seen this issue in a few
cases.

 One is when we have a producer disconnect when trying to produce.
Typically in these cases, we abort the transaction. We've seen that after
the markers are written, the disconnection can sometimes cause the request
to get flushed to the broker. In this case, we don't need client handling
because the producer we are responding to is gone. We just needed to make
sure we didn't write to the log on the broker side. I'm trying to think of
a case where we do have the client to return to. I'd think the same client
couldn't progress to committing the transaction unless the produce request
returned right? Of course, there is the incorrectly written clients case.
I'll think on this a bit more and let you know if I come up with another
scenario when we would return to an active client when the transaction is
no longer ongoing.

I was not aware that we checked the result of a send after we commit
though. I'll need to look into that a bit more.

2. There were some questions about this in the discussion. The plan is to
handle overflow with the mechanism we currently have in the producer. If we
try to bump and the epoch will overflow, we actually allocate a new
producer ID. I need to confirm the fencing logic on the last epoch (ie, we
probably shouldn't allow any records to be produced with the final epoch
since we can never properly fence that one).

3. I can agree with you that the current error handling is messy. I recall
taking a look at your KIP a while back, but I think I mostly saw the
section about how the errors were wrapped. Maybe I need to take another
look. As for abortable error, the idea was that the handling would be
simple -- if this error is seen, the transaction should be aborted -- no
other logic about previous state or requests necessary. Is your concern
simply about adding new errors? We were hoping to have an error that would
have one meaning and many of the current errors have a history of meaning
different things on different client versions. That was the main motivation
for adding a new error.

4. This is a good point about record timestamp reordering. Timestamps don't
affect compaction, but they do affect retention deletion. For that, Kafka
considers the largest timestamp in the segment, so I think a small amount
of reordering (hopefully on the order of milliseconds or even seconds) will
be ok. We take timestamps from clients, so there is already a possibility
for some drift and non-monotonically increasing timestamps.
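To make the retention rule concrete (variable names are illustrative): a
segment only becomes eligible for time-based deletion once its largest record
timestamp has aged past retention.ms, so modest intra-segment reordering does
not change when the segment is deleted.

// Time-based retention check, evaluated per segment.
static boolean eligibleForDeletion(long segmentMaxTimestampMs, long nowMs, long retentionMs) {
    return nowMs - segmentMaxTimestampMs > retentionMs;
}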

5. Thanks for catching. The error is there, but it's actually that those
fields should be 4+! Due to how the message generator works, I actually
have to redefine those fields inside the `AddPartitionsToTxnTransaction`
block for it to build correctly. I'll fix it to be correct.

6. Correct -- we will only add the request to purgatory if the cache has no
ongoing transaction. I can change the wording to make that clearer that we
only place the request in purgatory if we need to contact the transaction
coordinator.

7. We did take a look at some of the errors and it was hard to come up with
a good one. I agree that InvalidTxnStateException is ideal except for the
fact that it hasn't been returned on Produce requests before. The error
handling for clients is a bit vague (which is why I opened KAFKA-14439
), but the decision we
made here was to only return errors that have been previously returned to
producers. As for not being fatal, I think part of the theory was that in
many cases, the producer would be disconnected. (See point 1) and this
would just be an error to return from the server. I did plan to think about
other cases, so let me know if you think of any as well!

Lots to say! Let me know if you have further thoughts!
Justine

On Fri, Jan 20, 2023 at 11:21 AM Guozhang Wang 
wrote:

> Hello Justine,
>
> Thanks for the great write-up! I made a quick pass through it and here
> are some thoughts (I have not been able to read through this thread so
> pardon me if they have overlapped or subsumed by previous comments):
>
> First are some meta ones:
>
> 1. I think we need to also improve the client's experience once we
> have this defence in place. More concretely, say a user's producer
> code is like following:
>
> future = producer.send();
> // producer.flush();
> producer.commitTransaction();
> future.get();
>
> Which resulted in the order of a) produce-request sent by producer, b)
> end-txn-request sent by producer, c) end-txn-response sent back, d)
> txn-marker-request sent from coordinator to partition leader, e)
> produce-request finally received by the partition leader, before this
> KIP e) step would be accepted causing a dangling txn; now it would be
> rejected in st

Re: [DISCUSS] KIP-900: KRaft kafka-storage.sh API additions to support SCRAM

2023-01-20 Thread Proven Provenzano
Hi Colin,
Thanks for the response.

I chose raw records, thinking it might be useful for future additions of
records that customers might want to add before the first start of the
cluster. I do see that it is, at best, an engineer-friendly interface.

I do think kafka-storage is the correct place to put the logic for adding
records to the bootstrap.checkpoint file. Keeping the logic for managing the
bootstrap separate from the logic for configuring a cluster that is already
running is a good division of functionality, and I think this separation will
reduce the parsing logic significantly.

The API suggestion you made for kafka-storage is okay. I would prefer to
have one option for an entire SCRAM config including the user, such as the
following:

./bin/kafka-storage.sh format \
  --config [my-config-path] \
  --cluster-id mb0Zz1YPTUeVzpedHHPT-Q \
  --release-version 3.5-IV0 \
  --scram-config
user=alice 'SCRAM-SHA-256=[iterations=8192,password=alicepass]' \
  --scram-config user=bob 'SCRAM-SHA-256=[password=bobpass]'

Argparse4j supports multiple option arguments for a single option, including
an optional (variable) number of option arguments per option.
I think adding the Argparse4j support for reading the arguments from a file
is a must.
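
For reference, a minimal argparse4j sketch of both features -- repeated,
multi-valued options and "@file" argument expansion (the parser and option
names are illustrative, not the actual kafka-storage.sh parser):

import net.sourceforge.argparse4j.ArgumentParsers;
import net.sourceforge.argparse4j.impl.Arguments;
import net.sourceforge.argparse4j.inf.ArgumentParser;
import net.sourceforge.argparse4j.inf.Namespace;

public class ScramArgsSketch {
    public static void main(String[] args) {
        ArgumentParser parser = ArgumentParsers
            .newFor("kafka-storage")
            .fromFilePrefix("@")   // "@myfile" expands to the arguments stored in myfile
            .build();
        // Each --scram-config takes one or more option arguments and may be repeated.
        parser.addArgument("--scram-config")
              .nargs("+")
              .action(Arguments.append());
        Namespace ns = parser.parseArgsOrFail(args);
        System.out.println(ns.getList("scram_config"));
    }
}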

--Proven


On Thu, Jan 19, 2023 at 7:07 PM Colin McCabe  wrote:

> Hi Proven,
>
> Thanks for putting this together.
>
> We always intended to have a way to bootstrap into using an all-SCRAM
> cluster, from scratch.
>
> I have two big comments here. First, I think we need a better interface
> than raw records. And second, I'm not sure that kafka-storage.sh is the
> right place to put this.
>
> I think raw records are going to be tough for people to use, because there
> are a lot of fields, and the values to set them to are not intuitive. For
> example, to set SHA512, the user needs to set "mechanism" equal to 2. That
> is going to be impossible to remember or figure out without looking at the
> source code. The other thing of course is that we may add more fields over
> time, including mandatory ones. So any documentation could quickly get out
> of date.
>
> I think people are going to want to specify SCRAM users here the same way
> they do when using the kafka-configs.sh tool. As a reminder, using
> kafka-configs.sh, they specify users like this:
>
> ./bin/kafka-configs --bootstrap-server localhost:9092 --alter \
>   --add-config 'SCRAM-SHA-256=[iterations=8192,password=pass]' \
>   --entity-type users \
>   --entity-name alice
>
> Of course, in this example, we're not specifying a salt. So we'd have to
> evaluate whether that's what we want for our use-case as well. On the plus
> side, specifying a salt could ensure that the bootstrap files end up
> identical on every node. On the minus side, it is another random number
> that users would need to generate and explicitly pass in.
>
> I would lean towards auto-generating the salt. I don't think the salt
> needs to be the same on all nodes. Only one controller will become active
> and write the bootstrap records to the log; no other controllers will do
> that. Brokers don't need to read the SCRAM records out of the bootstrap
> file.
>
> If we put all the functionality into kafka-storage.sh, it might look
> something like this:
>
> ./bin/kafka-storage.sh format \
>   --config [my-config-path] \
>   --cluster-id mb0Zz1YPTUeVzpedHHPT-Q \
>   --release-version 3.5-IV0 \
>   --scram-user alice \
>   --scram-config 'SCRAM-SHA-256=[iterations=8192,password=alicepass]' \
>   --scram-user bob \
>   --scram-config 'SCRAM-SHA-256=[password=bobpass]'
>
> (Here I am assuming that each --scram-user must be followed by exactly one
> --scram-config line)
>
> Perhaps it's worth considering whether it would be better to add a mode to
> kafka-configs.sh where it appends to a bootstrap file.
>
> If we do put everything into kafka-storage.sh, we should consider the
> plight of people with low limits on the maximum length of their command
> lines. One fix for these people could be allowing them to read their
> arguments from a file like this:
>
> $ ./bin/kafka-storage.sh @myfile
> $ cat myfile:
>   ./bin/kafka-storage.sh format \
> --config [my-config-path] \
>   ...
> [etc, etc.]
>
> Argparse4j supports this natively with fromFilePrefix. See
> https://argparse4j.github.io/usage.html#fromfileprefix
>
> best,
> Colin
>
>
> On Thu, Jan 19, 2023, at 11:08, Proven Provenzano wrote:
> > I have written a KIP describing the API additions needed to
> > kafka-storage
> > to store SCRAM
> > credentials at bootstrap time. Please take a look at
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-900%3A+KRaft+kafka-storage.sh+API+additions+to+support+SCRAM+for+Kafka+Brokers
> >
> > --
> > --Proven
>


Re: [DISCUSS] KIP-890 Server Side Defense

2023-01-20 Thread Guozhang Wang
Hello Justine,

Thanks for the great write-up! I made a quick pass through it and here
are some thoughts (I have not been able to read through this thread so
pardon me if they have overlapped or subsumed by previous comments):

First are some meta ones:

1. I think we need to also improve the client's experience once we
have this defence in place. More concretely, say a user's producer
code is like the following:

future = producer.send();
// producer.flush();
producer.commitTransaction();
future.get();

This results in the order: a) produce-request sent by producer, b)
end-txn-request sent by producer, c) end-txn-response sent back, d)
txn-marker-request sent from coordinator to partition leader, e)
produce-request finally received by the partition leader. Before this
KIP, step e) would be accepted, causing a dangling txn; now it would be
rejected in step e), which is good. But from the client's point of view
it becomes confusing, since `commitTransaction()` returns
successfully but the "future" throws an invalid-epoch error, and they
are not sure whether the transaction succeeded or not. In fact, it
"partially succeeded", with some msgs being rejected but others
committed successfully.

Of course the easy way to avoid this is to always call
"producer.flush()" before commitTxn; that's what we do ourselves
and what we recommend users do. But I suspect not everyone does it. In
fact, I just checked the javadoc in KafkaProducer and our code snippet
does not include a `flush()` call. So I'm thinking maybe we can enforce
flushing inside the `commitTxn` code before sending the
end-txn request.
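
For illustration, the safer pattern looks roughly like this (producer and
record are assumed to be an initialized transactional KafkaProducer and a
ProducerRecord; only the explicit flush is new compared to the snippet above):

Future<RecordMetadata> future = producer.send(record);
producer.flush();               // block until the produce request reaches the broker
producer.commitTransaction();   // the end-txn request can no longer overtake the data
future.get();                   // any send failure now surfaces with the commit, not after it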

2. I'd like to clarify a bit details on "just add partitions to the
transaction on the first produce request during a transaction". My
understanding is that the partition leader's cache has the producer id
/ sequence / epoch for the latest txn, either on-going or is completed
(upon receiving the marker request from coordinator). When a produce
request is received, if

* producer's epoch < cached epoch, or producer's epoch == cached epoch
but the latest txn is completed: the leader directly rejects with
invalid-epoch.
* producer's epoch > cached epoch: park the request and send an
add-partitions request to the coordinator (roughly sketched below).
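
A rough sketch of that check (hypothetical names, not actual broker code;
Errors is org.apache.kafka.common.protocol.Errors):

static Errors checkProduceAgainstCachedTxn(short producerEpoch, short cachedEpoch,
                                           boolean latestTxnCompleted) {
    if (producerEpoch < cachedEpoch
            || (producerEpoch == cachedEpoch && latestTxnCompleted)) {
        // Stale producer, or the latest txn already completed: reject directly.
        return Errors.INVALID_PRODUCER_EPOCH;
    }
    // producerEpoch > cachedEpoch: the leader does not know this txn yet, so it
    // parks the request and sends an add-partitions request to the coordinator.
    // Otherwise (same epoch, txn ongoing): append as usual.
    return Errors.NONE;
}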

In order to do it, does the coordinator need to bump the sequence and
reset epoch to 0 when the next epoch is going to overflow? If no need
to do so, then how we handle the (admittedly rare, but still may
happen) epoch overflow situation?

3. I'm a bit concerned about adding a generic "ABORTABLE_ERROR" given
we already have a pretty messy error classification and error handling
on the producer clients side --- I have a summary about the issues and
a proposal to address this in
https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
-- I understand we do not want to use "UNKNOWN_PRODUCER_ID" anymore
and in fact we intend to deprecate it in KIP-360 and eventually remove
it; but I'm wondering can we still use specific error codes. E.g. what
about "InvalidProducerEpochException" since for new clients, the
actual reason this would actually be rejected is indeed because the
epoch on the coordinator caused the add-partitions-request from the
brokers to be rejected anyways?

4. It seems we put the producer request into purgatory before we ever
append the records, while other producer's records may still be
appended during the time; and that potentially may result in some
re-ordering compared with reception order. I'm not super concerned
about it since Kafka does not guarantee reception ordering across
producers anyway, but it may make the timestamps of records inside a
partition more out of order. Are we aware of any scenarios,
such as future enhancements to log compaction, that may be affected by
this effect?

Below are just minor comments:

5. In "AddPartitionsToTxnTransaction" field of
"AddPartitionsToTxnRequest" RPC, the versions of those inner fields
are "0-3" while I thought they should be "0+" still?

6. Regarding "we can place the request in a purgatory of sorts and
check if there is any state for the transaction on the broker": i
think at this time when we just do the checks against the cached
state, we do not need to put the request to purgatory yet?

7. This is related to 3) above. I feel using "InvalidRecordException"
for older clients may also be a bit confusing, and also it is not
fatal -- for old clients, it had better be fatal, since this indicates
the client is doing something wrong and hence it should be closed.
And in general I'd prefer to use error codes with a slightly more
specific meaning for clients. That being said, I also feel
"InvalidProducerEpochException" is not suitable for old versioned
clients, and we'd have to pick one that old clients recognize. I'd
prefer "InvalidTxnStateException" but that one is supposed to be
returned from txn coordinators only today. I'd suggest we do a quick
check in the current client's code path and s

[GitHub] [kafka-site] bbejeck commented on pull request #482: MINOR: Rename description of flatMapValues transformation

2023-01-20 Thread GitBox


bbejeck commented on PR #482:
URL: https://github.com/apache/kafka-site/pull/482#issuecomment-1398623164

   Thanks @maseiler !





[GitHub] [kafka-site] bbejeck merged pull request #482: MINOR: Rename description of flatMapValues transformation

2023-01-20 Thread GitBox


bbejeck merged PR #482:
URL: https://github.com/apache/kafka-site/pull/482





[VOTE] 3.4.0 RC1

2023-01-20 Thread Sophie Blee-Goldman
Hello Kafka users, developers and client-developers,

This is the first candidate for release of Apache Kafka 3.4.0. Some of the
major features include:

* KIP-881: Rack-aware Partition Assignment for Kafka Consumers
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-881%3A+Rack-aware+Partition+Assignment+for+Kafka+Consumers>
(protocol changes only)

* KIP-876: Time based cluster metadata snapshots
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-876%3A+Time+based+cluster+metadata+snapshots>

* KIP-787: MM2 manage Kafka resources with custom Admin implementation.


* KIP-866 ZooKeeper to KRaft Migration
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration>
(Early Access)

For a full list of the features in this release, please refer to the
release notes:
https://home.apache.org/~ableegoldman/kafka-3.4.0-rc1/RELEASE_NOTES.html

*** Please download, test and vote by Friday, Jan 27th, 9am PT

Kafka's KEYS file containing PGP keys we use to sign the release:
https://kafka.apache.org/KEYS

* Release artifacts to be voted upon (source and binary):
https://home.apache.org/~ableegoldman/kafka-3.4.0-rc1/

* Maven artifacts to be voted upon:
https://repository.apache.org/content/groups/staging/org/apache/kafka/

* Javadoc:
https://home.apache.org/~ableegoldman/kafka-3.4.0-rc1/javadoc/

* Tag to be voted upon (off 3.4 branch) is the 3.4.0 tag:
https://github.com/apache/kafka/releases/tag/3.4.0-rc1

* Documentation:
https://kafka.apache.org/34/documentation.html

* Protocol:
https://kafka.apache.org/34/protocol.html

* Still working on getting a fully green build for both unit/integration
and system tests. I will update this thread with the successful builds as
we get them.


Thanks,
Sophie Blee-Goldman


[GitHub] [kafka-site] maseiler opened a new pull request, #482: MINOR: Rename description of flatMapValues transformation

2023-01-20 Thread GitBox


maseiler opened a new pull request, #482:
URL: https://github.com/apache/kafka-site/pull/482

   The table of (stateless) transformations uses the transformation name in the 
first column and a description in the second column. I adjusted the 
transformation name for FlatMapValues accordingly.
   
   See also [Kafka #8431](https://github.com/apache/kafka/pull/8431)
   
   @bbejeck 





Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.4 #46

2023-01-20 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 435732 lines...]
[2023-01-20T12:40:16.565Z] > Task :clients:javadoc
[2023-01-20T12:40:16.565Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginCallbackHandler.java:151:
 warning - Tag @link: reference not found: 
[2023-01-20T12:40:18.781Z] 
[2023-01-20T12:40:18.781Z] > Task :streams:javadoc
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:890:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:919:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:939:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:854:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:890:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:919:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/KStream.java:939:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/Produced.java:84:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/Produced.java:136:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/Produced.java:147:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/Repartitioned.java:101:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/kstream/Repartitioned.java:167:
 warning - Tag @link: reference not found: DefaultPartitioner
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/TopologyConfig.java:62:
 warning - Tag @link: missing '#': "org.apache.kafka.streams.StreamsBuilder()"
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/TopologyConfig.java:62:
 warning - Tag @link: can't find org.apache.kafka.streams.StreamsBuilder() in 
org.apache.kafka.streams.TopologyConfig
[2023-01-20T12:40:18.781Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/TopologyDescription.java:38:
 warning - Tag @link: reference not found: ProcessorContext#forward(Object, 
Object) forwards
[2023-01-20T12:40:18.782Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/query/Position.java:44:
 warning - Tag @link: can't find query(Query,
[2023-01-20T12:40:18.782Z]  PositionBound, boolean) in 
org.apache.kafka.streams.processor.StateStore
[2023-01-20T12:40:18.782Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:109:
 warning - Tag @link: reference not found: this#getResult()
[2023-01-20T12:40:18.782Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:116:
 warning - Tag @link: reference not found: this#getFailureReason()
[2023-01-20T12:40:18.782Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:116:
 warning - Tag @link: reference not found: this#getFailureMessage()
[2023-01-20T12:40:18.782Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:154:
 warning - Tag @link: reference not found: this#isSuccess()
[2023-01-20T12:40:18.782Z] 
/home/jenkins/workspace/Kafka_kafka_3.4/streams/src/main/java/org/apache/kafka/streams/query/QueryResult.java:154:
 warning - Tag @link: reference not found: this#isFailure

Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #1525

2023-01-20 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.4 #45

2023-01-20 Thread Apache Jenkins Server
See 




[GitHub] [kafka-site] ableegoldman merged pull request #481: MINOR: Pre-release of Kafka 3.4.0 documentation and javadocs

2023-01-20 Thread GitBox


ableegoldman merged PR #481:
URL: https://github.com/apache/kafka-site/pull/481





[GitHub] [kafka-site] ableegoldman opened a new pull request, #481: MINOR: Pre-release of Kafka 3.4.0 documentation and javadocs

2023-01-20 Thread GitBox


ableegoldman opened a new pull request, #481:
URL: https://github.com/apache/kafka-site/pull/481

   This adds the site documentation and javadocs for 3.4 generated with RC1. 
This PR does not include the updated reference in the top-level 
documentation.html file. 

