Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #1233

2022-09-18 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 504709 lines...]
[2022-09-19T04:05:51.174Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorManyThreadsPerClient STARTED
[2022-09-19T04:05:52.134Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorManyThreadsPerClient PASSED
[2022-09-19T04:05:52.134Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorLargePartitionCount STARTED
[2022-09-19T04:06:32.298Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorLargePartitionCount PASSED
[2022-09-19T04:06:32.298Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testStickyTaskAssignorLargePartitionCount STARTED
[2022-09-19T04:07:07.102Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testStickyTaskAssignorLargePartitionCount PASSED
[2022-09-19T04:07:07.102Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorManyStandbys STARTED
[2022-09-19T04:07:16.031Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorManyStandbys PASSED
[2022-09-19T04:07:16.031Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testHighAvailabilityTaskAssignorManyStandbys STARTED
[2022-09-19T04:08:01.775Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testHighAvailabilityTaskAssignorManyStandbys PASSED
[2022-09-19T04:08:01.775Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorLargeNumConsumers STARTED
[2022-09-19T04:08:01.775Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testFallbackPriorTaskAssignorLargeNumConsumers PASSED
[2022-09-19T04:08:01.775Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testStickyTaskAssignorLargeNumConsumers STARTED
[2022-09-19T04:08:01.775Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > StreamsAssignmentScaleTest > testStickyTaskAssignorLargeNumConsumers PASSED
[2022-09-19T04:08:01.775Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > testConcurrentlyAccessThreads() STARTED
[2022-09-19T04:08:04.981Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > testConcurrentlyAccessThreads() PASSED
[2022-09-19T04:08:04.981Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > shouldResizeCacheAfterThreadReplacement() STARTED
[2022-09-19T04:08:09.612Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > shouldResizeCacheAfterThreadReplacement() PASSED
[2022-09-19T04:08:09.612Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > shouldAddAndRemoveThreadsMultipleTimes() STARTED
[2022-09-19T04:08:11.943Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > shouldAddAndRemoveThreadsMultipleTimes() PASSED
[2022-09-19T04:08:11.943Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > shouldnNotRemoveStreamThreadWithinTimeout() STARTED
[2022-09-19T04:08:14.272Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > shouldnNotRemoveStreamThreadWithinTimeout() PASSED
[2022-09-19T04:08:14.272Z] Gradle Test Run :streams:integrationTest > Gradle Test Executor 165 > AdjustStreamThreadCountTest > shouldAddAndRemoveStreamThreadsWhileKeepingNamesCorrect() STARTED
[2022-09-19T04:08:36.106Z] 

[jira] [Created] (KAFKA-14242) Hanging logManager in testReloadUpdatedFilesWithoutConfigChange test

2022-09-18 Thread Luke Chen (Jira)
Luke Chen created KAFKA-14242:
-

 Summary: Hanging logManager in testReloadUpdatedFilesWithoutConfigChange test
 Key: KAFKA-14242
 URL: https://issues.apache.org/jira/browse/KAFKA-14242
 Project: Kafka
 Issue Type: Test
 Reporter: Luke Chen
 Assignee: Luke Chen


Recently, we have seen a lot of builds fail (and get terminated) with a
core:unitTest failure. The failure messages look like this:
FAILURE: Build failed with an exception.
[2022-09-14T09:51:52.190Z] 
[2022-09-14T09:51:52.190Z] * What went wrong:
[2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'.
[2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with 
non-zero exit value 1
After investigation, I found one reason for it (there may be others). In the
BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange test, we
create a logManager twice, but during cleanup we only close one of them. As a
result, a log cleaner keeps running. In the meantime, the temp log dirs are
deleted, so the cleaner calls Exit.halt(1) and produces the error we saw in
Gradle, just as this code does when we encounter an IOException in all of our
log dirs:
fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")
Exit.halt(1)
And why does the test sometimes pass and sometimes fail? During test cluster
close, we shut down the broker first and then the other components, and the
log cleaner only runs on an interval. So if the cluster can close fast enough
and finish the test, it passes; otherwise, it exits with 1.
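For illustration only, here is a minimal Java sketch of the cleanup pattern the
fix needs: track every log manager the test creates and close all of them in
teardown, so no cleaner thread outlives the temp log dirs. The real test is
Scala, and the names below (LogManagerLike, createdManagers, register) are
hypothetical stand-ins, not the actual test code.

import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: remember every log manager the test creates and close
// all of them in teardown, not just the last one.
public class LogManagerCleanupSketch {

    // Stand-in for the real LogManager; only Closeable matters for the sketch.
    interface LogManagerLike extends Closeable {
        void startCleaner();
    }

    private final List<LogManagerLike> createdManagers = new ArrayList<>();

    // Called for every log manager the test creates.
    LogManagerLike register(LogManagerLike manager) {
        manager.startCleaner();
        createdManagers.add(manager);   // remember it for teardown
        return manager;
    }

    // Teardown: close every manager we created, so no cleaner keeps running.
    void tearDown() {
        for (LogManagerLike manager : createdManagers) {
            try {
                manager.close();
            } catch (IOException e) {
                // Best effort: keep closing the rest even if one fails.
            }
        }
        createdManagers.clear();
    }
}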



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: KafkaConsumer refactor proposal

2022-09-18 Thread Philip Nee
On Sun, Sep 18, 2022 at 6:03 AM Luke Chen  wrote:

> Hi Philip,
>
> Thanks for the write-up.
>
Also thank you for taking the time to read the proposal.  Very grateful.

> Some questions:
>
> 1. You said: If we don't need a coordinator, the background thread will
> stay in the *initialized* state.
> But in the definition of *initialized*, it said:
> *initialized*: The background thread completes initialization, and the loop
> has started.
> Does that mean, even if we don't need a coordinator, we still create a
> coordinator and run the loop? And why?
>
If we don't need a coordinator, I think the background thread should not
fire a FindCoordinator request, which is what is implemented today, right?
In terms of object instantiation, I think we can instantiate the object on
demand.
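
As a rough sketch of what "on demand" could look like: the background thread
only looks up the coordinator when an event actually needs one. The names
below (CoordinatorNode, Event, findCoordinator) are hypothetical placeholders,
not the real client classes.

import java.util.Optional;

// Sketch of on-demand coordinator discovery in the background thread.
class OnDemandCoordinatorSketch {

    record CoordinatorNode(String host, int port) { }
    record Event(String name, boolean requiresCoordinator) { }

    private Optional<CoordinatorNode> coordinator = Optional.empty();

    void handle(Event event) {
        if (event.requiresCoordinator() && coordinator.isEmpty()) {
            // e.g. a commit event: discover the coordinator lazily
            coordinator = findCoordinator();
        }
        // ... process the event ...
    }

    private Optional<CoordinatorNode> findCoordinator() {
        return Optional.of(new CoordinatorNode("localhost", 9092)); // placeholder
    }
}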

>
> 2. Why do we need so many rebalance states?
> After this proposal, we'll increase the number of states from 4 to 8.
> And we will have many state changes and state checks in the code. Could you
> explain why those states are necessary?
> I can imagine, like the method: `rebalanceInProgress`, will need to check
> if the current state is in
>
> PREPARE_REVOCATION/REVOKING_PARTITION/PARTITION_REVOKED/PREPARING_REBALANCE/.../...
> .
> So, maybe you should explain why these states are necessary. To me, like
> PREPARE_REVOCATION/REVOKING_PARTITION, I don't understand why we need 2
> states for them? Any reason there?
>
I think we need to do three things when revoking a partition: before the
revocation, during the revocation, and after the revocation.
  1. Before: prepare the revocation, i.e., pause the data, send out
commits (as in the current implementation), and send an event to the
polling thread to invoke the callback.
  2. During: wait for the callback to be triggered.  Meanwhile, the
background thread will continue to poll the channel until the ack event
arrives.
  3. After: once the invocation ack event is received, we move on to
requesting to join the group.

Maybe you are right, perhaps, we don't need the PARTITION_REVOKED state.
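
To make the three phases concrete, here is a minimal Java sketch of how the
background thread could step through them. The enum values follow the names
discussed above, but the class and the helper methods (pauseFetching,
sendAutoCommit, and so on) are hypothetical placeholders, not the actual
consumer code.

// Minimal sketch of the three revocation phases discussed above.
enum RevocationState { PREPARE_REVOCATION, REVOKING_PARTITION, PARTITION_REVOKED }

class RevocationSketch {
    private RevocationState state = RevocationState.PREPARE_REVOCATION;

    void step(boolean callbackAckReceived) {
        switch (state) {
            case PREPARE_REVOCATION:
                pauseFetching();          // before: pause data for the revoked partitions
                sendAutoCommit();         // send out commits, as in the current implementation
                sendRevocationEventToPollingThread();
                state = RevocationState.REVOKING_PARTITION;
                break;
            case REVOKING_PARTITION:
                // during: keep polling the channel until the callback ack arrives
                if (callbackAckReceived) {
                    state = RevocationState.PARTITION_REVOKED;
                }
                break;
            case PARTITION_REVOKED:
                sendJoinGroupRequest();   // after: move on to rejoining the group
                break;
        }
    }

    private void pauseFetching() { }
    private void sendAutoCommit() { }
    private void sendRevocationEventToPollingThread() { }
    private void sendJoinGroupRequest() { }
}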

>
> 3. How could we handle the Subscription state changing in the poll thread
> before the REVOKE_PARTITION/ASSIGN_PARTITION events arrive?
> Suppose the background thread sends a REVOKE_PARTITION event to the poll
> thread, and before handling it, the consumer updates the subscription.
> In this situation, will we still invoke the revoke callback or not?
> This won't happen in the current design because they are handled in one thread.
>

Thanks for bringing this to the table. Would it make sense to lock the
subscriptionState in scenarios like this? That would require API changes,
like throwing an exception to tell users that a revocation is in progress.

>
> 4. *Is there a better way to configure session interval and heartbeat
> interval?*
> These configs are moved to broker side in KIP-848 IIRC.
> Maybe you can check that KIP and update here.
>
Thanks.

>
> Some typos and hard-to-understand parts:
>
Hmm, terribly sorry about these typos.  Turned out I didn't publish the
updated copy.  Will address them and update the document right away.


> 5. Firstly, it _mcomplexfixes_ increasingly difficult ???
> 6. it simplifies the _pd_ design ? What is pd?
> 7. In "Background thread and its lifecycle" section, I guess the 2 points
> should be 3c and 3d, right?
> That is:
> a. Check if there is an in-flight event. If not, poll for the new events
> from the channel.
> b. Run the state machine, and here are the following scenario:
> c. Poll the networkClient.
> d. Backoff for retryBackoffMs milliseconds.
>
> Thank you.
>
Again, really grateful for you to review the proposal, so thank you!

P

> Luke
>
> On Sat, Sep 17, 2022 at 6:29 AM Philip Nee  wrote:
>
> > On Fri, Sep 16, 2022 at 3:01 PM Guozhang Wang 
> wrote:
> >
> > > Hi Philip,
> > >
> > > On Fri, Sep 16, 2022 at 1:20 PM Philip Nee 
> wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > Thank you for the reviews, and I agree with your suggestions: If we
> > find
> > > > any behavior changes, we will issue a KIP. Is it possible to first
> > > publish
> > > > the new implementation in a separate class so that we don't disrupt
> the
> > > > existing implementation? Then we can issue a KIP to address the
> > > > regressions.
> > > >
> > > > Yes! That's what I meant to clarify. Just in case other people may be
> > > confused why we did not propose a KIP right away, but first as a
> > one-pager.
> > >
> > >
> > > > To your questions:
> > > > 1. I think the coordinator discovery will happen on demand, i.e., the
> > > > background thread will only try to discover a coordinator when the
> > event
> > > > (e.g., commit) requires one. I've noted that, and I'll make sure that
> > in
> > > > the proposal.
> > > >
> > > Thanks. Could you then clarify that for the coordinator state
> transition?
> > > Is that just between `down` and `initialized`, and then `initialized`
> can
> > > then transition back to `down` too?
> > >
> >
> > Will do, I'll update the 1 pager to clarify it.

Re: [DISCUSS] KIP-868 Metadata Transactions (new thread)

2022-09-18 Thread Luke Chen
Hi David,

Thanks for the KIP!
It's a lightweight transaction proposal for a single producer, cool!
+1 for it!

Luke


On Sat, Sep 10, 2022 at 1:14 AM David Arthur  wrote:

> Starting a new thread to avoid issues with mail client threading.
>
> Original thread follows:
>
> Hey folks, I'd like to start a discussion on the idea of adding
> transactions in the KRaft controller. This will allow us to overcome
> the current limitation of atomic batch sizes in Raft which lets us do
> things like create topics with a huge number of partitions.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-868+Metadata+Transactions
>
> Thanks!
>
> ---
>
> Colin McCabe said:
>
> Thanks for this KIP, David!
>
> In the "motivation" section, it might help to give a concrete example
> of an operation we want to be atomic. My favorite one is probably
> CreateTopics since it's easy to see that we want to create all of a
> topic or none of it, and a topic could be a potentially unbounded
> number of records (although hopefully people have reasonable create
> topic policy classes in place...)
>
> In "broker support", it would be good to clarify that we will buffer
> the records in the MetadataDelta and not publish a new MetadataImage
> until the transaction is over. This is an implementation detail, but
> it's a simple one and I think it will make it easier to understand how
> this works.
>
> In the "Raft Transactions" section of "Rejected Alternatives," I'd add
> that managing buffering in the Raft layer would be a lot less
> efficient than doing it in the controller / broker layer. We would end
> up accumulating big lists of records which would then have to be
> applied when the transaction completed, rather than building up a
> MetadataDelta (or updating the controller state) incrementally.
>
> Maybe we want to introduce the concept of "last stable offset" to be
> the last committed offset that is NOT part of an ongoing transaction?
> Just a nomenclature suggestion...
>
> best,
> Colin
>
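
A minimal Java sketch of the buffering idea described above: records accumulate
in a pending delta, and a new image is only published once the transaction
ends. MetadataRecord, Image, and the method names are hypothetical stand-ins,
not the real KRaft classes.

import java.util.ArrayList;
import java.util.List;

// Sketch: buffer transaction records and only publish a new image at commit time.
class TransactionBufferingSketch {

    record MetadataRecord(String payload) { }
    record Image(List<MetadataRecord> records) { }

    private final List<MetadataRecord> pendingDelta = new ArrayList<>();
    private Image publishedImage = new Image(List.of());
    private boolean inTransaction = false;

    void beginTransaction() {
        inTransaction = true;
    }

    void apply(MetadataRecord record) {
        // Records are accumulated incrementally, even inside a transaction.
        pendingDelta.add(record);
        if (!inTransaction) {
            publish();
        }
    }

    void endTransaction() {
        inTransaction = false;
        publish();                 // only now does a new image become visible
    }

    private void publish() {
        List<MetadataRecord> merged = new ArrayList<>(publishedImage.records());
        merged.addAll(pendingDelta);
        publishedImage = new Image(List.copyOf(merged));
        pendingDelta.clear();
    }
}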


Re: KafkaConsumer refactor proposal

2022-09-18 Thread Luke Chen
Hi Philip,

Thanks for the write-up.
Some questions:

1. You said: If we don't need a coordinator, the background thread will
stay in the *initialized* state.
But in the definition of *initialized*, it said:
*initialized*: The background thread completes initialization, and the loop
has started.
Does that mean, even if we don't need a coordinator, we still create a
coordinator and run the loop? And why?

2. Why do we need so many rebalance states?
After this proposal, we'll increase the number of states from 4 to 8.
And we will have many state changes and state checks in the code. Could you
explain why those states are necessary?
I can imagine, like the method: `rebalanceInProgress`, will need to check
if the current state is in
PREPARE_REVOCATION/REVOKING_PARTITION/PARTITION_REVOKED/PREPARING_REBALANCE/.../...
.
So, maybe you should explain why these states are necessary. To me, like
PREPARE_REVOCATION/REVOKING_PARTITION, I don't understand why we need 2
states for them? Any reason there?

3. How could we handle the Subscription state changing in the poll thread
before the REVOKE_PARTITION/ASSIGN_PARTITION events arrive?
Suppose the background thread sends a REVOKE_PARTITION event to the poll
thread, and before handling it, the consumer updates the subscription.
In this situation, will we still invoke the revoke callback or not?
This won't happen in the current design because they are handled in one thread.

4. *Is there a better way to configure session interval and heartbeat
interval?*
These configs are moved to broker side in KIP-848 IIRC.
Maybe you can check that KIP and update here.

Some typos and hard-to-understand parts:
5. Firstly, it _mcomplexfixes_ increasingly difficult ???
6. it simplifies the _pd_ design ? What is pd?
7. In "Background thread and its lifecycle" section, I guess the 2 points
should be 3c and 3d, right?
That is:
a. Check if there is an in-flight event. If not, poll for the new events
from the channel.
b. Run the state machine, and here are the following scenario:
c. Poll the networkClient.
d. Backoff for retryBackoffMs milliseconds.
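
For reference, a minimal Java sketch of that a-d loop (Event, the channel, and
the helper methods below are hypothetical placeholders, not the proposal's
actual code):

import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the background thread's a-d loop described above.
class BackgroundThreadLoopSketch implements Runnable {

    record Event(String name) { }

    private final BlockingQueue<Event> channel = new LinkedBlockingQueue<>();
    private final long retryBackoffMs = 100L;
    private volatile boolean running = true;
    private Optional<Event> inflightEvent = Optional.empty();

    @Override
    public void run() {
        while (running) {
            // a. If there is no in-flight event, poll for new events from the channel.
            if (inflightEvent.isEmpty()) {
                inflightEvent = Optional.ofNullable(channel.poll());
            }
            // b. Run the state machine for the current scenario.
            runStateMachine(inflightEvent);
            // c. Poll the network client.
            pollNetworkClient();
            // d. Back off for retryBackoffMs milliseconds.
            try {
                Thread.sleep(retryBackoffMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                running = false;
            }
        }
    }

    // Placeholder: process the event and clear inflightEvent when it completes.
    private void runStateMachine(Optional<Event> event) { }
    private void pollNetworkClient() { }
}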

Thank you.
Luke

On Sat, Sep 17, 2022 at 6:29 AM Philip Nee  wrote:

> On Fri, Sep 16, 2022 at 3:01 PM Guozhang Wang  wrote:
>
> > Hi Philip,
> >
> > On Fri, Sep 16, 2022 at 1:20 PM Philip Nee  wrote:
> >
> > > Hi Guozhang,
> > >
> > > Thank you for the reviews, and I agree with your suggestions: If we
> find
> > > any behavior changes, we will issue a KIP. Is it possible to first
> > publish
> > > the new implementation in a separate class so that we don't disrupt the
> > > existing implementation? Then we can issue a KIP to address the
> > > regressions.
> > >
> > > Yes! That's what I meant to clarify. Just in case other people may be
> > confused why we did not propose a KIP right away, but first as a
> one-pager.
> >
> >
> > > To your questions:
> > > 1. I think the coordinator discovery will happen on demand, i.e., the
> > > background thread will only try to discover a coordinator when the
> event
> > > (e.g., commit) requires one. I've noted that, and I'll make sure that
> in
> > > the proposal.
> > >
> > Thanks. Could you then clarify that for the coordinator state transition?
> > Is that just between `down` and `initialized`, and then `initialized` can
> > then transition back to `down` too?
> >
>
> Will do, I'll update the 1 pager to clarify it.
>
> >
> >
> > > 2. Acked. I think I'll add a section about how each operation works in
> > the
> > > new implementation.
> >
> > 3a. I will modify the existing state enum in the new implementation.
> > > 3b. Ah, I must have missed this when proofreading the document. I think
> > the
> > > state should be:
> > >   Unjoined -> Commit_async -> Revoking_Partitions -> Partitions_Revoked ->
> > > Join_Group -> Completed_Rebalancing -> Assigning_Partitions ->
> > > Partitions_Assigned -> Stable
> > >   I've made a note of that, and I'll update the document.
> > >
> > Got it, if we are introducing a third state for auto committing, then upon
> > completing the rebalance, we may also need to transition to that state since
> > we may also revoke partitions, right? This is for fixing KAFKA-14224
> >
> - Instead of introducing a new state, I think I want to break up the
> PREPARING_REBALANCE into Commit and Revoke.  So I think for your suggestion
> in 14224, we could potentially do it this way.
> 1. REVOKING_PARTITION: mute the partition and send the callback event to
> the polling thread
> 2. PARTITION_REVOKED: Received the revocation ack.  Advance the state to
> COMMIT.
> 3. COMMIT: Commit the offsets.  Wait for the commit to finish, then we can
> start the join group.
>
> Again, thanks, I'll update these changes in the 1pager.
>
> >
> >
> > > 4. Noted. I'll add a section about exception handling.
> > > 5a.  Kirk also had the same comment. I updated the document.
> > > 5b. yes!
> > >
> > > Regarding the POC comments, I think the POC branch is actually quite
> > > dated. I need to update it. Do you think we can start with simple