[ https://issues.apache.org/jira/browse/KAFKA-14419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779731#comment-17779731 ]

A. Sophie Blee-Goldman commented on KAFKA-14419:
------------------------------------------------

Hey, sorry for the long delay – I'm still trying to refresh my memory of this 
ticket and the related one, but after looking at it again with fresh eyes I 
think I've figured out what's going on here. If I'm reading this situation 
correctly, it does seem like there is some less-than-ideal behavior that we 
might be able to improve. 

Based on your recent logs, I think the root cause here is basically the same as 
what I fixed in [https://github.com/apache/kafka/pull/12869], just to a lesser 
degree. The issue in that patch was that Streams would sometimes trigger a 
followup rebalance even while the current rebalance was still going on, which 
led some members to drop out of the group upon hitting a REBALANCE_IN_PROGRESS 
error during the SyncGroup phase. The fix basically just made the StreamThread 
wait until the rebalance was over before triggering a followup.

This should have been sufficient, but I suppose it is still theoretically 
possible to run into the same issue. Taking a deeper look, the original issue 
could only arise because Streams uses a non-blocking poll, which allows it to 
return to its main loop and continue processing in the background during a 
rebalance. A lot of things happen throughout the loop, but the relevant 
operations are as follows (a rough code sketch follows the list):
 # Check the rebalance "schedule" and trigger one if:
 ## a rebalance has been requested for a time at or before the current time, and
 ## the consumer is not actively participating in a rebalance (i.e. it is 
sometime after a SyncGroup response has been received but before a new 
JoinGroup request is sent)
 # Poll for more records, during which time one or more of the following may 
occur:
 ## consumer enters a new rebalance by sending a JoinGroup request
 ## consumer participates in a rebalance by receiving the JoinGroup response 
and sending a SyncGroup request
 ## consumer completes an ongoing rebalance by receiving a SyncGroup response, 
after which it can commit offsets for revoked tasks and initialize new ones
 # Process more records, which might have been either:
 ## Newly-consumed during the last poll call, or
 ## Left over from a previous batch that could not be fully processed before 
needing to return to poll due to running out of time in the max.poll.interval
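
To make the interleaving easier to reason about, here's a very rough sketch of 
that loop. To be clear, the names and structure below are simplified 
illustrations, not the actual StreamThread code:

{code:java}
import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;

// Simplified, hypothetical sketch of the StreamThread loop described above.
// Field and method names are illustrative only.
class StreamThreadLoopSketch {
    private final Consumer<byte[], byte[]> mainConsumer;
    private long followupRebalanceRequestedAt = Long.MAX_VALUE; // the rebalance "schedule"
    private boolean rebalanceInProgress = false; // true between sending JoinGroup and receiving SyncGroup

    StreamThreadLoopSketch(final Consumer<byte[], byte[]> mainConsumer) {
        this.mainConsumer = mainConsumer;
    }

    void runOnce() {
        // Step 1: trigger a scheduled followup rebalance, but only in between rebalances
        if (followupRebalanceRequestedAt <= System.currentTimeMillis() && !rebalanceInProgress) {
            mainConsumer.enforceRebalance();              // forces a rejoin on the next poll
            followupRebalanceRequestedAt = Long.MAX_VALUE;
        }

        // Step 2: non-blocking poll; JoinGroup/SyncGroup progress happens inside this call
        final ConsumerRecords<byte[], byte[]> records = mainConsumer.poll(Duration.ZERO);

        // Step 3: process records (newly consumed or left over from an earlier poll),
        // bounded only by max.poll.interval.ms
        process(records);
    }

    private void process(final ConsumerRecords<byte[], byte[]> records) { /* ... */ }
}
{code}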

So here's what I'm imagining: let's say we have two consumers, A and B, with A 
being the group leader/assignor.
 # A new rebalance begins, and both threads send their JoinGroup requests 
before returning to process some records
 # A doesn't have many records left to process, so it quickly returns to the 
poll call in step 2 of the loop. However, B is still processing a large backlog
 # A performs the assignment and determines that a followup rebalance is 
needed, so it sets the rebalance schedule to <now>
 # After the assignment, A sends it out in the SyncGroup request and exits the 
poll call
 # A does some processing (or not) before returning to the poll and receiving 
the SyncGroup response
 # A exits the poll again, and this time when it reaches step 1 of the loop, it 
is now able to trigger the new rebalance
 # After A has requested a new rebalance, it finally returns to the poll call 
one more time, and rejoins the group/sends a JoinGroup request to kick it off
 # This whole time, B has had a large backlog of records, or a very high 
max.poll.interval, or a long GC pause – you get the idea. It's stuck in step 3
 # B finally finishes processing and leaves step 3, returning to the poll call 
during which it sends a very late SyncGroup request.
 # When the SyncGroup response is eventually received, B gets the 
REBALANCE_IN_PROGRESS error and fails its rebalance since the generation is 
stale

The fundamental issue here is that B is theoretically able to spend up to the 
max.poll.interval between sending its SyncGroup request and returning to poll 
to process the SyncGroup response, but A might be able to process its SyncGroup 
response, process its records, and then trigger a new rebalance all in that 
timeframe. This could happen when the task assignment is heavily imbalanced, 
for example. 
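To put rough numbers on it with the config attached to this ticket: 
max.poll.interval.ms is set to 300000, so B is allowed to spend up to 5 minutes 
between consecutive polls (meaning its SyncGroup response can sit unread for 
just as long), while a lightly loaded A can plausibly get from its own 
SyncGroup response to scheduling and kicking off the next rebalance within a 
couple of seconds.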

I can see a few potential paths forward here, and a fourth option that is more 
of a temporary workaround for [~Carlstedt] if you're still encountering this. 
None of them are really a guarantee, but they would help. For the most 
comprehensive fix we might want to consider doing two or even all three of 
these:

Option 1: add a small delay to the Streams followup rebalance trigger to help 
the entire group finish the SyncGroup phase before beginning the next rebalance.
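
To make that concrete, something roughly along these lines. The class, method, 
and delay constant are completely made up for illustration:

{code:java}
// Minimal sketch of Option 1, assuming a hypothetical grace period; not existing Streams code.
// When the assignor schedules an immediate followup rebalance, only honor it once the grace
// period has elapsed, giving slower members a chance to finish the current SyncGroup phase.
class FollowupDelaySketch {
    static final long FOLLOWUP_REBALANCE_DELAY_MS = 10_000L; // assumed value, would be tunable

    static boolean shouldTriggerFollowup(final long requestedAtMs,
                                         final long nowMs,
                                         final boolean rebalanceInProgress) {
        return requestedAtMs != Long.MAX_VALUE
            && nowMs >= requestedAtMs + FOLLOWUP_REBALANCE_DELAY_MS
            && !rebalanceInProgress;
    }
}
{code}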

Option 2: set a shorter upper bound on the maximum time a StreamThread can 
spend processing records while there is an active rebalance.
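
Again just a sketch, with made-up names and a made-up bound:

{code:java}
// Minimal sketch of Option 2; the names and the 1-second bound are assumptions, not existing code.
// While a rebalance is in flight, shrink the budget the thread may spend in the record-processing
// step before it has to return to poll().
class ProcessingBudgetSketch {
    static final long REBALANCING_PROCESSING_BUDGET_MS = 1_000L; // assumed tight bound

    static long processingBudgetMs(final long normalBudgetMs, final boolean rebalanceInProgress) {
        return rebalanceInProgress
            ? Math.min(normalBudgetMs, REBALANCING_PROCESSING_BUDGET_MS)
            : normalBudgetMs;
    }
}
{code}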

Option 3: use a slightly larger duration for the poll timeout when a rebalance 
is in progress. (I don't love this: the timeout would need to be relatively 
large to actually help, it wouldn't help at all if it isn't long enough for the 
consumer to send its SyncGroup request and receive the SyncGroup response 
within the same poll, and a larger poll time will slow down processing during a 
rebalance.)
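
For completeness, the shape of this would be something like the following, 
where the 500 ms value is purely an assumption:

{code:java}
import java.time.Duration;

// Minimal sketch of Option 3; not existing Streams code. A longer poll timeout during a
// rebalance gives the consumer a chance to both send its SyncGroup request and read the
// response within a single poll call.
class PollTimeoutSketch {
    static Duration pollTimeout(final Duration normalPollTime, final boolean rebalanceInProgress) {
        return rebalanceInProgress ? Duration.ofMillis(500) : normalPollTime; // 500 ms is assumed
    }
}
{code}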

Bonus Option 4 (workaround): decrease the max.poll.records and/or the 
max.poll.interval.ms – the less B has to do in between polls, the more likely 
it is to return to poll in time to finish the rebalance. 
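
For example, something in this direction (the concrete values below are 
illustrations of the direction of the change only, not recommendations):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

// Workaround sketch for Option 4: give the consumers less work to do between polls.
// The config quoted below shows max.poll.records=1000 and max.poll.interval.ms=300000;
// the values here are illustrative assumptions only.
class WorkaroundConfigSketch {
    static Properties withTighterPollBudget(final Properties streamsProps) {
        streamsProps.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 200);
        streamsProps.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 120_000);
        return streamsProps;
    }
}
{code}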

Personally, I think the combination of options 1 & 2 is the best choice, since 
they are additive with each other, but I'm interested in what others have to 
say. Thoughts anyone?

> Same message consumed again by the same stream task after partition is lost 
> and reassigned
> ------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-14419
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14419
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.3.1
>         Environment: AWS EC2 CentOS Linux 3.10.0-1160.76.1.el7.x86_64
>            Reporter: Mikael
>            Priority: Major
>
> Trigger scenario:
> Four Kafka client application instances on separate EC2 instances with a 
> total of 8 active and 8 standby stream tasks for the same stream topology, 
> consuming from an input topic with 8 partitions. Sometimes a handful of 
> messages are consumed twice by one of the stream tasks when stream tasks on 
> another application instance join the consumer group after an application 
> instance restart.
> Additional information:
> Messages are produced to the topic by another Kafka streams topology deployed 
> on the same four application instances. I have verified that each message is 
> only produced once by enabling debug logging in the topology flow right 
> before producing each message to the topic.
> Logs from stream thread with duplicate consumption:
>  
> {code:java}
> 2022-11-21 15:09:33,677 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:1066] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Request joining group due to: group is 
> already rebalancing
> 2022-11-21 15:09:33,677 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:566] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] (Re-)joining group
> Input records consumed for the first time
> 2022-11-21 15:09:33,919 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:627] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Successfully joined group with 
> generation Generation{generationId=8017, 
> memberId='messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer-77a68a5a-fb15-4808-9d87-30f21eabea74',
>  protocol='stream'}
> 2022-11-21 15:09:33,920 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:826] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] SyncGroup failed: The group began 
> another rebalance. Need to re-join the group. Sent generation was 
> Generation{generationId=8017, 
> memberId='messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer-77a68a5a-fb15-4808-9d87-30f21eabea74',
>  protocol='stream'}
> 2022-11-21 15:09:33,922 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:1019] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Resetting generation due to: 
> encountered REBALANCE_IN_PROGRESS from SYNC_GROUP response
> 2022-11-21 15:09:33,922 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:1066] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Request joining group due to: 
> encountered REBALANCE_IN_PROGRESS from SYNC_GROUP response
> 2022-11-21 15:09:33,923 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [ConsumerCoordinator.java:819] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Giving away all assigned partitions as 
> lost since generation/memberID has been reset,indicating that consumer is in 
> old state or no longer part of the group
> 2022-11-21 15:09:33,923 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [ConsumerCoordinator.java:354] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Lost previously assigned partitions 
> messages.xms.mt.batch.enqueue.sms-1
> 2022-11-21 15:09:33,923 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamThread [StreamsRebalanceListener.java:104] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  at state RUNNING: partitions [messages.xms.mt.batch.enqueue.sms-1] lost due 
> to missed rebalance.
>         lost active tasks: [0_1]
>         lost assigned standby tasks: []
> 2022-11-21 15:09:33,941 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamTask [StreamTask.java:1220] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  task [0_1] Suspended RUNNING
> 2022-11-21 15:09:33,941 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamTask [StreamTask.java:295] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  task [0_1] Suspended running
> 2022-11-21 15:09:33,941 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.KafkaConsumer [KafkaConsumer.java:1082] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-restore-consumer,
>  groupId=null] Unsubscribed all topics or patterns and assigned partitions
> 2022-11-21 15:09:33,942 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.RecordCollectorImpl [RecordCollectorImpl.java:333] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  task [0_1] Closing record collector dirty
> 2022-11-21 15:09:33,942 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamTask [StreamTask.java:537] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  task [0_1] Closed dirty
> 2022-11-21 15:09:33,942 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamThread [StreamsRebalanceListener.java:117] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  partitions lost took 19 ms.
> 2022-11-21 15:09:33,942 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:1066] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Request joining group due to: rebalance 
> failed due to 'The group is rebalancing, so a rejoin is needed.' 
> (RebalanceInProgressException)
> 2022-11-21 15:09:33,942 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:566] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] (Re-)joining group
> 2022-11-21 15:09:35,391 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:627] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Successfully joined group with 
> generation Generation{generationId=8018, 
> memberId='messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer-77a68a5a-fb15-4808-9d87-30f21eabea74',
>  protocol='stream'}
> 2022-11-21 15:09:35,395 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [AbstractCoordinator.java:802] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Successfully synced group in generation 
> Generation{generationId=8018, 
> memberId='messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer-77a68a5a-fb15-4808-9d87-30f21eabea74',
>  protocol='stream'}
> 2022-11-21 15:09:35,396 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [ConsumerCoordinator.java:428] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Updating assignment with
>         Assigned partitions:                       
> [messages.xms.mt.batch.enqueue.sms-1]
>         Current owned partitions:                  []
>         Added partitions (assigned - owned):       
> [messages.xms.mt.batch.enqueue.sms-1]
>         Revoked partitions (owned - assigned):     []
> 2022-11-21 15:09:35,396 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [ConsumerCoordinator.java:300] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Notifying assignor about the new 
> Assignment(partitions=[messages.xms.mt.batch.enqueue.sms-1], userDataSize=52)
> 2022-11-21 15:09:35,396 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamsPartitionAssignor [StreamsPartitionAssignor.java:1361] 
> stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer]
>  No followup rebalance was requested, resetting the rebalance schedule.
> 2022-11-21 15:09:35,396 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.TaskManager [TaskManager.java:273] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  Handle new assignment with:
>         New active tasks: [0_1]
>         New standby tasks: []
>         Existing active tasks: []
>         Existing standby tasks: []
> 2022-11-21 15:09:35,396 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [ConsumerCoordinator.java:312] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Adding newly assigned partitions: 
> messages.xms.mt.batch.enqueue.sms-1
> 2022-11-21 15:09:35,396 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamThread [StreamThread.java:234] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  State transition from RUNNING to PARTITIONS_ASSIGNED
> 2022-11-21 15:09:35,398 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.i.ConsumerCoordinator [ConsumerCoordinator.java:968] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Setting offset for partition 
> messages.xms.mt.batch.enqueue.sms-1 to the committed offset 
> FetchPosition{offset=26744389, offsetEpoch=Optional.empty, 
> currentLeader=LeaderAndEpoch{leader=Optional[b-1.us1tst-eventbus.ujvxjq.c5.kafka.us-east-1.amazonaws.com:9094
>  (id: 1 rack: use1-az6)], epoch=19}}
> 2022-11-21 15:09:35,444 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamTask [StreamTask.java:235] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  task [0_1] Initialized
> 2022-11-21 15:09:35,445 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamTask [StreamTask.java:260] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  task [0_1] Restored and ready to run
> 2022-11-21 15:09:35,445 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamThread [StreamThread.java:866] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  Restoration took 49 ms for all tasks [0_1]
> 2022-11-21 15:09:35,445 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.p.i.StreamThread [StreamThread.java:234] stream-thread 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  State transition from PARTITIONS_ASSIGNED to RUNNING
> 2022-11-21 15:09:35,446 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.s.KafkaStreams [KafkaStreams.java:342] stream-client 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b] State 
> transition from REBALANCING to RUNNING
> 2022-11-21 15:09:35,446 INFO 
> [messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1]
>  o.a.k.c.c.KafkaConsumer [KafkaConsumer.java:2270] [Consumer 
> clientId=messages.xms.mt.enqueue.sms-acde244d-99b7-4237-83a8-ebce274fb77b-StreamThread-1-consumer,
>  groupId=messages.xms.mt.enqueue.sms] Requesting the log end offset for 
> messages.xms.mt.batch.enqueue.sms-1 in order to compute lag
> Same input records consumed for the second time{code}
> Streams consumer configuration:
> {noformat}
>         allow.auto.create.topics = false
>         auto.commit.interval.ms = 5000
>         auto.offset.reset = earliest
>         bootstrap.servers = 
> [b-3.us1tst-eventbus.ujvxjq.c5.kafka.us-east-1.amazonaws.com:9094, 
> b-1.us1tst-eventbus.ujvxjq.c5.kafka.us-east-1.amazonaws.com:9094, 
> b-2.us1tst-eventbus.ujvxjq.c5.kafka.us-east-1.amazonaws.com:9094]
>         check.crcs = true
>         client.dns.lookup = use_all_dns_ips
>         client.id = 
> messages.xms.mms.mt-05bfc9d3-7f4b-48d4-9c8c-cf9d3e496fef-StreamThread-1-consumer
>         client.rack = 
>         connections.max.idle.ms = 540000
>         default.api.timeout.ms = 60000
>         enable.auto.commit = false
>         exclude.internal.topics = true
>         fetch.max.bytes = 52428800
>         fetch.max.wait.ms = 500
>         fetch.min.bytes = 1
>         group.id = messages.xms.mms.mt
>         group.instance.id = null
>         heartbeat.interval.ms = 1500
>         interceptor.classes = []
>         internal.leave.group.on.close = true
>         internal.throw.on.fetch.stable.offset.unsupported = false
>         isolation.level = read_committed
>         key.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
>         max.partition.fetch.bytes = 1048576
>         max.poll.interval.ms = 300000
>         max.poll.records = 1000
>         metadata.max.age.ms = 300000
>         metric.reporters = []
>         metrics.num.samples = 2
>         metrics.recording.level = INFO
>         metrics.sample.window.ms = 30000
>         partition.assignment.strategy = 
> [org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor]
>         receive.buffer.bytes = 65536
>         reconnect.backoff.max.ms = 1000
>         reconnect.backoff.ms = 50
>         request.timeout.ms = 30000
>         retry.backoff.ms = 100
>         sasl.client.callback.handler.class = null
>         sasl.jaas.config = null
>         sasl.kerberos.kinit.cmd = /usr/bin/kinit
>         sasl.kerberos.min.time.before.relogin = 60000
>         sasl.kerberos.service.name = null
>         sasl.kerberos.ticket.renew.jitter = 0.05
>         sasl.kerberos.ticket.renew.window.factor = 0.8
>         sasl.login.callback.handler.class = null
>         sasl.login.class = null
>         sasl.login.connect.timeout.ms = null
>         sasl.login.read.timeout.ms = null
>         sasl.login.refresh.buffer.seconds = 300
>         sasl.login.refresh.min.period.seconds = 60
>         sasl.login.refresh.window.factor = 0.8
>         sasl.login.refresh.window.jitter = 0.05
>         sasl.login.retry.backoff.max.ms = 10000
>         sasl.login.retry.backoff.ms = 100
>         sasl.mechanism = GSSAPI
>         sasl.oauthbearer.clock.skew.seconds = 30
>         sasl.oauthbearer.expected.audience = null
>         sasl.oauthbearer.expected.issuer = null
>         sasl.oauthbearer.jwks.endpoint.refresh.ms = 3600000
>         sasl.oauthbearer.jwks.endpoint.retry.backoff.max.ms = 10000
>         sasl.oauthbearer.jwks.endpoint.retry.backoff.ms = 100
>         sasl.oauthbearer.jwks.endpoint.url = null
>         sasl.oauthbearer.scope.claim.name = scope
>         sasl.oauthbearer.sub.claim.name = sub
>         sasl.oauthbearer.token.endpoint.url = null
>         security.protocol = SSL
>         security.providers = null
>         send.buffer.bytes = 131072
>         session.timeout.ms = 6000
>         socket.connection.setup.timeout.max.ms = 30000
>         socket.connection.setup.timeout.ms = 10000
>         ssl.cipher.suites = null
>         ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
>         ssl.endpoint.identification.algorithm = https
>         ssl.engine.factory.class = null
>         ssl.key.password = null
>         ssl.keymanager.algorithm = SunX509
>         ssl.keystore.certificate.chain = null
>         ssl.keystore.key = null
>         ssl.keystore.location = /opt/apps/msl/xms-gateway/conf/xms.us1tst.jks
>         ssl.keystore.password = [hidden]
>         ssl.keystore.type = JKS
>         ssl.protocol = TLSv1.3
>         ssl.provider = null
>         ssl.secure.random.implementation = null
>         ssl.trustmanager.algorithm = PKIX
>         ssl.truststore.certificates = null
>         ssl.truststore.location = null
>         ssl.truststore.password = null
>         ssl.truststore.type = JKS
>         value.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer{noformat}
>  
> The message about lost partition that is highlighted in red above only occurs 
> when messages are consumed twice, which happens roughly two times out of ten 
> in my application restart test scenario.
> This issue no longer occurs when the patch suggested in KAFKA-14362 is 
> applied.


