[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989287#comment-16989287 ]

Michael Jaschob commented on KAFKA-9212:
----------------------------------------

Chiming in here, I believe we've experienced the same error. I've been able to 
reproduce the behavior quite simply, as follows:
 - 3-broker cluster (running Apache Kafka 2.3.1)
 - one partition with replica assignment (0, 1, 2)
 - booted fourth broker (id 3)
 - initiated partition reassignment from (0, 1, 2) to (0, 1, 2, 3) with a very 
low throttle for testing (see the command sketch after this list)
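
For reference, here is roughly how the reassignment was initiated; the ZooKeeper 
address and the throttle value below are illustrative placeholders rather than the 
exact values from our cluster (Kafka 2.3 still drives this tool through ZooKeeper):
{code:bash}
# Move partition 0 of the test topic from (0, 1, 2) to (0, 1, 2, 3).
cat > reassignment.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"DataPlatform.CGSynthTests","partition":0,"replicas":[0,1,2,3]}
]}
EOF

# Execute the reassignment with a deliberately low replication throttle
# (bytes/sec) so the move stays in progress long enough to observe the hang.
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassignment.json \
  --execute \
  --throttle 1000
{code}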

As soon as the reassignment begins, a 2.3.0 console consumer simply hangs when 
started. A 1.1.1 consumer does not have any issues. I see this in the leader 
broker's request log:
{code:java}
[2019-12-05 16:38:36,790] DEBUG Completed 
request:RequestHeader(apiKey=LIST_OFFSETS, apiVersion=5, clientId=consumer-1, 
correlationId=1529) -- 
{replica_id=-1,isolation_level=0,topics=[{topic=DataPlatform.CGSynthTests,partitions=[{partition=0,current_leader_epoch=0,timestamp=-1}]}]},response:{throttle_time_ms=0,responses=[{topic=DataPlatform.CGSynthTests,partition_responses=[{partition=0,error_code=74,timestamp=-1,offset=-1,leader_epoch=-1}]}]}
 from connection 
172.22.15.67:9092-172.22.23.98:46974-9;totalTime:0.27,requestQueueTime:0.044,localTime:0.185,remoteTime:0.0,throttleTime:0.036,responseQueueTime:0.022,sendTime:0.025,securityProtocol:PLAINTEXT,principal:User:data-pipeline-monitor,listener:PLAINTEXT
 (kafka.request.logger)
{code}
Note the FENCED_LEADER_EPOCH error code (74) in the LIST_OFFSETS response, as in 
the original report.
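
A quick way to gauge how many LIST_OFFSETS requests are being rejected during the 
reassignment, assuming request logging is enabled at DEBUG as above (the log file 
path below is just an assumption for illustration):
{code:bash}
# Count LIST_OFFSETS responses carrying error_code=74 (FENCED_LEADER_EPOCH)
# in the broker's request log; adjust the path to your log directory.
grep 'apiKey=LIST_OFFSETS' /path/to/kafka/logs/kafka-request.log | grep -c 'error_code=74'
{code}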

Once the reassignment completes, the 2.3.1 console consumer starts working. 
I've also tried a different reassignment (0, 1, 2) -> (3, 1, 2) with the same 
results.
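
For completeness, completion was checked (and the throttle cleared) with the 
verify option, assuming the same reassignment.json as in the sketch above:
{code:bash}
# --verify reports per-partition status and removes the replication throttle
# once the reassignment has finished.
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassignment.json \
  --verify
{code}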

Where we stand right now is that we can't initiate partition reassignments in our 
production cluster without paralyzing a Spark application (which uses 2.3.0 client 
libs under the hood). Downgrading the Kafka client libs there isn't possible, 
since they are part of the Spark assembly.

Any pointers on what the issue might be here? I'm struggling to understand the 
bug, because it seems like any partition reassignment breaks LIST_OFFSETS 
requests from 2.3 clients, yet that seems too severe a problem to have gone 
unnoticed for so long. Even ideas for a workaround would help, since we don't 
see a path to doing partition reassignments without causing a production 
incident right now.

> Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
> ------------------------------------------------------------------
>
>                 Key: KAFKA-9212
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9212
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, offset manager
>    Affects Versions: 2.3.0
>         Environment: Linux
>            Reporter: Yannick
>            Priority: Critical
>
> When running the Kafka Connect S3 sink connector (Confluent 5.3.0), after one 
> broker was restarted (the leaderEpoch was updated at this point), the Connect 
> worker crashed with the following error: 
> [2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, 
> groupId=connect-ls] Uncaught exception in herder work thread, exiting: 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
>  org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by 
> times in 30003ms
>  
> After investigation, it seems the consumer got fenced while sending 
> ListOffsetRequest in a loop and eventually timed out, as follows :
> [2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
> groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
> replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, 
> maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
> isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
> rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)
> [2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
> groupId=connect-ls] Attempt to fetch offsets for partition 
> connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. 
> (org.apache.kafka.clients.consumer.internals.Fetcher:985)
>  
> The above happens multiple times until timeout.
>  
> According to the debug logs, the consumer always gets a leaderEpoch of 1 for 
> this topic when starting up :
>  
>  [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
> groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
> connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
>   
>   
>  But according to our broker logs, the leaderEpoch should be 2, as follows :
>   
>  [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
> connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
> Epoch was: 1 (kafka.cluster.Partition)
>   
>   
>  This makes it impossible to restart the worker, as it will always get fenced 
> and then finally time out.
>   
>  It is also impossible to consume with a 2.3 kafka-console-consumer, as 
> follows :
>   
>  kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
> connect_ls_config --from-beginning 
>   
>  The above just hangs forever (which is not expected, because there is data), 
> and we can see these debug messages :
> [2019-11-19 22:17:59,124] DEBUG [Consumer clientId=consumer-1, 
> groupId=console-consumer-3844] Attempt to fetch offsets for partition 
> connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
>   
>   
>  Interestingly, if we subscribe the same way with kafkacat (1.5.0), we can 
> consume without any problem (it must be that kafkacat ignores 
> FENCED_LEADER_EPOCH while consuming):
>   
>  kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
>   
>   


