Jason Gustafson created KAFKA-9840:
--------------------------------------

             Summary: Consumer should not use OffsetForLeaderEpoch without current epoch validation
                 Key: KAFKA-9840
                 URL: https://issues.apache.org/jira/browse/KAFKA-9840
             Project: Kafka
          Issue Type: Bug
          Components: consumer
            Reporter: Jason Gustafson


We have observed a case where the consumer attempted to detect truncation with 
the OffsetsForLeaderEpoch API against a broker which had become a zombie. In 
this case, the last epoch known to the consumer was higher than the last epoch 
known to the zombie broker, so the broker returned -1 as the offset and epoch 
in the response. The consumer did not check for this in the response, which 
resulted in the following message:

{code}
Truncation detected for partition topic-1 at offset FetchPosition{offset=11859, offsetEpoch=Optional[46], currentLeader=LeaderAndEpoch{leader=broker-host (id: 3 rack: null), epoch=-1}}, resetting offset to the first offset known to diverge FetchPosition{offset=-1, offsetEpoch=Optional[-1], currentLeader=LeaderAndEpoch{broker-host (id: 3 rack: null), epoch=-1}} (org.apache.kafka.clients.consumer.internals.SubscriptionState:414)
{code}
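The -1 values above come from how the broker answers the request. Below is a simplified, hypothetical model, not Kafka's broker code. It assumes the broker keeps a map from each leader epoch it knows to that epoch's start offset. The class, the method names, and the sample epochs are illustrative. When the requested epoch is newer than anything the broker knows, as with the zombie here, the lookup yields an undefined (-1, -1) pair.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Simplified, hypothetical model of a broker's leader epoch lookup; not Kafka's broker code.
class EpochLookupSketch {
    static final int UNDEFINED_EPOCH = -1;
    static final long UNDEFINED_OFFSET = -1L;

    // The (epoch, end offset) pair returned for a requested epoch.
    static final class EpochEndOffset {
        final int epoch;
        final long endOffset;

        EpochEndOffset(int epoch, long endOffset) {
            this.epoch = epoch;
            this.endOffset = endOffset;
        }
    }

    // epochStartOffsets: leader epoch -> first offset written in that epoch (assumed representation).
    static EpochEndOffset endOffsetFor(NavigableMap<Integer, Long> epochStartOffsets,
                                       long logEndOffset, int requestedEpoch) {
        if (epochStartOffsets.isEmpty() || requestedEpoch > epochStartOffsets.lastKey()) {
            // The requester knows a newer epoch than this broker (e.g. a zombie): nothing to answer.
            return new EpochEndOffset(UNDEFINED_EPOCH, UNDEFINED_OFFSET);
        }
        if (requestedEpoch == epochStartOffsets.lastKey()) {
            // The requested epoch is still the latest one, so it ends at the log end offset.
            return new EpochEndOffset(requestedEpoch, logEndOffset);
        }
        // requestedEpoch < lastKey, so a higher epoch exists; its start offset ends the requested epoch.
        Map.Entry<Integer, Long> higher = epochStartOffsets.higherEntry(requestedEpoch);
        Map.Entry<Integer, Long> floor = epochStartOffsets.floorEntry(requestedEpoch);
        // Report the largest known epoch <= requestedEpoch (or the requested one if none is known).
        int epoch = floor != null ? floor.getKey() : requestedEpoch;
        return new EpochEndOffset(epoch, higher.getValue());
    }

    public static void main(String[] args) {
        // Hypothetical zombie that only knows epochs 40 and 44; the consumer asks about epoch 46.
        NavigableMap<Integer, Long> cache = new TreeMap<>();
        cache.put(40, 100L);
        cache.put(44, 5000L);
        EpochEndOffset result = endOffsetFor(cache, 9000L, 46);
        System.out.println(result.epoch + " " + result.endOffset); // prints "-1 -1"
    }
}
```

The consumer then used that offset of -1 as the "first offset known to diverge". This is the reset target that appears at the end of the log line.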

There are a couple of ways the consumer can handle this situation better.
First, the reason we did not detect the zombie broker is that we did not
include the current leader epoch in the OffsetForLeaderEpoch request. This was
likely a consequence of KAFKA-9212: following that patch, we do not initialize
the current leader epoch from metadata responses because there are cases in
which we cannot rely on it. But if the client cannot reliably detect zombies,
then the epoch validation is less useful anyway. So the simple solution is to
skip the validation unless we have a reliable current leader epoch.
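Here is a minimal sketch of the proposed rule. It assumes the consumer tracks the epoch of its fetch position and an optional current leader epoch from metadata, with the optional left empty when that epoch is not reliable. All names are hypothetical.

The broker-side method is a simplified model of the current-epoch fencing that the request field enables. As we understand KIP-320, the broker answers FENCED_LEADER_EPOCH when the client's epoch is older than its own and UNKNOWN_LEADER_EPOCH when it is newer. A zombie falls into the second case, which is how it would have been detected had the current epoch been sent.

```java
import java.util.Optional;

// Hypothetical sketch of the rule proposed above; names are illustrative, not Kafka's API.
class EpochValidationSketch {
    enum Decision { VALIDATE, SKIP_VALIDATION }

    enum BrokerError { NONE, FENCED_LEADER_EPOCH, UNKNOWN_LEADER_EPOCH }

    // Consumer side: send OffsetForLeaderEpoch only when both epochs are known.
    static Decision decide(Optional<Integer> positionEpoch, Optional<Integer> currentLeaderEpoch) {
        if (!positionEpoch.isPresent()) {
            // No epoch is recorded for the fetch position, so there is nothing to compare.
            return Decision.SKIP_VALIDATION;
        }
        if (!currentLeaderEpoch.isPresent()) {
            // Without a reliable current leader epoch a zombie broker cannot be fenced,
            // so its answer cannot be trusted; skip validation instead.
            return Decision.SKIP_VALIDATION;
        }
        return Decision.VALIDATE;
    }

    // Broker side (simplified): compare the request's current leader epoch with the broker's own.
    static BrokerError checkCurrentEpoch(int requestCurrentEpoch, int brokerEpoch) {
        if (requestCurrentEpoch < brokerEpoch) {
            return BrokerError.FENCED_LEADER_EPOCH;   // the client's metadata is stale
        }
        if (requestCurrentEpoch > brokerEpoch) {
            return BrokerError.UNKNOWN_LEADER_EPOCH;  // the broker is behind, e.g. a zombie
        }
        return BrokerError.NONE;
    }

    public static void main(String[] args) {
        System.out.println(decide(Optional.of(46), Optional.empty())); // prints "SKIP_VALIDATION"
        System.out.println(checkCurrentEpoch(47, 45));                 // prints "UNKNOWN_LEADER_EPOCH"
    }
}
```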

Second, the consumer needs to check for the case where the returned offset and
epoch are undefined (-1). When that happens, we have to treat it as a normal
OffsetOutOfRange case and invoke the reset policy.
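The second fix amounts to one extra branch ahead of the truncation comparison. In this hypothetical sketch, the consumer compares the end offset from the response with its fetch position. The -1 sentinels, the enum, and the method name are assumptions for illustration. The caller is assumed to apply the configured reset policy (auto.offset.reset) when it sees OFFSET_OUT_OF_RANGE.

```java
// Hypothetical sketch of the proposed consumer-side check; names are illustrative.
class DivergenceCheckSketch {
    static final int UNDEFINED_EPOCH = -1;
    static final long UNDEFINED_OFFSET = -1L;

    enum Outcome { NO_TRUNCATION, TRUNCATED, OFFSET_OUT_OF_RANGE }

    // fetchOffset: the consumer's current position; endOffset/epoch: from the broker's response.
    static Outcome handleEpochEndOffset(long fetchOffset, long endOffset, int epoch) {
        if (endOffset == UNDEFINED_OFFSET || epoch == UNDEFINED_EPOCH) {
            // The broker could not answer for our epoch: handle it like OffsetOutOfRange,
            // i.e. let the caller apply the configured reset policy.
            return Outcome.OFFSET_OUT_OF_RANGE;
        }
        if (endOffset < fetchOffset) {
            // The log ends before our position: truncation, so reposition to endOffset.
            return Outcome.TRUNCATED;
        }
        return Outcome.NO_TRUNCATION;
    }

    public static void main(String[] args) {
        // Values from the log in this issue: position 11859, response offset and epoch -1.
        System.out.println(handleEpochEndOffset(11859L, -1L, -1)); // prints "OFFSET_OUT_OF_RANGE"
    }
}
```

Without the first branch, the comparison -1 < 11859 would report truncation and move the position to -1, which is what the log message shows.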

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
