Lucas Bradstreet created KAFKA-9137:
---------------------------------------

             Summary: Maintenance of FetchSession cache causing FETCH_SESSION_ID_NOT_FOUND in live sessions
                 Key: KAFKA-9137
                 URL: https://issues.apache.org/jira/browse/KAFKA-9137
             Project: Kafka
          Issue Type: Bug
          Components: core
            Reporter: Lucas Bradstreet


We have recently seen cases where brokers end up in a bad state after a rolling 
restart, with fetch session evictions occurring at a high rate (> 16 per 
second). During these periods of elevated eviction, our logs showed the 
following pattern:

 
{noformat}
broker 6: October 31st 2019, 17:52:45.496 Created a new incremental 
FetchContext for session id 2046264334, epoch 9790: added (), updated (), 
removed ()

broker 6: October 31st 2019, 17:52:45.496 Created a new incremental 
FetchContext for session id 2046264334, epoch 9791: added (), updated (), 
removed ()

broker 6: October 31st 2019, 17:52:45.500 Created a new incremental 
FetchContext for session id 2046264334, epoch 9792: added (), updated 
(lkc-7nv6o_tenant_soak_topic_144p-67), removed ()

broker 6: October 31st 2019, 17:52:45.501 Created a new incremental 
FetchContext for session id 2046264334, epoch 9793: added (), updated 
(lkc-7nv6o_tenant_soak_topic_144p-59, lkc-7nv6o_tenant_soak_topic_144p-123, 
lkc-7nv6o_tenant_soak_topic_144p-11, lkc-7nv6o_tenant_soak_topic_144p-3, 
lkc-7nv6o_tenant_soak_topic_144p-67, lkc-7nv6o_tenant_soak_topic_144p-115), 
removed () 

broker 6: October 31st 2019, 17:52:45.501 Evicting stale FetchSession 
2046264334. 

broker 6: October 31st 2019, 17:52:45.502 Session error for 2046264334: no such 
session ID found. 

broker 4: October 31st 2019, 17:52:45.813 [ReplicaFetcher replicaId=4, 
leaderId=6, fetcherId=0] Node 6 was unable to process the fetch request with 
(sessionId=2046264334, epoch=9793): FETCH_SESSION_ID_NOT_FOUND.  
{noformat}
This pattern appears to be problematic for two reasons. First, the replica 
fetcher on broker 4 was clearly able to send multiple incremental fetch 
requests to broker 6 and receive replies, right up to the point where broker 6 
evicted its fetch session. Broker 6 logged this session as stale at 
17:52:45.501, the same millisecond in which it created the FetchContext for 
epoch 9793, so a session that had been used within milliseconds was treated as 
stale. Second, replica fetchers are privileged in the fetch session cache and 
should not be evicted by consumer fetch sessions. This cluster has only 12 
brokers and 1000 fetch session cache slots (the default for 
max.incremental.fetch.session.cache.slots). At most 11 other brokers can be 
replicating from broker 6, so replica fetcher sessions should occupy only a 
small fraction of those slots, and it is thus very unlikely that this session 
was evicted by another replica fetcher session.
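To make the two expectations above concrete, here is a simplified, hypothetical model of the eviction decision. It is a sketch, not the broker's implementation: `Session`, `is_stale`, `can_evict` and `EVICTION_MS` are names invented for illustration, the threshold value is an assumption, and any other tie-breaking rules the real cache may apply are ignored. Under this model, a session used within the same millisecond can never be evicted as stale, and a live privileged (replica fetcher) session cannot be displaced by a consumer session.

```python
from dataclasses import dataclass

# Hypothetical staleness threshold; an assumed value, not taken from Kafka.
EVICTION_MS = 120_000


@dataclass
class Session:
    session_id: int
    privileged: bool   # True for replica fetcher (follower) sessions
    last_used_ms: int  # time of the most recent fetch on this session


def is_stale(session: Session, now_ms: int) -> bool:
    """A session counts as stale only if it has been idle longer than EVICTION_MS."""
    return now_ms - session.last_used_ms > EVICTION_MS


def can_evict(existing: Session, new_is_privileged: bool, now_ms: int) -> bool:
    """Decide whether a new session may take the slot held by `existing`."""
    if is_stale(existing, now_ms):
        return True
    # A live privileged session is never displaced; a live consumer
    # session may be displaced only by a new privileged session.
    return new_is_privileged and not existing.privileged


# The replica fetcher session from the log, last used 1 ms before the check.
replica = Session(2046264334, privileged=True, last_used_ms=1_000_000)
print(is_stale(replica, 1_000_001))                          # False
print(can_evict(replica, new_is_privileged=False, now_ms=1_000_001))  # False
```

In this model, the "Evicting stale FetchSession 2046264334" line above should not be possible, which is why the eviction looks like a bug in how the cache tracks session usage rather than ordinary cache pressure.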



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
