Lucas Bradstreet created KAFKA-9137:
---------------------------------------
Summary: Maintenance of FetchSession cache causing
FETCH_SESSION_ID_NOT_FOUND in live sessions
Key: KAFKA-9137
URL: https://issues.apache.org/jira/browse/KAFKA-9137
Project: Kafka
Issue Type: Bug
Components: core
Reporter: Lucas Bradstreet
We have recently seen cases where brokers end up in a bad state where fetch
session evictions occur at a high rate (> 16 per second) after a roll. This
increase in eviction rate included the following pattern in our logs:
{noformat}
broker 6: October 31st 2019, 17:52:45.496 Created a new incremental
FetchContext for session id 2046264334, epoch 9790: added (), updated (),
removed ()
broker 6: October 31st 2019, 17:52:45.496 Created a new incremental
FetchContext for session id 2046264334, epoch 9791: added (), updated (),
removed ()
broker 6: October 31st 2019, 17:52:45.500 Created a new incremental
FetchContext for session id 2046264334, epoch 9792: added (), updated
(lkc-7nv6o_tenant_soak_topic_144p-67), removed ()
broker 6: October 31st 2019, 17:52:45.501 Created a new incremental
FetchContext for session id 2046264334, epoch 9793: added (), updated
(lkc-7nv6o_tenant_soak_topic_144p-59, lkc-7nv6o_tenant_soak_topic_144p-123,
lkc-7nv6o_tenant_soak_topic_144p-11, lkc-7nv6o_tenant_soak_topic_144p-3,
lkc-7nv6o_tenant_soak_topic_144p-67, lkc-7nv6o_tenant_soak_topic_144p-115),
removed ()
broker 6: October 31st 2019, 17:52:45.501 Evicting stale FetchSession
2046264334.
broker 6: October 31st 2019, 17:52:45.502 Session error for 2046264334: no such
session ID found.
broker 4: October 31st 2019, 17:52:45.813 [ReplicaFetcher replicaId=4,
leaderId=6, fetcherId=0] Node 6 was unable to process the fetch request with
(sessionId=2046264334, epoch=9793): FETCH_SESSION_ID_NOT_FOUND.
{noformat}
This pattern appears to be problematic for two reasons. Firstly, the replica
fetcher for broker 4 was clearly able to send multiple incremental fetch
requests to broker 6 and receive replies, right up to the point where broker 6
evicted its fetch session as stale, within milliseconds of those requests.
Secondly, replica fetchers are considered privileged for the fetch session
cache and should not be evicted by consumer fetch sessions. This cluster only
has 12 brokers and 1000 fetch session cache slots (the default for
max.incremental.fetch.session.cache.slots), and it is thus very unlikely that
this session should have been evicted by another replica fetcher session.
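
As a rough sanity check on the slot arithmetic, the sketch below estimates an
upper bound on the replica fetcher sessions a single leader could hold. It
assumes each follower broker opens one fetch session per fetcher thread, and
that num.replica.fetchers is at its default of 1; the report does not state
that value, so treat it as an assumption. The function name is hypothetical.

```python
# Rough upper bound on replica fetcher sessions held by one leader broker.
# Assumption (not stated in the report): each follower opens one fetch
# session per fetcher thread, and num.replica.fetchers is at its default of 1.

def max_replica_fetcher_sessions(broker_count: int, fetchers_per_broker: int = 1) -> int:
    """Every other broker may follow this one, with one session per fetcher thread."""
    return (broker_count - 1) * fetchers_per_broker

CACHE_SLOTS = 1000  # max.incremental.fetch.session.cache.slots, as reported
BROKERS = 12        # cluster size, as reported

sessions = max_replica_fetcher_sessions(BROKERS)
print(sessions, f"{sessions / CACHE_SLOTS:.1%} of slots")  # 11 1.1% of slots
```

Under these assumptions, replica fetchers alone would occupy about 1% of the
cache, so contention between privileged sessions seems an unlikely cause.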
--
This message was sent by Atlassian Jira
(v8.3.4#803005)