Hi Harsha,

Is the problem you'd like addressed the following?
Assume 3 replicas: L, F1, and F2.

1. F1 and F2 are alive and sending fetch requests to L.
2. L starts encountering disk issues, and any requests being processed by the request handler threads become blocked.
3. L's ZooKeeper connection is still alive, so it remains the leader for the partition.
4. Because F1 and F2 have not fetched successfully, L shrinks the ISR to itself.

While KIP-501 may help prevent a shrink in partitions where a replica fetch request has already started processing, any fetch requests still sitting in the request queue will have no effect. Generally, when these slow/failing disk issues occur, all of the request handler threads end up blocked and requests pile up in the request queue. For example, all of the request handler threads may end up stuck in KafkaApis.handleProduceRequest handling produce requests, at which point all of the replica fetcher fetch requests remain queued in the request queue. If this happens, there will be no tracked fetch requests to prevent a shrink (sketched in the P.S. below).

Solving this shrinking issue is tricky. Rather than avoiding a shrink, it would be better for L to resign leadership when it enters a degraded state. If L is no longer the leader in this situation, it will eventually become blocked fetching from the new leader, and the new leader will shrink the ISR, kicking L out.

Cheers,
Lucas
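
P.S. To make the failure mode concrete, here is a minimal sketch of the shrink decision as I understand it. It is not the actual Partition/ReplicaManager code; the names, the FollowerState class, and the 30s lag value are assumptions for illustration. The point is that the leader only advances a follower's caught-up timestamp when it actually processes a fetch, so a fetch request stuck in the request queue looks the same to the leader as a follower that has stopped fetching.

// Simplified sketch only -- not the actual Kafka code. Names and the 30s
// lag threshold are assumptions for illustration.
object IsrShrinkSketch {
  // Hypothetical per-follower state tracked by the leader.
  final case class FollowerState(replicaId: Int, lastCaughtUpTimeMs: Long)

  // Plays the role of replica.lag.time.max.ms (value assumed).
  val replicaLagTimeMaxMs: Long = 30000L

  // A follower counts as out of sync if it has not caught up recently.
  // This timestamp only moves forward when a fetch is actually processed,
  // never while the fetch sits in the request queue.
  def isOutOfSync(f: FollowerState, nowMs: Long): Boolean =
    nowMs - f.lastCaughtUpTimeMs > replicaLagTimeMaxMs

  // Shrink decision: keep the leader plus every in-sync follower.
  def shrinkIsr(leaderId: Int, followers: Seq[FollowerState], nowMs: Long): Set[Int] =
    Set(leaderId) ++ followers.filterNot(isOutOfSync(_, nowMs)).map(_.replicaId)

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    // F1 and F2 last caught up 60s ago. Their newest fetch requests are
    // queued behind blocked request handler threads, so from the leader's
    // point of view they are indistinguishable from dead followers.
    val followers = Seq(FollowerState(1, now - 60000L), FollowerState(2, now - 60000L))
    println(shrinkIsr(0, followers, now)) // prints Set(0): ISR shrinks to L alone
  }
}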