Anna Povzner created KAFKA-7151:
Summary: Broker running out of disk space may result in state
where unclean leader election is required
Issue Type: Bug
Reporter: Anna Povzner
We have seen situations like the following:
1) Broker A is the leader for a topic partition, and brokers B and C are the
followers.
2) Broker A is running out of disk space, shrinks the ISR to only itself, and
then sometime later gets disk errors, etc.
3) Broker A is stopped, disk space is reclaimed, and broker A is restarted
Result: Broker A becomes the leader, but the followers cannot fetch because
their logs are ahead of the leader's log. The only way to continue is to enable
unclean leader election.
There are several issues here:
-- If the machine is running out of disk space, we do not reliably get an error
from the file system as soon as that happens. The broker could be in a state
where some writes succeed (possibly because the write is not yet flushed to
disk) and some writes fail, or maybe fail later. Followers may therefore fetch
records that are still only in the leader's file system cache; if the flush to
disk then fails on the leader, the followers end up ahead of the leader.
-- I am not sure exactly why, but it seems like the leader broker (the one
running out of disk space) may also stop servicing fetch requests, causing
followers to fall behind and be kicked out of the ISR.
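To illustrate the first point, here is a minimal Java sketch of the gap between
a write being accepted and a flush reaching the disk. It is not Kafka's log
code; the class name WriteThenFlush and the method appendAndFlush are
hypothetical and exist only for this illustration. The assumption is that a
failure such as a full disk may surface only at force() time on some file
systems, after readers have already been able to see the written bytes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteThenFlush {

    /**
     * Appends a record and then flushes it to the storage device.
     * Returns the file size observed after the write but before the flush,
     * i.e. the size a concurrent reader could already see.
     */
    static long appendAndFlush(Path file, byte[] record) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record);
            while (buf.hasRemaining()) {
                ch.write(buf); // may succeed while the data is only in the OS cache
            }
            long visibleSize = ch.size(); // readers can already observe these bytes
            ch.force(true); // a disk error may be reported only here, as an IOException
            return visibleSize;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        long size = appendAndFlush(file, "record-1".getBytes(StandardCharsets.UTF_8));
        System.out.println("bytes visible before flush: " + size);
        Files.deleteIfExists(file);
    }
}
```

If force() throws after a follower has already read the new bytes, the follower
holds records that the leader cannot guarantee are durable, which matches the
divergence described above.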
Ideally, the broker should stop being a leader for any topic partition before
accepting any records that may fail to be flushed to disk. One option is to
automatically detect disk space usage and make a broker read-only for its topic
partitions once usage reaches a threshold, such as 80%. Maybe there is a better
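The disk usage check suggested above could be sketched roughly as follows. This
is only an illustration under stated assumptions: the class DiskUsageCheck, the
method names, and the 0.80 threshold are hypothetical, and nothing here is an
existing Kafka configuration or API. It uses only the standard
java.nio.file.FileStore calls to measure usage of the volume holding a log
directory.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskUsageCheck {

    // Hypothetical threshold, taken from the "80%" suggestion in the report.
    static final double READ_ONLY_THRESHOLD = 0.80;

    /** Fraction of the file store holding logDir that is in use, from 0.0 to 1.0. */
    static double usedFraction(Path logDir) throws IOException {
        FileStore store = Files.getFileStore(logDir);
        long total = store.getTotalSpace();
        if (total <= 0) {
            return 0.0; // some file systems report no size; treat as unknown
        }
        long usable = store.getUsableSpace();
        return 1.0 - (double) usable / total;
    }

    /** True when usage has reached the threshold and writes should be refused. */
    static boolean shouldBecomeReadOnly(double usedFraction, double threshold) {
        return usedFraction >= threshold;
    }

    public static void main(String[] args) throws IOException {
        Path logDir = Paths.get(args.length > 0 ? args[0] : ".");
        double used = usedFraction(logDir);
        System.out.printf("used=%.1f%% readOnly=%b%n",
                used * 100, shouldBecomeReadOnly(used, READ_ONLY_THRESHOLD));
    }
}
```

A broker would need to run such a check periodically and act on it before the
volume actually fills, since the report notes that file system errors are not
reliably raised at the moment space runs out.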