Anna Povzner created KAFKA-8526:
-----------------------------------

             Summary: Broker may select a failed dir for new replica even in 
the presence of other live dirs
                 Key: KAFKA-8526
                 URL: https://issues.apache.org/jira/browse/KAFKA-8526
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.2.1, 2.1.1, 2.0.1, 1.1.1, 2.3.0
            Reporter: Anna Povzner


Suppose a broker is configured with multiple log dirs. One of the log dirs 
fails, but there is no load on that dir, so the broker does not know about the 
failure yet, _i.e._, the failed dir is still in LogManager#_liveLogDirs. 
Suppose a new topic gets created, and the controller chooses the broker with 
the failed log dir to host one of the replicas. The broker gets a LeaderAndIsr 
request with the isNew flag set. LogManager#getOrCreateLog() selects a log dir 
for the new replica from _liveLogDirs, then one of two things can happen:
1) getAbsolutePath can fail, in which case getOrCreateLog will throw an 
IOException
2) Creating the directory for the new replica log may fail (_e.g._, if the 
directory becomes read-only, so getAbsolutePath worked). 

In both cases, the selected dir will be marked offline (which is correct). 
However, LeaderAndIsr will return an error and the replica will be marked offline, 
even though the broker may have other live dirs. 

*Proposed solution*: The broker should retry selecting a dir for the new 
replica if the initially selected dir threw an IOException when trying to 
create a directory for the new replica. We should be able to do that in the 
LogManager#getOrCreateLog() method, but keep in mind that 
logDirFailureChannel.maybeAddOfflineLogDir does not synchronously remove the 
dir from _liveLogDirs. So, it makes sense to select the initial dir by calling 
LogManager#nextLogDir (current implementation), but if we fail to create the 
log on that dir, one approach is to select the next dir from _liveLogDirs in 
round-robin fashion (until we get back to the initial log dir, which is the 
case where all dirs failed).
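
A minimal sketch of the round-robin retry described above, written in Java 
for illustration only (LogManager itself is Scala). The names 
LogDirRetrySketch, DirCreator and createWithRetry are hypothetical and do not 
exist in Kafka; the list argument stands in for _liveLogDirs and initialIndex 
for the position that LogManager#nextLogDir would have picked. Because the 
failed dir is not removed from the list synchronously, the loop tracks its 
own position instead of re-reading the live set.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical illustration; not part of the Kafka code base.
final class LogDirRetrySketch {

    /** Stand-in for whatever creates the replica's directory inside a log dir. */
    interface DirCreator {
        void create(String logDir, String replicaName) throws IOException;
    }

    /**
     * Tries the initially selected dir first, then every other dir in
     * round-robin order. Stops after one full pass, i.e. once the loop
     * would return to the initial dir, which means all dirs failed.
     */
    static String createWithRetry(List<String> liveLogDirs,
                                  int initialIndex,
                                  String replicaName,
                                  DirCreator creator) throws IOException {
        IOException lastError = null;
        int n = liveLogDirs.size();
        for (int i = 0; i < n; i++) {
            String dir = liveLogDirs.get((initialIndex + i) % n);
            try {
                creator.create(dir, replicaName);
                return dir; // first dir that accepts the new replica wins
            } catch (IOException e) {
                // A real broker would also report the dir as offline here
                // (assumption: via logDirFailureChannel.maybeAddOfflineLogDir).
                lastError = e;
            }
        }
        throw new IOException("All log dirs failed for " + replicaName, lastError);
    }
}
```

With dirs [/d0, /d1, /d2], an initial index of 1 and a failing /d1, this 
sketch would place the replica on /d2; only if /d2 and /d0 also fail does the 
caller see an IOException.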



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
