[
https://issues.apache.org/jira/browse/KAFKA-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Gustafson resolved KAFKA-8526.
------------------------------------
Resolution: Fixed
Fix Version/s: 2.4.0
> Broker may select a failed dir for new replica even in the presence of other
> live dirs
> --------------------------------------------------------------------------------------
>
> Key: KAFKA-8526
> URL: https://issues.apache.org/jira/browse/KAFKA-8526
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1
> Reporter: Anna Povzner
> Assignee: Igor Soarez
> Priority: Major
> Fix For: 2.4.0
>
>
> Suppose a broker is configured with multiple log dirs. One of the log dirs
> fails, but there is no load on that dir, so the broker does not know about
> the failure yet, _i.e._, the failed dir is still in LogManager#_liveLogDirs.
> Suppose a new topic gets created, and the controller chooses the broker with
> the failed log dir to host one of the replicas. The broker gets a LeaderAndIsr
> request with the isNew flag set. LogManager#getOrCreateLog() selects a log dir
> for the new replica from _liveLogDirs, then one of two things can happen:
> 1) getAbsolutePath can fail, in which case getOrCreateLog will throw an
> IOException
> 2) Creating the directory for the new replica log may fail (_e.g._, if the
> directory became read-only, in which case getAbsolutePath still succeeds).
> In both cases, the selected dir will be marked offline (which is correct).
> However, LeaderAndIsr will return an error and the replica will be marked
> offline, even though the broker may have other live dirs.
> *Proposed solution*: The broker should retry selecting a dir for the new
> replica if the initially selected dir threw an IOException when trying to
> create a directory for the new replica. We should be able to do that in the
> LogManager#getOrCreateLog() method, but keep in mind that
> logDirFailureChannel.maybeAddOfflineLogDir does not synchronously remove the
> dir from _liveLogDirs. So, it makes sense to select the initial dir by calling
> LogManager#nextLogDir (current implementation), but if we fail to create the
> log on that dir, one approach is to select the next dir from _liveLogDirs in
> round-robin fashion until we get back to the initial log dir, which is the
> case where all dirs failed.
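
The round-robin retry proposed in the description could look roughly like the Java sketch below. It is only an illustration of the idea, not the change that was committed for this issue. Every name in it (LogDirRetrySketch, ReplicaDirCreator, createWithRetry) is hypothetical, and the list argument stands in for a snapshot of _liveLogDirs. Because the failed dir is not removed from that list synchronously, the loop walks a fixed snapshot and stops after one full pass instead of re-reading the list.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class LogDirRetrySketch {

    /** Hypothetical stand-in for the step that creates the replica's directory in a log dir. */
    interface ReplicaDirCreator {
        Path create(Path logDir, String replicaDirName) throws IOException;
    }

    /**
     * Tries the initially selected dir first, then the remaining dirs in
     * round-robin order. Gives up once every dir in the snapshot has been tried.
     */
    static Path createWithRetry(List<Path> liveLogDirs, int initialIndex,
                                String replicaDirName, ReplicaDirCreator creator) {
        if (liveLogDirs.isEmpty()) {
            throw new IllegalArgumentException("no live log dirs");
        }
        int n = liveLogDirs.size();
        IOException lastFailure = null;
        for (int attempt = 0; attempt < n; attempt++) {
            Path candidate = liveLogDirs.get(Math.floorMod(initialIndex + attempt, n));
            try {
                return creator.create(candidate, replicaDirName);
            } catch (IOException e) {
                // Assumption: a real broker would report the dir as offline here.
                lastFailure = e;
            }
        }
        // We are back at the initial dir: every dir in the snapshot failed.
        throw new UncheckedIOException("all log dirs failed for " + replicaDirName, lastFailure);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("logdirs");
        List<Path> dirs = List.of(base.resolve("d0"), base.resolve("d1"), base.resolve("d2"));
        for (Path d : dirs) {
            Files.createDirectories(d);
        }
        // Simulate d1 having failed without the broker knowing about it yet.
        ReplicaDirCreator creator = (dir, name) -> {
            if (dir.getFileName().toString().equals("d1")) {
                throw new IOException("simulated failure on " + dir);
            }
            return Files.createDirectories(dir.resolve(name));
        };
        Path created = createWithRetry(dirs, 1, "topic-0", creator);
        System.out.println(created.getParent().getFileName()); // prints "d2"
    }
}
```

In this sketch, total failure surfaces as an unchecked exception carrying the last IOException as its cause, so the caller can tell "every dir failed" apart from a single bad dir.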