[ https://issues.apache.org/jira/browse/KAFKA-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson resolved KAFKA-8526.
------------------------------------
    Resolution: Fixed
 Fix Version/s: 2.4.0

> Broker may select a failed dir for new replica even in the presence of other live dirs
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8526
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8526
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1
>            Reporter: Anna Povzner
>            Assignee: Igor Soarez
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Suppose a broker is configured with multiple log dirs. One of the log dirs
> fails, but there is no load on that dir, so the broker does not know about
> the failure yet, _i.e._, the failed dir is still in LogManager#_liveLogDirs.
> Suppose a new topic gets created, and the controller chooses the broker with
> the failed log dir to host one of the replicas. The broker gets a
> LeaderAndIsr request with the isNew flag set. LogManager#getOrCreateLog()
> selects a log dir for the new replica from _liveLogDirs, then one of two
> things can happen:
> 1) getAbsolutePath can fail, in which case getOrCreateLog will throw an
> IOException
> 2) Creating the directory for the new replica log may fail (_e.g._, if the
> directory becomes read-only, so getAbsolutePath worked).
> In both cases, the selected dir will be marked offline (which is correct).
> However, LeaderAndIsr will return an error and the replica will be marked
> offline, even though the broker may have other live dirs.
> *Proposed solution*: The broker should retry selecting a dir for the new
> replica if the initially selected dir threw an IOException when trying to
> create a directory for the new replica. We should be able to do that in the
> LogManager#getOrCreateLog() method, but keep in mind that
> logDirFailureChannel.maybeAddOfflineLogDir does not synchronously remove the
> dir from _liveLogDirs.
> So, it makes sense to select the initial dir by calling
> LogManager#nextLogDir (current implementation), but if we fail to create the
> log on that dir, one approach is to select the next dir from _liveLogDirs in
> round-robin fashion (until we get back to the initial log dir, which is the
> case where all dirs failed).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
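
The round-robin retry proposed in the description can be sketched roughly as follows. This is a minimal illustration in plain Java, not Kafka's LogManager code and not necessarily how the fix shipped in 2.4.0. The class `LogDirRetrySketch`, the `ReplicaDirCreator` interface, and the `createWithRetry` method are hypothetical names introduced here, and the sketch assumes `initialIndex` is the position of the dir that the initial selection returned.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; not Kafka's LogManager. All names here are hypothetical.
class LogDirRetrySketch {

    /** Hypothetical stand-in for the step that creates the new replica's log directory. */
    interface ReplicaDirCreator {
        File create(File logDir, String replicaName) throws IOException;
    }

    /**
     * Tries the initially selected dir first, then the remaining dirs in
     * round-robin order. Returns empty once the scan has wrapped back around
     * to the initial dir, i.e. every dir failed.
     */
    static Optional<File> createWithRetry(List<File> liveLogDirs, int initialIndex,
                                          String replicaName, ReplicaDirCreator creator) {
        int n = liveLogDirs.size();
        for (int attempt = 0; attempt < n; attempt++) {
            File candidate = liveLogDirs.get((initialIndex + attempt) % n);
            try {
                return Optional.of(creator.create(candidate, replicaName));
            } catch (IOException e) {
                // A real broker would report the dir as offline at this point.
                // Per the description, that removal is not synchronous, so the
                // loop walks a fixed list instead of waiting for the failed dir
                // to disappear from it.
            }
        }
        return Optional.empty();
    }
}
```

Bounding the loop by the number of dirs means each dir is tried at most once, so the retry stops even when every dir has failed.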