[
https://issues.apache.org/jira/browse/KAFKA-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viktor Somogyi-Vass reassigned KAFKA-9118:
------------------------------------------
Assignee: David Arthur (was: Viktor Somogyi-Vass)
> LogDirFailureHandler shouldn't use Zookeeper
> --------------------------------------------
>
> Key: KAFKA-9118
> URL: https://issues.apache.org/jira/browse/KAFKA-9118
> Project: Kafka
> Issue Type: Improvement
> Reporter: Viktor Somogyi-Vass
> Assignee: David Arthur
> Priority: Major
>
> As described in
> [KIP-112|https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD#KIP-112:HandlediskfailureforJBOD-Zookeeper]:
> {noformat}
> 2. A log directory stops working on a broker during runtime
> - The controller watches the path /log_dir_event_notification for new znode.
> - The broker detects offline log directories during runtime.
> - The broker takes actions as if it has received StopReplicaRequest for this
> replica. More specifically, the replica is no longer considered leader and is
> removed from any replica fetcher thread. (The clients will receive a
> UnknownTopicOrPartitionException at this point)
> - The broker notifies the controller by creating a sequential znode under
> path /log_dir_event_notification with data of the format {"version" : 1,
> "broker" : brokerId, "event" : LogDirFailure}.
> - The controller reads the znode to get the brokerId and finds that the event
> type is LogDirFailure.
> - The controller deletes the notification znode
> - The controller sends LeaderAndIsrRequest to that broker to query the state
> of all topic partitions on the broker. The LeaderAndIsrResponse from this
> broker will specify KafkaStorageException for those partitions that are on
> the bad log directories.
> - The controller updates the information of offline replicas in memory and
> trigger leader election as appropriate.
> - The controller removes offline replicas from ISR in the ZK and sends
> LeaderAndIsrRequest with updated ISR to be used by partition leaders.
> - The controller propagates the information of offline replicas to brokers by
> sending UpdateMetadataRequest.
> {noformat}
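> The notification step quoted above can be sketched roughly as follows. This
> is an illustration only, not Kafka's implementation: FakeZk, the
> log_dir_event_ node prefix, the function names and the integer encoding of
> the LogDirFailure event are all assumptions made for the example, and no
> real ZooKeeper client is involved.

```python
import json

# Assumption: LogDirFailure is encoded as the integer 1; the KIP text only names the event.
LOG_DIR_FAILURE = 1
NOTIFICATION_PATH = "/log_dir_event_notification"


def encode_notification(broker_id):
    """Build the znode data a broker would write after a log directory fails."""
    payload = {"version": 1, "broker": broker_id, "event": LOG_DIR_FAILURE}
    return json.dumps(payload).encode("utf-8")


def decode_notification(data):
    """Return the broker id, or None if the data is not a LogDirFailure event."""
    payload = json.loads(data.decode("utf-8"))
    if payload.get("version") != 1 or payload.get("event") != LOG_DIR_FAILURE:
        return None
    return payload["broker"]


class FakeZk:
    """Hypothetical in-memory stand-in for ZooKeeper sequential znodes."""

    def __init__(self):
        self.nodes = {}
        self.next_seq = 0

    def create_sequential(self, parent, data):
        # Mimic a sequential znode by appending a zero-padded counter.
        path = "%s/log_dir_event_%010d" % (parent, self.next_seq)
        self.next_seq += 1
        self.nodes[path] = data
        return path


def drain_notifications(zk):
    """Controller side: read each znode, delete it, collect the broker ids."""
    broker_ids = []
    for path in sorted(zk.nodes):
        broker_id = decode_notification(zk.nodes[path])
        del zk.nodes[path]
        if broker_id is not None:
            broker_ids.append(broker_id)
    return broker_ids
```

> In this sketch the controller would then send a LeaderAndIsrRequest to each
> broker id returned by drain_notifications, as the quoted steps describe.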
> Instead of the notification znode, we should use a Kafka protocol request
> that notifies the controller directly and lists the offline partitions. The
> controller then updates the information of offline replicas in memory,
> triggers leader election, removes the replicas from the ISR in ZK, and sends
> a LeaderAndIsrRequest and an UpdateMetadataRequest.
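> A minimal sketch of the proposed flow, under stated assumptions: the
> notification type, its fields and the Controller class below are
> hypothetical names for illustration, not an existing Kafka API. Leader
> election is simplified to "first remaining ISR member", and edge cases such
> as retaining the last ISR member are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionState:
    replicas: list
    isr: list
    leader: int  # -1 means the partition has no leader


@dataclass
class LogDirFailureNotification:  # hypothetical message sent by the broker
    broker_id: int
    offline_partitions: list  # partition names such as "topic-0"


@dataclass
class Controller:
    partitions: dict
    offline_replicas: set = field(default_factory=set)
    sent: list = field(default_factory=list)  # requests the controller would send

    def handle(self, notification):
        broker = notification.broker_id
        changed = []
        for tp in notification.offline_partitions:
            state = self.partitions.get(tp)
            if state is None or broker not in state.replicas:
                continue  # ignore partitions this broker does not host
            # 1. Record the offline replica in memory.
            self.offline_replicas.add((tp, broker))
            # 2. Elect a new leader if the failed replica was leading.
            remaining = [r for r in state.isr if r != broker]
            if state.leader == broker:
                state.leader = remaining[0] if remaining else -1
            # 3. Remove the replica from the ISR.
            state.isr = remaining
            changed.append(tp)
        if changed:
            # 4. Propagate the new state to the brokers.
            self.sent.append(("LeaderAndIsrRequest", tuple(changed)))
            self.sent.append(("UpdateMetadataRequest", tuple(changed)))
        return changed
```

> Compared with the znode-based flow, the broker reports the affected
> partitions itself, so the controller in this sketch does not need an extra
> LeaderAndIsrRequest round trip to discover them.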