[ https://issues.apache.org/jira/browse/KAFKA-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045734#comment-17045734 ]
Viktor Somogyi-Vass commented on KAFKA-9118:
--------------------------------------------
Hey Boyang, I opened https://github.com/apache/kafka/pull/7716/files, which
adds a BrokerToController request thread. This issue is currently blocked on
that PR. If you could help review it, we could unblock this issue and you
could work on it. I'll update 7716 tomorrow so it is no longer a draft.
> LogDirFailureHandler shouldn't use Zookeeper
> --------------------------------------------
>
> Key: KAFKA-9118
> URL: https://issues.apache.org/jira/browse/KAFKA-9118
> Project: Kafka
> Issue Type: Sub-task
> Reporter: Viktor Somogyi-Vass
> Assignee: Viktor Somogyi-Vass
> Priority: Major
>
> As described in
> [KIP-112|https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD#KIP-112:HandlediskfailureforJBOD-Zookeeper]:
> {noformat}
> 2. A log directory stops working on a broker during runtime
> - The controller watches the path /log_dir_event_notification for new znode.
> - The broker detects offline log directories during runtime.
> - The broker takes actions as if it has received StopReplicaRequest for this
> replica. More specifically, the replica is no longer considered leader and is
> removed from any replica fetcher thread. (The clients will receive a
> UnknownTopicOrPartitionException at this point)
> - The broker notifies the controller by creating a sequential znode under
> path /log_dir_event_notification with data of the format {"version" : 1,
> "broker" : brokerId, "event" : LogDirFailure}.
> - The controller reads the znode to get the brokerId and finds that the event
> type is LogDirFailure.
> - The controller deletes the notification znode
> - The controller sends LeaderAndIsrRequest to that broker to query the state
> of all topic partitions on the broker. The LeaderAndIsrResponse from this
> broker will specify KafkaStorageException for those partitions that are on
> the bad log directories.
> - The controller updates the information of offline replicas in memory and
> triggers leader election as appropriate.
> - The controller removes offline replicas from ISR in the ZK and sends
> LeaderAndIsrRequest with updated ISR to be used by partition leaders.
> - The controller propagates the information of offline replicas to brokers by
> sending UpdateMetadataRequest.
> {noformat}
> Instead of the notification znode, we should use a Kafka protocol message
> that notifies the controller of the offline partitions. The controller then
> updates the information of offline replicas in memory and triggers leader
> election, removes the replicas from the ISR in ZK, and sends a
> LeaderAndIsrRequest (LAIR) and an UpdateMetadataRequest.
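
A minimal sketch of how the proposed notification could be handled on the
controller side. This is not the actual Kafka implementation: the issue does
not define the request format, so the class names (LogDirFailureSketch,
Notification, ControllerState), the string partition keys, and the in-memory
maps are all hypothetical. It only illustrates the in-memory steps described
above: mark the replicas offline, drop them from the ISR, and collect the
partitions whose leader was on the failed broker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LogDirFailureSketch {

    // Hypothetical payload a broker would send instead of creating a znode.
    static final class Notification {
        final int brokerId;
        final List<String> offlinePartitions; // e.g. "topic-0"

        Notification(int brokerId, List<String> offlinePartitions) {
            this.brokerId = brokerId;
            this.offlinePartitions = offlinePartitions;
        }
    }

    // Hypothetical, heavily simplified view of the controller's in-memory state.
    static final class ControllerState {
        final Map<String, Integer> leaders = new HashMap<>();
        final Map<String, Set<Integer>> isr = new HashMap<>();
        final Map<String, Set<Integer>> offlineReplicas = new HashMap<>();

        // Records the offline replicas and returns the partitions that need
        // a new leader because their leader was on the failed broker.
        List<String> handle(Notification n) {
            List<String> needElection = new ArrayList<>();
            for (String tp : n.offlinePartitions) {
                offlineReplicas.computeIfAbsent(tp, k -> new HashSet<>()).add(n.brokerId);
                Set<Integer> inSync = isr.get(tp);
                if (inSync != null) {
                    inSync.remove(n.brokerId);
                }
                Integer leader = leaders.get(tp);
                if (leader != null && leader == n.brokerId) {
                    needElection.add(tp);
                }
            }
            return needElection;
        }
    }

    public static void main(String[] args) {
        ControllerState state = new ControllerState();
        state.leaders.put("t-0", 1);
        state.leaders.put("t-1", 2);
        state.isr.put("t-0", new HashSet<>(Arrays.asList(1, 2)));
        state.isr.put("t-1", new HashSet<>(Arrays.asList(1, 2)));

        List<String> needElection =
            state.handle(new Notification(1, Arrays.asList("t-0", "t-1")));
        System.out.println("Partitions needing election: " + needElection);
    }
}
```

In a real controller the follow-up steps (persisting the ISR change and
sending the LeaderAndIsrRequest and UpdateMetadataRequest) would come after
this in-memory update. They are left out here because they depend on
controller internals the issue does not spell out.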