[ https://issues.apache.org/jira/browse/KAFKA-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745672#comment-16745672 ]
Jun Rao commented on KAFKA-7836:
--------------------------------
[~lindong], it seems that we could call zkClient.propagateLogDirEvent after the
relevant partitions are marked offline, but before
logManager.handleLogDirFailure, to speed up the propagation of log dir failure
to the controller. Do you see any issue with that? Thanks.
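
The reordering suggested above can be sketched as follows. This is only an
illustrative model, not the actual ReplicaManager code: the class name, the
method handle(), and the step labels are hypothetical stand-ins for marking
partitions offline, zkClient.propagateLogDirEvent, and
logManager.handleLogDirFailure. It records the order in which the steps would
run under the current and the proposed sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the step ordering; not Kafka's real implementation.
public class LogDirFailureOrdering {

    // Returns the order in which the three steps run.
    // notifyBeforeClose = false models the current order,
    // notifyBeforeClose = true models the proposed order.
    static List<String> handle(boolean notifyBeforeClose) {
        List<String> order = new ArrayList<>();

        // Stand-in for marking the affected partitions offline.
        Runnable markOffline = () -> order.add("markPartitionsOffline");
        // Stand-in for zkClient.propagateLogDirEvent (notifies the controller).
        Runnable notifyController = () -> order.add("propagateLogDirEvent");
        // Stand-in for logManager.handleLogDirFailure (closes file handles,
        // which may be slow on a bad disk).
        Runnable closeHandles = () -> order.add("closeFileHandles");

        // Partitions are marked offline first in both variants.
        markOffline.run();
        if (notifyBeforeClose) {
            notifyController.run();
            closeHandles.run();
        } else {
            closeHandles.run();
            notifyController.run();
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println("current:  " + handle(false));
        System.out.println("proposed: " + handle(true));
    }
}
```

In the proposed order the controller notification no longer waits on the
potentially slow close of file handles, while the partitions are still marked
offline before the notification is sent.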
> The propagation of log dir failure can be delayed due to slowness in closing
> the file handles
> ---------------------------------------------------------------------------------------------
>
> Key: KAFKA-7836
> URL: https://issues.apache.org/jira/browse/KAFKA-7836
> Project: Kafka
> Issue Type: Improvement
> Reporter: Jun Rao
> Priority: Major
>
> In ReplicaManager.handleLogDirFailure(), we call
> zkClient.propagateLogDirEvent after logManager.handleLogDirFailure. The
> latter closes the file handles of the offline replicas, which can take a long
> time when the disk is bad. This delays the election of new leaders by the
> controller. In one incident, closing the file handles of multiple replicas
> took more than 20 seconds.