[
https://issues.apache.org/jira/browse/HBASE-26347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427105#comment-17427105
]
Duo Zhang commented on HBASE-26347:
-----------------------------------
IIRC the old design was to set a smaller timeout, so that with a slow DN we
would time out and roll the WAL writer. But the timeout cannot be too small
anyway, so adding another threshold to check for slow DNs is still valuable, I
guess.
I cannot recall whether we have logic for maintaining excluded DNs. Thanks for
bringing this up.
> Support detect and exclude slow DNs in fan-out of WAL
> -----------------------------------------------------
>
> Key: HBASE-26347
> URL: https://issues.apache.org/jira/browse/HBASE-26347
> Project: HBase
> Issue Type: New Feature
> Components: wal
> Affects Versions: 2.0.0, 3.0.0-alpha-2
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
>
> We all know that WAL sync performance directly affects RPC processing time.
> We use the self-designed FanOutOneBlockAsyncDFSOutput to sync WAL entries,
> which connects directly to all the DNs where the block is located. But when
> even one of those DNs is slow, e.g. because of a disk hardware failure, WAL
> syncs become slow. What's more, the lower-layer HDFS system is not very
> sensitive in detecting such hardware failures.
> We can detect slow DNs by the ACK time of packets in
> FanOutOneBlockAsyncDFSOutput, and exclude them when adding new blocks after
> the log is rolled (rolling the log can also be triggered by slow syncs). We
> can also show this info in the UI, and invalidate the excluded DN cache
> entries after a duration, so that we become aware of the recovery of those
> DNs.
> I think this idea can quickly reduce the impact of slow DNs and improve
> service availability.
>
>
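For illustration only, the detect/exclude/expire idea in the description above
could be sketched roughly as follows. This is not code from HBase or from any
patch on this issue: the class SlowDataNodeTracker, its method names, and the
three tuning parameters (slow ACK threshold, number of consecutive slow ACKs
before exclusion, and exclusion TTL) are all hypothetical. The caller passes in
the current time so the behaviour is deterministic.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not an HBase class: tracks slow DNs by packet ACK latency. */
final class SlowDataNodeTracker {
  private final long slowAckThresholdMs;   // assumed knob: ACKs at or above this are "slow"
  private final int slowAcksBeforeExclude; // assumed knob: consecutive slow ACKs needed
  private final long excludeTtlMs;         // assumed knob: how long an exclusion lasts

  // DN name -> current streak of consecutive slow ACKs
  private final Map<String, Integer> slowAckCounts = new HashMap<>();
  // DN name -> timestamp (ms) at which the exclusion expires
  private final Map<String, Long> excludedUntil = new HashMap<>();

  SlowDataNodeTracker(long slowAckThresholdMs, int slowAcksBeforeExclude, long excludeTtlMs) {
    this.slowAckThresholdMs = slowAckThresholdMs;
    this.slowAcksBeforeExclude = slowAcksBeforeExclude;
    this.excludeTtlMs = excludeTtlMs;
  }

  /** Record the ACK latency observed for one packet sent to the given DN. */
  synchronized void recordAck(String dataNode, long ackLatencyMs, long nowMs) {
    if (ackLatencyMs < slowAckThresholdMs) {
      slowAckCounts.remove(dataNode); // a fast ACK resets the streak
      return;
    }
    int streak = slowAckCounts.merge(dataNode, 1, Integer::sum);
    if (streak >= slowAcksBeforeExclude) {
      excludedUntil.put(dataNode, nowMs + excludeTtlMs);
      slowAckCounts.remove(dataNode);
    }
  }

  /** DNs to avoid when asking for a new block; expired entries are dropped first. */
  synchronized Set<String> excludedDataNodes(long nowMs) {
    excludedUntil.values().removeIf(until -> until <= nowMs);
    return new HashSet<>(excludedUntil.keySet());
  }
}
```

In such a design, the set returned by excludedDataNodes would be consulted when
a new block is requested after a log roll, and the TTL is what lets a recovered
DN be picked again. Whether a streak counter or some other slowness criterion
is appropriate is left open here.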