[ 
https://issues.apache.org/jira/browse/HBASE-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157151#comment-16157151
 ] 

Josh Elser commented on HBASE-18764:
------------------------------------

[~suxingfate], thanks for the patch.

It makes me wonder whether this is a good long-term strategy for us in HBase. 
While, as you outline, such log messages can tell us that some HBase slowness 
isn't actually an HBase problem, I think it would be a better use of our time, 
long-term, to improve the metrics that HDFS exposes and to change our mindset 
so that we start thinking about the metrics for all systems HBase uses, not 
just the HBase metrics/logging.

For example, in the case that you are outlining, an operator should be able 
to monitor HDFS and observe that one DataNode is pathologically slower than 
the rest (and that it requires human attention). We shouldn't have to rely on 
HBase to tell us that HDFS is having issues. What do you think?
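
As a rough illustration of the kind of check an operator-side monitor could 
run, here is a minimal sketch. It is not an HDFS or HBase API: the function 
name, the shape of the input (per-DataNode read latency samples in ms, 
collected by whatever means), and the 3x threshold are all assumptions made 
up for the example.

```python
from statistics import median


def find_slow_datanodes(latencies_ms, factor=3.0):
    """Flag DataNodes whose median latency is far above the cluster's.

    latencies_ms: hypothetical dict of DataNode name -> list of latency
    samples in milliseconds. factor: hypothetical outlier threshold.
    """
    # Median latency per DataNode, skipping nodes with no samples.
    per_node = {
        node: median(samples)
        for node, samples in latencies_ms.items()
        if samples
    }
    # With fewer than two nodes there is nothing to compare against.
    if len(per_node) < 2:
        return []
    # Median of the per-node medians, so one bad node cannot skew the baseline.
    cluster_median = median(per_node.values())
    return sorted(
        node for node, value in per_node.items()
        if value > factor * cluster_median
    )


samples = {
    "dn1:50010": [12, 15, 11, 14],
    "dn2:50010": [13, 12, 16, 15],
    "dn3:50010": [480, 510, 495, 530],
}
print(find_slow_datanodes(samples))  # prints ['dn3:50010']
```

Comparing against the median of medians rather than the mean is a deliberate 
choice here: a single pathologically slow DataNode would otherwise inflate 
the baseline it is being compared to.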

> add slow read block log entry to alert slow datanodeinfo when reading a block 
> is slow
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-18764
>                 URL: https://issues.apache.org/jira/browse/HBASE-18764
>             Project: HBase
>          Issue Type: Improvement
>          Components: HFile
>    Affects Versions: 1.1.2
>            Reporter: Wang, Xinglong
>            Priority: Minor
>         Attachments: HBASE-18764.rev1.1.2.patch
>
>
> HBase runs on top of HDFS, and both are distributed systems. HBase is also 
> impacted when there is a straggler DataNode due to network/disk/CPU issues: 
> all HBase reads/scans that go to that DataNode slow down. It's not easy for 
> an HBase admin to find the straggler DataNode in such a case.
> We already have a log entry known as slow sync; one such entry is shown 
> below. It helps an HBase admin quickly identify the slow DataNode in the 
> pipeline when there is a network/disk/CPU issue with one of the 3 DataNodes 
> in the pipeline.
> {noformat}
> 2017-07-08 19:11:30,538 INFO  [sync.3] wal.FSHLog: Slow sync cost: 490189 ms, 
> current pipeline: 
> [DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-c391299a-aa9f-4146-ac7e-a493ae536bff,DISK],
>  DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-21a85f8b-f389-4f9e-95a8-b711945fd210,DISK], 
> DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-aa48cef2-3554-482f-b49d-be4763f4d8b8,DISK]]
> {noformat}
> Inspired by the slow sync log entry, I think it would also be beneficial to 
> print a similar entry when we encounter a slow read, so that it is easy to 
> identify the slow DataNode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)