[ 
https://issues.apache.org/jira/browse/HBASE-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158378#comment-16158378
 ] 

Wang, Xinglong commented on HBASE-18764:
----------------------------------------

[~elserj]
Thanks for the response.
IMO, for a system with an SLA requirement, consider a call path like: User 
request --> Service A --> Service B. If a request to Service A falls out of 
SLA, Service A has to determine whether the SLA miss was caused by the 
underlying Service B or not; in that case Service A needs knowledge about the 
responsiveness of Service B.

When HBase is slow, our first instinct is that HBase itself is unstable. So our 
first action is to check the HBase region servers, and only then the HDFS 
metrics. It's just hard to correlate these two parts.

It's a headache for us to determine what slows HBase down when responding to 
users' requests. Our customers have complained a lot about this.

For example, in one of our clusters we have Ambari Grafana with a lot of 
predefined metric monitors for both HBase and HDFS.
We can tell that some region servers sometimes have high get latency. However, 
we can't tell which underlying datanode is performing badly. Even if we find a 
datanode with an abnormal metric such as high fsyncNanoTime, we are still 
uncertain whether that node is the root cause. All we can do is guess that it 
might be a naughty node impacting the whole cluster. If we had a specific log 
entry telling us which datanode was slow while HBase was accessing it, triage 
would be very straightforward.
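To make the idea concrete, here is a minimal sketch of such a check: time the 
read, and when it exceeds a threshold, emit a line naming the datanode, 
analogous to the existing slow-sync entry. The class name, threshold, and 
message format are all illustrative, not actual HBase/HDFS code:

```java
// Illustrative sketch only: SlowReadLogger, SLOW_READ_THRESHOLD_MS and the
// message format are hypothetical names, not real HBase identifiers.
public class SlowReadLogger {
    static final long SLOW_READ_THRESHOLD_MS = 100; // assumed threshold

    // Returns a warning line when the read was slow, or null otherwise.
    static String checkSlowRead(long startMs, long endMs, String datanode) {
        long cost = endMs - startMs;
        if (cost > SLOW_READ_THRESHOLD_MS) {
            return "Slow read cost: " + cost + " ms, datanode: " + datanode;
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulated block read that took 490 ms against one datanode.
        System.out.println(checkSlowRead(0, 490, "xx.xx.xx.xx:50010"));
    }
}
```

The point is only that the datanode address appears in the same line as the 
latency, so grepping the regionserver log immediately names the suspect node.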

And most of the time, we find some region servers with high get or put 
latency, yet we cannot find anything abnormal in the datanode metrics. In that 
case, either the issue is with that region server, or the issue is with a 
remote datanode that is not always slow (so we can't tell from the datanode 
metrics).

In another case, because the underlying HDFS is a shared HDFS, it is even 
harder to find the slow datanode. Even if we find a bad datanode, how do we 
establish that this datanode is the one that impacted HBase requests and, in 
turn, user requests to HBase? Since a datanode serves a lot of traffic at the 
same time, we can't distinguish HBase traffic from other users' traffic. And 
if only the HBase user is impacted, say because a sector on a disk is bad 
while other users are fine because they never touch that sector, the issue is 
difficult to triage, because we can't trace the path from a specific HBase 
request --> datanode --> disk. In a slow-read case, that path is invisible to 
users and even to the HBase admin, and if we have no clue which datanode is 
slow, we can't proceed.

IMO, life would be easier with some kind of warning alerting us to a possibly 
slow datanode, rather than having to guess which one is slow and might be 
contributing to the slow HBase performance.

When we look at HDFS metrics, we see 95th percentile, 90th percentile, mean, 
median, etc. When a datanode is slow only for a small set of data (e.g. a 
small number of bad sectors on a local disk), that kind of issue is omitted 
and we can't see it in the aggregated metrics.
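A toy illustration of that blind spot, with made-up numbers: if 10 out of 
1000 reads hit bad sectors and take two seconds, a nearest-rank p95 still 
reports the normal latency, so the dashboard looks healthy:

```java
import java.util.Arrays;

// Hypothetical numbers: 1000 reads, 10 of which hit bad sectors.
public class PercentileHides {
    // Nearest-rank-style percentile over a copy of the samples.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[(int) (p * (sorted.length - 1))];
    }

    public static void main(String[] args) {
        long[] latMs = new long[1000];
        Arrays.fill(latMs, 5);       // 990 normal reads at 5 ms
        for (int i = 0; i < 10; i++) {
            latMs[i] = 2000;         // 10 reads hitting bad sectors
        }
        // p95 still reports 5 ms even though 1% of reads take 2 s,
        // so the slow reads never surface in the aggregated metric.
        System.out.println("p95=" + percentile(latMs, 0.95) + "ms"
                + " max=" + percentile(latMs, 1.0) + "ms");
    }
}
```

Only a per-read log entry (or a max/outlier metric) makes those 10 reads 
visible.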

> add slow read block log entry to alert slow datanodeinfo when reading a block 
> is slow
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-18764
>                 URL: https://issues.apache.org/jira/browse/HBASE-18764
>             Project: HBase
>          Issue Type: Improvement
>          Components: HFile
>    Affects Versions: 1.1.2
>            Reporter: Wang, Xinglong
>            Priority: Minor
>         Attachments: HBASE-18764.rev1.1.2.patch
>
>
> HBase sits on top of HDFS, and both are distributed systems. HBase is also 
> impacted when there is a straggler datanode due to a network/disk/cpu issue: 
> all HBase reads/scans against that datanode will be slowed down. It's not 
> easy for an HBase admin to find the straggler datanode in such a case.
> Meanwhile, we already have a log entry known as slow sync; one such entry is 
> shown below. It helps the HBase admin quickly identify the slow datanode in 
> the pipeline in case of a network/disk/cpu issue with one of the 3 datanodes 
> in the pipeline.
> {noformat}
> 2017-07-08 19:11:30,538 INFO  [sync.3] wal.FSHLog: Slow sync cost: 490189 ms, 
> current pipeline: 
> [DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-c391299a-aa9f-4146-ac7e-a493ae536bff,DISK],
> DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-21a85f8b-f389-4f9e-95a8-b711945fd210,DISK],
> DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-aa48cef2-3554-482f-b49d-be4763f4d8b8,DISK]]
> {noformat}
> Inspired by the slow sync log entry, I think it would also be beneficial to 
> print such an entry when we encounter a slow read, so that it becomes easy 
> to identify the slow datanode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
