[
https://issues.apache.org/jira/browse/HBASE-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158378#comment-16158378
]
Wang, Xinglong commented on HBASE-18764:
----------------------------------------
[~elserj]
Thanks for the response.
IMO, for a system with an SLA requirement, consider a call path like: user
request --> Service A --> Service B. If a request to Service A falls outside
its SLA, then Service A has to determine whether the SLA miss was caused by
the underlying Service B or not, so Service A needs knowledge of Service B's
responsiveness.
When HBase is slow, our first instinct is that HBase is unstable, so our
first action is to check the HBase region servers, and only then the HDFS
metrics. It's just hard to correlate these two parts.
It's a headache for us to determine what is slowing HBase down when it
responds to users' requests, and our customers complain a lot about this.
For example, on one of our clusters we have Ambari Grafana with many
predefined metric dashboards for both HBase and HDFS. We can tell that some
region servers sometimes have high get latency, but we can't tell which
underlying datanode is performing badly. Even if we find a datanode with an
abnormal metric such as a high fsyncNanoTime, we are still uncertain that
this node is the root cause. All we can do is guess that this might be a
misbehaving node impacting the whole cluster. If we had a specific log entry
telling us which datanode was slow while HBase was accessing it, the problem
would be very straightforward.
And most of the time, we find region servers with high get or put latency
but nothing abnormal in the datanode metrics. In that case, either the issue
is with the region server itself, or it is with a remote datanode that is
not always slow (so we can't tell from the datanode metrics).
In another case, where the underlying HDFS is shared, it is even harder to
find the slow datanode. Even if we find a bad datanode, how do we establish
that it is the datanode that impacted an HBase request and in turn impacted
user requests to HBase? A datanode serves a lot of traffic at the same time,
and we can't distinguish HBase traffic from other users' traffic. And if
only the HBase user is impacted, say because of a bad sector on a disk,
while other users are fine because they don't touch that sector, the issue
is difficult to triage, because we can't trace the path from a specific
HBase request --> datanode --> disk. In a slow-read case, that path is
invisible to users and even to the HBase admin, and if we have no clue which
datanode is slow, we can't proceed.
IMO, it would make life easier if we had some kind of warning to alert us
about a possible slow datanode, rather than having to guess which one is
slow and might be contributing to the slow HBase performance.
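The kind of warning asked for above could be sketched roughly as follows.
This is a minimal sketch, not actual HBase code: the class name, threshold,
and message wording are hypothetical, modeled on the existing slow sync
message from wal.FSHLog.

```java
/**
 * Sketch of the proposed slow-read warning (hypothetical names, not actual
 * HBase code): time each block read and, when it exceeds a threshold,
 * report which datanode served it, mirroring the "Slow sync cost" message.
 */
public class SlowReadCheck {
    // Hypothetical threshold; a real patch would read this from configuration.
    private final long slowReadThresholdMs;

    public SlowReadCheck(long slowReadThresholdMs) {
        this.slowReadThresholdMs = slowReadThresholdMs;
    }

    /**
     * Returns the warning line to log if the read was slow, or null if the
     * read completed within the threshold.
     */
    public String check(long elapsedMs, String datanodeInfo) {
        if (elapsedMs <= slowReadThresholdMs) {
            return null;
        }
        return "Slow read block cost: " + elapsedMs
            + " ms, current datanode: " + datanodeInfo;
    }
}
```

With an entry like this in the region server log, the admin can go straight
from a slow HBase read to the datanode that served the block, instead of
guessing from cluster-wide metrics.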
When we look at HDFS metrics, we see the 95th percentile, 90th percentile,
mean, median, etc. If a datanode is slow only when accessing a small set of
data (say, a small number of bad sectors on a local disk), this kind of
issue is hidden, and we can't see it in the aggregated metrics.
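To make that last point concrete, here is a small sketch with synthetic
numbers (not measurements from our cluster) showing how a small fraction of
very slow reads disappears from a percentile metric:

```java
import java.util.Arrays;

public class PercentileMasking {
    /** Nearest-rank percentile computed over a sorted copy of the samples. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // 10,000 reads: 99.5% take 5 ms, but 50 reads (0.5%) hit a bad
        // sector and take 2,000 ms each.
        long[] latencies = new long[10_000];
        Arrays.fill(latencies, 5);
        for (int i = 0; i < 50; i++) {
            latencies[i] = 2_000;
        }
        // The 95th percentile is still 5 ms: the slow reads are invisible
        // in the aggregated metric even though 50 user requests were slow.
        System.out.println("p95   = " + percentile(latencies, 95) + " ms");
        System.out.println("p99.9 = " + percentile(latencies, 99.9) + " ms");
    }
}
```

Only a very high percentile (or a per-read log entry, as proposed here)
exposes the 0.5% of reads that actually hurt users.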
> add slow read block log entry to alert slow datanodeinfo when reading a block
> is slow
> -------------------------------------------------------------------------------------
>
> Key: HBASE-18764
> URL: https://issues.apache.org/jira/browse/HBASE-18764
> Project: HBase
> Issue Type: Improvement
> Components: HFile
> Affects Versions: 1.1.2
> Reporter: Wang, Xinglong
> Priority: Minor
> Attachments: HBASE-18764.rev1.1.2.patch
>
>
> HBase sits on top of HDFS, and both are distributed systems. HBase is also
> impacted when there is a straggler datanode due to a network/disk/CPU
> issue: all HBase reads/scans towards that datanode slow down. It's not
> easy for an HBase admin to find the straggler datanode in such a case.
> Meanwhile, we already have a log entry known as slow sync; one such entry
> is shown below. It helps an HBase admin quickly identify the slow datanode
> in the pipeline in case of a network/disk/CPU issue with one of the 3
> datanodes in the pipeline.
> {noformat}
> 2017-07-08 19:11:30,538 INFO [sync.3] wal.FSHLog: Slow sync cost: 490189 ms,
> current pipeline:
> [DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-c391299a-aa9f-4146-ac7e-a493ae536bff,DISK],
> DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-21a85f8b-f389-4f9e-95a8-b711945fd210,DISK],
> DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-aa48cef2-3554-482f-b49d-be4763f4d8b8,DISK]]
> {noformat}
> Inspired by the slow sync log entry, I think it would also be beneficial
> to print such an entry when we encounter a slow read, so that the slow
> datanode is easy to identify.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)