[jira] [Updated] (HBASE-15160) Put back HFile's HDFS op latency sampling code and add metrics for monitoring

Yu Li (JIRA) Fri, 02 Jun 2017 01:22:42 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yu Li updated HBASE-15160:
--------------------------
    Attachment: hbase-15160_v7.patch

Confirmed that the performance regression disappears when 
{{System#currentTimeMillis}} is used:
|| Case || Throughput (ops/s) || AverageLatency (us) ||
| w/o patch | 122079.26 | 26019.93 |
| w/ patch v7 | 121693.28 | 26688.72 |

Although this might only happen when using a fast disk such as PCIe-SSD, I 
think we should still make the change. What's more, millisecond granularity 
should be enough to monitor spikes. Below is the metrics data from the test 
with PCIe-SSD:
{noformat}
    "FsPReadTime_num_ops" : 21828053,
    "FsPReadTime_min" : 0,
    "FsPReadTime_max" : 103,
    "FsPReadTime_mean" : 3,
    "FsPReadTime_25th_percentile" : 0,
    "FsPReadTime_median" : 0,
    "FsPReadTime_75th_percentile" : 5,
    "FsPReadTime_90th_percentile" : 7,
    "FsPReadTime_95th_percentile" : 9,
    "FsPReadTime_98th_percentile" : 17,
    "FsPReadTime_99th_percentile" : 91,
    "FsPReadTime_99.9th_percentile" : 98,
    "FsPReadTime_TimeRangeCount_0-1" : 26267,
    "FsPReadTime_TimeRangeCount_1-3" : 455,
    "FsPReadTime_TimeRangeCount_3-10" : 8366,
    "FsPReadTime_TimeRangeCount_10-30" : 661,
    "FsPReadTime_TimeRangeCount_30-100" : 705,
    "FsPReadTime_TimeRangeCount_100-300" : 15,
    "FsPReadTime_TimeRangeCount_600000-inf" : 21791593,
{noformat}
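
For illustration only, below is a minimal Java sketch of millisecond-granularity latency sampling of the kind discussed above. It is not the code in the attached patch: the class name PreadLatencySketch, its methods, and the bucket boundaries (picked to resemble the TimeRangeCount names in the dump) are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Supplier;

/** Hypothetical sketch; names and bucket boundaries are assumptions, not HBase API. */
class PreadLatencySketch {
  // Exclusive upper bounds of the range buckets, in ms (assumed values).
  // One extra, open-ended bucket catches everything at or above the last bound.
  private static final long[] BOUNDS_MS = {1, 3, 10, 30, 100, 300};

  private final AtomicLong numOps = new AtomicLong();
  private final AtomicLong totalMs = new AtomicLong();
  private final AtomicLong maxMs = new AtomicLong();
  private final AtomicLongArray buckets = new AtomicLongArray(BOUNDS_MS.length + 1);

  /** Returns the index of the bucket that a sample of elapsedMs falls into. */
  public static int bucketIndex(long elapsedMs) {
    for (int i = 0; i < BOUNDS_MS.length; i++) {
      if (elapsedMs < BOUNDS_MS[i]) {
        return i;
      }
    }
    return BOUNDS_MS.length;
  }

  /** Records one latency sample, in milliseconds. */
  public void update(long elapsedMs) {
    numOps.incrementAndGet();
    totalMs.addAndGet(elapsedMs);
    maxMs.accumulateAndGet(elapsedMs, Math::max);
    buckets.incrementAndGet(bucketIndex(elapsedMs));
  }

  /** Runs the given read, timing it with System.currentTimeMillis(). */
  public <T> T timed(Supplier<T> read) {
    long start = System.currentTimeMillis();
    try {
      return read.get();
    } finally {
      // The wall clock can step backwards, so clamp negative samples to 0.
      update(Math.max(0L, System.currentTimeMillis() - start));
    }
  }

  public long getNumOps() {
    return numOps.get();
  }

  public long getMaxMs() {
    return maxMs.get();
  }

  public long getMeanMs() {
    long n = numOps.get();
    return n == 0 ? 0 : totalMs.get() / n;
  }

  public long getBucketCount(int index) {
    return buckets.get(index);
  }
}
```

A caller would wrap each read, for example {{metrics.timed(() -> doPread())}}, where doPread is a stand-in for the actual read call. With millisecond resolution, most sub-millisecond reads record as 0, which is consistent with the median of 0 in the data above.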

> Put back HFile's HDFS op latency sampling code and add metrics for monitoring
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-15160
>                 URL: https://issues.apache.org/jira/browse/HBASE-15160
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0, 1.1.2
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Critical
>         Attachments: HBASE-15160.patch, HBASE-15160_v2.patch, 
> HBASE-15160_v3.patch, hbase-15160_v4.patch, hbase-15160_v5.patch, 
> hbase-15160_v6.patch, hbase-15160_v7.patch
>
>
> In HBASE-11586 all HDFS op latency sampling code, including fsReadLatency, 
> fsPreadLatency and fsWriteLatency, was removed. There was some 
> discussion about putting it back in a new JIRA, but that never happened. 
> In our experience, these metrics are useful for judging whether the issue 
> lies in HDFS when a slow request occurs, so we propose to put them back in 
> this JIRA and add the metrics for monitoring as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
