[ 
https://issues.apache.org/jira/browse/HBASE-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HBASE-21436:
-------------------------------
    Description: 
Recently, I received feedback from a customer complaining about 
NotServingRegionException being thrown at intervals. I examined the cluster and 
found quite a lot of OOM logs, yet the metrics 
"readDataPerSecondKB/writeDataPerSecondKB" stayed at a quite low level. In this 
customer's case, each RS hosts 3k regions with a heap size of 4G. I dumped the 
heap when the OOM occurred and found a large number of Chunk objects (as many 
as 1700).
 Piecing all this evidence together, I came to the conclusion that:
 * The root cause is that the global flush is triggered by the total size of 
all memstores, rather than by the total size of all chunks.
 * A chunk is always allocated for each region, even if only a small amount of 
data is written to the region.

In this case, a total of 3.4G of memory was consumed by 1700 chunks, although 
throughput was very low.
 Although 3K regions is too many for an RS with a 4G heap, it is still wise to 
improve RS stability in such a scenario (in fact, most customers buy a 
small-sized HBase on the cloud side).

 I provide a patch (containing only a UT) to reproduce this case (just send a 
batch).
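For reference, the 3.4G figure matches back-of-envelope arithmetic, assuming 
the default MSLAB chunk size of 2 MB (hbase.hregion.memstore.mslab.chunksize); 
the class below is a hypothetical illustration, not part of the patch:

```java
// Rough sketch: memory pinned by memstore chunks when one chunk is
// allocated per active region, assuming the default 2 MB MSLAB chunk size.
public class ChunkOverheadEstimate {
    public static void main(String[] args) {
        final long chunkSizeBytes = 2L * 1024 * 1024; // 2 MB per chunk (default)
        final int chunkCount = 1700;                  // Chunk objects in the heap dump
        long totalBytes = chunkSizeBytes * chunkCount;
        // 1700 chunks * 2 MB = 3400 MB, i.e. roughly 3.4G out of a 4G heap,
        // even though very little actual data was written.
        System.out.printf("Chunks hold %d MB%n", totalBytes / (1024 * 1024));
    }
}
```

Because the global flush threshold tracks memstore data size rather than chunk 
count, this overhead never triggers a flush on its own.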

  was:
Recently, I received feedback from a customer complaining about 
NotServingRegionException being thrown at intervals. I examined the cluster and 
found quite a lot of OOM logs, yet the metrics 
"readDataPerSecondKB/writeDataPerSecondKB" stayed at a quite low level. In this 
customer's case, each RS hosts 3k regions with a heap size of 4G. I dumped the 
heap when the OOM occurred and found a large number of Chunk objects (as many 
as 1700).
 Piecing all this evidence together, I came to the conclusion that: 1. The 
root cause is that the global flush is triggered by the total size of all 
memstores, rather than by the total size of all chunks. 2. A chunk is always 
allocated for each region, even if only a small amount of data is written to 
the region.
 In this case, a total of 3.4G of memory was consumed by 1700 chunks, although 
throughput was very low.
Although 3K regions is too many for an RS with a 4G heap, it is still wise to 
improve RS stability in such a scenario (in fact, most customers buy a 
small-sized HBase on the cloud side).
 
I provide a patch (containing only a UT) to reproduce this case (just send a 
batch).


>  Getting OOM frequently if hold many regions
> --------------------------------------------
>
>                 Key: HBASE-21436
>                 URL: https://issues.apache.org/jira/browse/HBASE-21436
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 3.0.0, 1.4.8, 2.0.2
>            Reporter: Zephyr Guo
>            Priority: Major
>         Attachments: HBASE-21436-UT.patch
>
>
> Recently, I received feedback from a customer complaining about 
> NotServingRegionException being thrown at intervals. I examined the cluster 
> and found quite a lot of OOM logs, yet the metrics 
> "readDataPerSecondKB/writeDataPerSecondKB" stayed at a quite low level. In 
> this customer's case, each RS hosts 3k regions with a heap size of 4G. I 
> dumped the heap when the OOM occurred and found a large number of Chunk 
> objects (as many as 1700).
>  Piecing all this evidence together, I came to the conclusion that:
>  * The root cause is that the global flush is triggered by the total size of 
> all memstores, rather than by the total size of all chunks.
>  * A chunk is always allocated for each region, even if only a small amount 
> of data is written to the region.
> In this case, a total of 3.4G of memory was consumed by 1700 chunks, 
> although throughput was very low.
>  Although 3K regions is too many for an RS with a 4G heap, it is still wise 
> to improve RS stability in such a scenario (in fact, most customers buy a 
> small-sized HBase on the cloud side).
>   
>  I provide a patch (containing only a UT) to reproduce this case (just send 
> a batch).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
