[
https://issues.apache.org/jira/browse/HBASE-23887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127416#comment-17127416
]
Danil Lipovoy commented on HBASE-23887:
---------------------------------------
[~bharathv]
I found no difference in performance during scans:
!scan.png!
The cause appears to be low GC activity during the scan, so it doesn't matter
which BlockCache (BC) variant we test. I think that when we scan, the
bottleneck is somewhere else (it is not obvious where), and that is why the
results are the same.
Another thing: I slightly changed the logic that calculates what percentage of
data blocks to cache (cache.cacheDataBlockPercent).
I added a new parameter that helps control it:
{color:#067d17}hbase.lru.cache.heavy.eviction.overhead.coefficient = 0.01
(heavyEvictionOverheadCoefficient)
{color}
{code:java}
// How far (in percent) the evicted volume exceeds the configured limit.
freedDataOverheadPercent = (int) (mbFreedSum * 100 / cache.heavyEvictionMbSizeLimit) - 100;
...
if (heavyEvictionCount > cache.heavyEvictionCountLimit) {
  // Step down in proportion to the overhead, clamped to 0..15 points per run.
  int ch = (int) (freedDataOverheadPercent * cache.heavyEvictionOverheadCoefficient);
  ch = ch > 15 ? 15 : ch;
  ch = ch < 0 ? 0 : ch;
  cache.cacheDataBlockPercent -= ch;
  // Never cache less than 1% of data blocks.
  cache.cacheDataBlockPercent = cache.cacheDataBlockPercent < 1 ? 1 : cache.cacheDataBlockPercent;
}
{code}
And when the evicted volume drops below
{color:#067d17}*hbase.lru.cache.heavy.eviction.mb.size.limit*{color}
{color:#172b4d}we apply back pressure:
{color}
{code:java}
if (mbFreedSum >= cache.heavyEvictionMbSizeLimit * 0.1) {
  // This helps avoid leaving heavy-eviction mode during a short-term fluctuation.
  int ch = (int) (-freedDataOverheadPercent * 0.1 + 1);
  cache.cacheDataBlockPercent += ch;
  cache.cacheDataBlockPercent = cache.cacheDataBlockPercent > 100 ? 100 : cache.cacheDataBlockPercent;
} else {
  // Eviction has almost stopped: reset and cache all data blocks again.
  heavyEvictionCount = 0;
  cache.cacheDataBlockPercent = 100;
}
{code}
{color:#172b4d}How it looks:{color}
{color:#172b4d}!graph.png!{color}
So, when GC is working hard, the feature reduces the percentage of cached
blocks. When eviction drops below the limit, back pressure helps the
percentage come back up:
BlockCache evicted (MB): 4902, overhead (%): 2351, heavy eviction counter: 1,
current caching DataBlock (%): 85 < overhead too high, reduce fast
BlockCache evicted (MB): 5700, overhead (%): 2750, heavy eviction counter: 2,
current caching DataBlock (%): 70
BlockCache evicted (MB): 5930, overhead (%): 2865, heavy eviction counter: 3,
current caching DataBlock (%): 55
BlockCache evicted (MB): 4446, overhead (%): 2123, heavy eviction counter: 4,
current caching DataBlock (%): 40
BlockCache evicted (MB): 3078, overhead (%): 1439, heavy eviction counter: 5,
current caching DataBlock (%): 26
BlockCache evicted (MB): 1710, overhead (%): 755, heavy eviction counter: 6,
current caching DataBlock (%): 19 < reduce gently
BlockCache evicted (MB): 1026, overhead (%): 413, heavy eviction counter: 7,
current caching DataBlock (%): 15
BlockCache evicted (MB): 570, overhead (%): 185, heavy eviction counter: 8,
current caching DataBlock (%): 14 < reduce gently
BlockCache evicted (MB): 342, overhead (%): 71, heavy eviction counter: 9,
current caching DataBlock (%): 14
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 10,
current caching DataBlock (%): 14
BlockCache evicted (MB): 114, overhead (%): -43, heavy eviction counter: 10,
current caching DataBlock (%): 19 < back pressure
BlockCache evicted (MB): 570, overhead (%): 185, heavy eviction counter: 11,
current caching DataBlock (%): 18
BlockCache evicted (MB): 570, overhead (%): 185, heavy eviction counter: 12,
current caching DataBlock (%): 17
BlockCache evicted (MB): 456, overhead (%): 128, heavy eviction counter: 13,
current caching DataBlock (%): 16
BlockCache evicted (MB): 342, overhead (%): 71, heavy eviction counter: 14,
current caching DataBlock (%): 16
BlockCache evicted (MB): 342, overhead (%): 71, heavy eviction counter: 15,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 16,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 17,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 18,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 19,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 20,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 21,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 22,
current caching DataBlock (%): 16
BlockCache evicted (MB): 114, overhead (%): -43, heavy eviction counter: 22,
current caching DataBlock (%): 21 < back pressure
BlockCache evicted (MB): 798, overhead (%): 299, heavy eviction counter: 23,
current caching DataBlock (%): 19
BlockCache evicted (MB): 684, overhead (%): 242, heavy eviction counter: 24,
current caching DataBlock (%): 17
BlockCache evicted (MB): 570, overhead (%): 185, heavy eviction counter: 25,
current caching DataBlock (%): 16
BlockCache evicted (MB): 456, overhead (%): 128, heavy eviction counter: 26,
current caching DataBlock (%): 15
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 27,
current caching DataBlock (%): 15
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 28,
current caching DataBlock (%): 15
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 29,
current caching DataBlock (%): 15
BlockCache evicted (MB): 114, overhead (%): -43, heavy eviction counter: 29,
current caching DataBlock (%): 20 < back pressure
BlockCache evicted (MB): 684, overhead (%): 242, heavy eviction counter: 30,
current caching DataBlock (%): 18
BlockCache evicted (MB): 570, overhead (%): 185, heavy eviction counter: 31,
current caching DataBlock (%): 17
BlockCache evicted (MB): 456, overhead (%): 128, heavy eviction counter: 32,
current caching DataBlock (%): 16
BlockCache evicted (MB): 342, overhead (%): 71, heavy eviction counter: 33,
current caching DataBlock (%): 16
BlockCache evicted (MB): 342, overhead (%): 71, heavy eviction counter: 34,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 35,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 36,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 37,
current caching DataBlock (%): 16
BlockCache evicted (MB): 228, overhead (%): 14, heavy eviction counter: 38,
current caching DataBlock (%): 16
BlockCache evicted (MB): 114, overhead (%): -43, heavy eviction counter: 38,
current caching DataBlock (%): 21
BlockCache evicted (MB): 0, overhead (%): -100, heavy eviction counter: 0,
current caching DataBlock (%): 100 < the test has finished
BlockCache evicted (MB): 0, overhead (%): -100, heavy eviction counter: 0,
current caching DataBlock (%): 100
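The log above can be replayed with a minimal sketch that combines the two code fragments shown earlier. This is not the patch itself: the class name HeavyEvictionSketch and the method step are hypothetical, the 200 MB size limit and the count limit of 0 are inferred from the logged numbers rather than stated in this comment, and the placement of the counter increment is an assumption.

```java
/** Hypothetical stand-alone sketch of the adaptive percent logic; not the HBase patch. */
final class HeavyEvictionSketch {
  // Assumed settings: the size limit is inferred from the log (4902 MB -> 2351 %),
  // the count limit from the fact that the reduction already starts at counter 1.
  final long heavyEvictionMbSizeLimit;
  final int heavyEvictionCountLimit;
  final double heavyEvictionOverheadCoefficient;

  int heavyEvictionCount = 0;
  int cacheDataBlockPercent = 100;

  HeavyEvictionSketch(long mbSizeLimit, int countLimit, double overheadCoefficient) {
    this.heavyEvictionMbSizeLimit = mbSizeLimit;
    this.heavyEvictionCountLimit = countLimit;
    this.heavyEvictionOverheadCoefficient = overheadCoefficient;
  }

  /** Applies one eviction run and returns the new percentage of cached data blocks. */
  int step(long mbFreedSum) {
    int freedDataOverheadPercent = (int) (mbFreedSum * 100 / heavyEvictionMbSizeLimit) - 100;
    if (mbFreedSum > heavyEvictionMbSizeLimit) {
      // Assumption: the counter grows once per run that exceeds the limit.
      heavyEvictionCount++;
      if (heavyEvictionCount > heavyEvictionCountLimit) {
        int ch = (int) (freedDataOverheadPercent * heavyEvictionOverheadCoefficient);
        ch = Math.max(0, Math.min(15, ch));
        cacheDataBlockPercent = Math.max(1, cacheDataBlockPercent - ch);
      }
    } else if (mbFreedSum >= heavyEvictionMbSizeLimit * 0.1) {
      // Back pressure: raise the percentage again while eviction is below the limit.
      int ch = (int) (-freedDataOverheadPercent * 0.1 + 1);
      cacheDataBlockPercent = Math.min(100, cacheDataBlockPercent + ch);
    } else {
      // Eviction has almost stopped: leave heavy-eviction mode.
      heavyEvictionCount = 0;
      cacheDataBlockPercent = 100;
    }
    return cacheDataBlockPercent;
  }

  public static void main(String[] args) {
    HeavyEvictionSketch s = new HeavyEvictionSketch(200, 0, 0.01);
    // Evicted MB values taken from the first twelve log entries.
    long[] evictedMb = {4902, 5700, 5930, 4446, 3078, 1710, 1026, 570, 342, 228, 114, 570};
    for (long mb : evictedMb) {
      int percent = s.step(mb);
      System.out.println("BlockCache evicted (MB): " + mb
          + ", heavy eviction counter: " + s.heavyEvictionCount
          + ", current caching DataBlock (%): " + percent);
    }
  }
}
```

With these assumed settings the sketch yields 85, 70, 55, 40, 26, 19, 15, 14, 14, 14, 19, 18 for the first twelve entries, which matches the logged values, including the first back-pressure step from 14 to 19.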
Compare:
!requests_new2_100p.png!
| |*original*|*feature*|*feature vs. original (%)*|
|tbl1-u (ops/sec)|29,601|41,623|141|
|tbl2-u (ops/sec)|38,793|64,630|167|
|tbl3-u (ops/sec)|38,216|63,120|165|
|tbl4-u (ops/sec)|325|736|226|
|tbl1-z (ops/sec)|46,990|60,651|129|
|tbl2-z (ops/sec)|54,401|72,688|134|
|tbl3-z (ops/sec)|57,100|72,035|126|
|tbl4-z (ops/sec)|452|865|191|
|tbl1-l (ops/sec)|56,001|65,935|118|
|tbl2-l (ops/sec)|68,700|77,704|113|
|tbl3-l (ops/sec)|64,189|74,300|116|
|tbl4-l (ops/sec)|619|1,050|170|
| | | | |
| | | | |
| |*original*|*feature*|*feature vs. original (%)*|
|tbl1-u AverageLatency(us)|1,686|1,199|71|
|tbl2-u AverageLatency(us)|1,287|772|60|
|tbl3-u AverageLatency(us)|1,306|790|60|
|tbl4-u AverageLatency(us)|76,810|33,923|44|
|tbl1-z AverageLatency(us)|1,061|822|77|
|tbl2-z AverageLatency(us)|917|686|75|
|tbl3-z AverageLatency(us)|873|692|79|
|tbl4-z AverageLatency(us)|55,114|28,807|52|
|tbl1-l AverageLatency(us)|890|756|85|
|tbl2-l AverageLatency(us)|726|641|88|
|tbl3-l AverageLatency(us)|777|671|86|
|tbl4-l AverageLatency(us)|40,235|23,712|59|
| | | | |
| | | | |
| |*original*|*feature*|*feature vs. original (%)*|
|tbl1-u 95thPercentileLatency(us)|2,831|2,417|85|
|tbl2-u 95thPercentileLatency(us)|1,266|1,046|83|
|tbl3-u 95thPercentileLatency(us)|1,497|1,140|76|
|tbl4-u 95thPercentileLatency(us)|370,943|45,119|12|
|tbl1-z 95thPercentileLatency(us)|1,784|1,590|89|
|tbl2-z 95thPercentileLatency(us)|918|887|97|
|tbl3-z 95thPercentileLatency(us)|978|939|96|
|tbl4-z 95thPercentileLatency(us)|336,639|44,447|13|
|tbl1-l 95thPercentileLatency(us)|1,523|1,388|91|
|tbl2-l 95thPercentileLatency(us)|820|819|100|
|tbl3-l 95thPercentileLatency(us)|918|890|97|
|tbl4-l 95thPercentileLatency(us)|77,951|43,071|55|
> BlockCache performance improve by reduce eviction rate
> ------------------------------------------------------
>
> Key: HBASE-23887
> URL: https://issues.apache.org/jira/browse/HBASE-23887
> Project: HBase
> Issue Type: Improvement
> Components: BlockCache, Performance
> Reporter: Danil Lipovoy
> Priority: Minor
> Attachments: 1582787018434_rs_metrics.jpg,
> 1582801838065_rs_metrics_new.png, BC_LongRun.png,
> BlockCacheEvictionProcess.gif, cmp.png, evict_BC100_vs_BC23.png,
> eviction_100p.png, eviction_100p.png, eviction_100p.png, gc_100p.png,
> graph.png, read_requests_100pBC_vs_23pBC.png, requests_100p.png,
> requests_100p.png, requests_new2_100p.png, requests_new_100p.png, scan.png
>
>
> Hi!
> This is my first time here, so please correct me if something is wrong.
> I want to propose a way to improve performance when the data in HFiles is
> much larger than the BlockCache (a common situation in big data). The idea
> is to cache only part of the DATA blocks. This helps because LruBlockCache
> then works effectively and a huge amount of GC is saved.
> Sometimes we have more data than can fit into the BlockCache, which causes
> a high rate of evictions. In this case we can skip caching block N and
> instead cache block N+1. We would evict block N quite soon anyway, which is
> why skipping it is good for performance.
> Example:
> Imagine we have a small cache that can fit only 1 block, and we are trying
> to read 3 blocks with offsets:
> 124
> 198
> 223
> Current way: we put block 124, then put 198, evict 124, put 223, evict
> 198. A lot of work (5 actions).
> With the feature: the last two digits of the offset are evenly distributed
> from 0 to 99. When we take the offset modulo 100 we get:
> 124 -> 24
> 198 -> 98
> 223 -> 23
> This lets us partition the blocks. Those below a threshold, for example
> below 50 (if we set *hbase.lru.cache.data.block.percent* = 50), go into the
> cache, and the others are skipped. This means we do not handle block 198 at
> all and save CPU for other work. As a result, we put block 124, then put
> 223, evict 124 (3 actions).
> See the attached picture with the test below. Requests per second are
> higher and GC is lower.
>
> The key point of the code:
> Added the parameter *hbase.lru.cache.data.block.percent*, which defaults to
> 100.
>
> If we set it to 1-99, the following logic applies:
>
>
> {code:java}
> public void cacheBlock(BlockCacheKey cacheKey, Cacheable buf, boolean inMemory) {
>   // Skip a share of DATA blocks, chosen by the last two digits of the offset.
>   if (cacheDataBlockPercent != 100 && buf.getBlockType().isData()) {
>     if (cacheKey.getOffset() % 100 >= cacheDataBlockPercent) {
>       return;
>     }
>   }
>   ...
>   // the same code as usual
> }
> {code}
>
> Other parameters control when this logic is enabled, so that it works only
> while heavy reading is going on:
> hbase.lru.cache.heavy.eviction.count.limit - how many times the eviction
> process has to run before we start to avoid putting data into the BlockCache
> hbase.lru.cache.heavy.eviction.bytes.size.limit - how many bytes have to be
> evicted each time before we start to avoid putting data into the BlockCache
> By default: if more than 10 MB is evicted (each time) 10 times in a row
> (100 seconds), then we start to skip 50% of data blocks.
> When the heavy eviction process ends, the new logic is switched off and all
> blocks are put into the BlockCache again.
>
> Description of the test:
> 4 nodes E5-2698 v4 @ 2.20GHz, 700 GB memory.
> 4 RegionServers
> 4 tables by 64 regions by 1.88 GB data in each = 600 GB total (only FAST_DIFF)
> Total BlockCache size = 48 GB (8% of data in HFiles)
> Random reads in 20 threads
>
> I am going to make a pull request; I hope that is the right way to
> contribute to this cool product.
>
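The offset filter and the 3-block example from the quoted description can be sketched as follows. This is a toy model, not HBase code: the class OffsetFilterSketch, its methods, and the one-slot cache are hypothetical, and only the "offset % 100 >= cacheDataBlockPercent" test comes from the quoted snippet.

```java
/** Hypothetical toy model of the offset-based skip; not HBase code. */
final class OffsetFilterSketch {
  /** True if a DATA block at this offset would be cached under the given percent. */
  static boolean shouldCache(long offset, int cacheDataBlockPercent) {
    return cacheDataBlockPercent == 100 || offset % 100 < cacheDataBlockPercent;
  }

  /** Counts put and evict actions for a cache that can hold exactly one block. */
  static int countActions(long[] offsets, int cacheDataBlockPercent) {
    int actions = 0;
    boolean occupied = false;
    for (long offset : offsets) {
      if (!shouldCache(offset, cacheDataBlockPercent)) {
        continue; // skipped blocks cost no cache work at all
      }
      actions++; // put the new block
      if (occupied) {
        actions++; // evict the block that was there before
      }
      occupied = true;
    }
    return actions;
  }

  public static void main(String[] args) {
    long[] offsets = {124, 198, 223};
    System.out.println("percent=100: " + countActions(offsets, 100) + " actions");
    System.out.println("percent=50:  " + countActions(offsets, 50) + " actions");
  }
}
```

For the offsets 124, 198 and 223 this counts 5 actions when every block is cached and 3 actions with the percent set to 50, the same numbers as in the example from the description.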
--
This message was sent by Atlassian Jira
(v8.3.4#803005)