bbeaudreault commented on code in PR #4410:
URL: https://github.com/apache/hbase/pull/4410#discussion_r927747771
##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheFactory.java:
##########
@@ -113,7 +117,17 @@ public static BlockCache createBlockCache(Configuration
conf) {
LOG.warn(
"From HBase 2.0 onwards only combined mode of LRU cache and bucket
cache is available");
}
- return bucketCache == null ? l1Cache : new CombinedBlockCache(l1Cache,
bucketCache);
+
+ if (bucketCache == null) {
+ return l1Cache;
+ }
+
+ if (conf.getBoolean(BLOCKCACHE_VICTIM_HANDLER_ENABLED_KEY,
+ BLOCKCACHE_VICTIM_HANDLER_ENABLED_DEFAULT)) {
+ return new InclusiveCombinedBlockCache(l1Cache, bucketCache);
Review Comment:
Ok that makes sense. I've been looking at the implementation of the
`victimHandler` code in LruBlockCache. A cache miss in L1 which is found in L2
[is promoted back to
L1](https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java#L512-L517).
It seems like that would cause your change to not really improve GC, because
we'd continually have META blocks going back and forth between L1 and L2.
So your results were confusing to me, and I also tested this and my results
were similar to yours. But then I noticed in branch-2.4 (and 2), we don't call
[`L1.getBlock` unless
`L1.containsBlock`](https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.java#L81).
This circumvents the victimHandler code above, explaining the results.
But here in master branch, an [improvement was
made](https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.java#L85-L96)
to remove the unnecessary containsBlock call when we know the BlockType (most
cases). So I think in master branch (or if that improvement is ever ported to
branch-2) we'd probably stop seeing the GC improvement.
So I was thinking about this and without the victim promotion back to L1 we
basically have a case where certain blocks will get into L2 and never get back
to L1. This artificially benefits META blocks which may be read later in the
lifetime of the regionserver, because there is less competition for L1 space
since more blocks will have flushed to L2. This doesn't quite feel like a real
L1/L2 cache setup, instead it feels like a single cache where some blocks
randomly exist on-heap and some off-heap.
So I do agree with you that we need to have a solution for the case where
META blocks are much larger than the possible L1 size, but the current approach
feels like an accidental solution rather than a complete one. A few thoughts I
had:
1. Just disable LRU, use a single unified BucketCache as the L1. Personally
I don't think the BucketCache performs that much worse than LRU. In my tests
it's always an improvement over LRU, and @anoopsjohn's reports from years ago
Alibaba tests show that BucketCache is 30% faster.
2. Use LruAdaptiveBlockCache as the L1, which would help reduce the amount
of churn between the L1 and L2. This algorithm is cool, but comes with many
complicated configurations.
3. Enhance the victim handler code to reduce the likelihood of promoting a
victim back into L1. A good example here might be to inspect the accessCount on
the victim block and only promote after multiple access.
I'm guessing option 3 might be the most generally acceptable solution, if
it's possible. I think the simplest form would check the BlockPriority of the
L2-cached block and only promote when it's BlockPriority.MULTI.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]