[
https://issues.apache.org/jira/browse/HBASE-27053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539859#comment-17539859
]
Sergey Soldatov commented on HBASE-27053:
-----------------------------------------
So, why and when does this happen? The usual scenario is that a region split
has just happened and the RS is trying to open both daughters and load them
into the bucket cache. A single store contains the split point, so two threads
read the same block and both try to store it in the cache. When we decompress
the block, the first thing we do is allocate the space:
[https://github.com/apache/hbase/blob/c7eb30d91015de67fb8207ac1818ce2a29dd60a4/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L649]
Here we allocate space for the uncompressed data *AND* the checksum. But we
fill in only the data; the checksum space is left untouched and may still hold
garbage from a previous use of the buffer. So the block is cached with this
garbage, and that part may well not match when we try to store the same block
from another thread.
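To illustrate the effect described above, here is a minimal, self-contained sketch. It is not HBase code: the class name, the method names, and the 8-byte data / 4-byte checksum sizes are all hypothetical. It only simulates two threads unpacking the same block into buffers that still hold different leftover bytes in the checksum area.

```java
import java.util.Arrays;

public class ChecksumGarbageDemo {
    // Hypothetical sizes, chosen only for the illustration.
    static final int DATA_LEN = 8;
    static final int CHECKSUM_LEN = 4;

    // Simulates a reused buffer that still holds bytes from a previous use.
    static byte[] pooledBuffer(byte leftover) {
        byte[] buf = new byte[DATA_LEN + CHECKSUM_LEN];
        Arrays.fill(buf, leftover);
        return buf;
    }

    // Simulates decompression: only the data part is written,
    // the trailing checksum area is left as it was.
    static byte[] decompressInto(byte[] buf, byte[] data) {
        System.arraycopy(data, 0, buf, 0, data.length);
        return buf;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8};

        // Two "threads" unpack the same block into different reused buffers.
        byte[] first = decompressInto(pooledBuffer((byte) 0x5A), data);
        byte[] second = decompressInto(pooledBuffer((byte) 0x11), data);

        // The data parts are identical ...
        boolean sameData = Arrays.equals(
            Arrays.copyOf(first, DATA_LEN), Arrays.copyOf(second, DATA_LEN));
        // ... but a whole-buffer comparison also sees the leftover bytes.
        boolean sameWhole = Arrays.equals(first, second);

        System.out.println("data equal:   " + sameData);
        System.out.println("buffer equal: " + sameWhole);
    }
}
```

A whole-buffer comparison of this kind is what would report the two copies as different even though the block data is the same.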
Unfortunately, I was unable to create a reasonable unit test for this scenario,
but there are manual steps to reproduce it.
Use a single-node cluster with the following configuration tweaks:
hbase.hregion.memstore.flush.size=1000000
hbase.hregion.max.filesize=10000000
hbase.bucketcache.ioengine=file:/tmp/hbase_cache
hbase.bucketcache.size=200000
hbase.hfile.thread.prefetch=4
And the load is generated by:
{noformat}
hbase org.apache.hadoop.hbase.util.LoadTestTool -compression SNAPPY -write 1:10:100 -num_keys 100000000
{noformat}
Usually, after 700k-1.2m records, the exception from the issue description
appears in the RS log.
So, to solve the problem I would suggest adding code that clears the checksum
space once decompression is complete. It doesn't look like an optimal
solution, but right after decompression we don't know whether the checksum
space will be used or not, so we cannot simply trim the ByteBuff.
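A rough sketch of the suggestion above, using plain java.nio.ByteBuffer rather than HBase's own buffer classes. The class and method names here are hypothetical and this is not the actual patch; it only shows that zeroing the bytes after the data makes two copies of the same block compare equal regardless of what the buffers held before.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ChecksumZeroingSketch {
    // Hypothetical helper: overwrite every byte from dataLen to the
    // end of the buffer with zero, using absolute puts so the
    // buffer's position and limit are not changed.
    static void zeroTail(ByteBuffer buf, int dataLen) {
        for (int i = dataLen; i < buf.capacity(); i++) {
            buf.put(i, (byte) 0);
        }
    }

    // Simulates decompression into a reused buffer, followed by the cleanup.
    static ByteBuffer unpack(ByteBuffer reused, byte[] data) {
        for (int i = 0; i < data.length; i++) {
            reused.put(i, data[i]);
        }
        zeroTail(reused, data.length);
        return reused;
    }

    // Simulates a reused buffer that still holds bytes from a previous use.
    static ByteBuffer dirtyBuffer(int size, byte leftover) {
        byte[] raw = new byte[size];
        Arrays.fill(raw, leftover);
        return ByteBuffer.wrap(raw);
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        int checksumLen = 4; // illustrative size only

        ByteBuffer first = unpack(dirtyBuffer(data.length + checksumLen, (byte) 0x5A), data);
        ByteBuffer second = unpack(dirtyBuffer(data.length + checksumLen, (byte) 0x11), data);

        // With the tail zeroed, the two copies are byte-for-byte identical.
        System.out.println("buffers equal: " + first.equals(second));
    }
}
```

The cost is one extra pass over the checksum area per decompressed block, which is the trade-off mentioned above: the buffer cannot be trimmed at that point, so the unused tail is made deterministic instead.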
> IOException during caching of uncompressed block to the block cache.
> --------------------------------------------------------------------
>
> Key: HBASE-27053
> URL: https://issues.apache.org/jira/browse/HBASE-27053
> Project: HBase
> Issue Type: Bug
> Components: BlockCache
> Affects Versions: 2.4.12
> Reporter: Sergey Soldatov
> Assignee: Sergey Soldatov
> Priority: Major
>
> When prefetch to block cache is enabled and blocks are compressed, caching
> sometimes fails with the exception:
> {noformat}
> 2022-05-18 21:37:29,597 ERROR [RS_OPEN_REGION-regionserver/x1:16020-2] regionserver.HRegion: Could not initialize all stores for the region=cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7.
> 2022-05-18 21:37:29,598 WARN [RS_OPEN_REGION-regionserver/x1:16020-2] regionserver.HRegion: Failed initialize of region= cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7., starting to roll back memstore
> java.io.IOException: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>     at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1149)
>     at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1092)
>     at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:996)
>     at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:946)
>     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7240)
>     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegionFromTableDir(HRegion.java:7199)
>     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7175)
>     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7134)
>     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7090)
>     at org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler.process(AssignRegionHandler.java:147)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:100)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>     at org.apache.hadoop.hbase.regionserver.StoreEngine.openStoreFiles(StoreEngine.java:294)
>     at org.apache.hadoop.hbase.regionserver.StoreEngine.initialize(StoreEngine.java:344)
>     at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:294)
>     at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:6375)
>     at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1115)
>     at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1112)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     ... 3 more
> Caused by: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>     at org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.validateBlockAddition(BlockCacheUtil.java:199)
>     at org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.shouldReplaceExistingCacheBlock(BlockCacheUtil.java:231)
>     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.shouldReplaceExistingCacheBlock(BucketCache.java:447)
>     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlockWithWait(BucketCache.java:432)
>     at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlock(BucketCache.java:418)
>     at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:60)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.lambda$readBlock$2(HFileReaderImpl.java:1319)
>     at java.util.Optional.ifPresent(Optional.java:159)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1317)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readAndUpdateNewBlock(HFileReaderImpl.java:942)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:931)
>     at org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekTo(HalfStoreFileReader.java:171)
>     at org.apache.hadoop.hbase.io.HalfStoreFileReader.getFirstKey(HalfStoreFileReader.java:321)
>     at org.apache.hadoop.hbase.regionserver.HStoreFile.open(HStoreFile.java:477)
>     at org.apache.hadoop.hbase.regionserver.HStoreFile.initReader(HStoreFile.java:490)
>     at org.apache.hadoop.hbase.regionserver.StoreEngine.createStoreFileAndReader(StoreEngine.java:231)
>     at org.apache.hadoop.hbase.regionserver.StoreEngine.lambda$openStoreFiles$0(StoreEngine.java:272)
>     ... 6 more
> {noformat}