[jira] [Commented] (HBASE-27053) IOException during caching of uncompressed block to the block cache.

Sergey Soldatov (Jira) Thu, 19 May 2022 17:11:05 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-27053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539859#comment-17539859
 ]


Sergey Soldatov commented on HBASE-27053:
-----------------------------------------

So, why and when did it happen. The usual scenario is when a region split has 
just happened and RS is trying to open both daughters and load those to the 
bucket cache. There is a single store that has the split point, so two threads 
are reading the same block and trying to store it in the cache. When we 
decompress the block, the first thing we are doing is the space allocation:

[https://github.com/apache/hbase/blob/c7eb30d91015de67fb8207ac1818ce2a29dd60a4/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L649]

Here we allocate the space for uncompressed data *AND* checksum. But we fill 
the data only, without filling the checksum space which actually might be 
filled by garbage from the previous usage. So the block is cached with this 
garbage and obviously, that part might and would not match when we try to store 
the same block from another thread. 

Unfortunately, I was unable to create a reasonable unit test for this scenario, 
but there is the manual steps to reproduce:

a single node cluster with following configuration tweaks:

hbase.hregion.memstore.flush.size=1000000 

hbase.hregion.max.filesize=10000000 

hbase.bucketcache.ioengine=file:/tmp/hbase_cache

hbase.bucketcache.size=200000

hbase.hfile.thread.prefetch=4

 

And the load is generated by:
{noformat}

hbase org.apache.hadoop.hbase.util.LoadTestTool -compression SNAPPY -write 
1:10:100 -num_keys 100000000 {noformat}
Usually, after 700k-1.2m records, the earlier mentioned exception appears in 
the RS log.

So, to solve the problem I would suggest adding a code that cleans up the 
checksum space when the decompression is completed. It doesn't look like an 
optimal solution, but right after decompression, we don't know whether the 
checksum space will be used or not, so we could not just trim the bytebuff. 

> IOException during caching of uncompressed block to the block cache.
> --------------------------------------------------------------------
>
>                 Key: HBASE-27053
>                 URL: https://issues.apache.org/jira/browse/HBASE-27053
>             Project: HBase
>          Issue Type: Bug
>          Components: BlockCache
>    Affects Versions: 2.4.12
>            Reporter: Sergey Soldatov
>            Assignee: Sergey Soldatov
>            Priority: Major
>
> When prefetch to block cache is enabled and blocks are compressed sometimes 
> caching fails with the exception:
> {noformat}
> 2022-05-18 21:37:29,597 ERROR [RS_OPEN_REGION-regionserver/x1:16020-2] 
> regionserver.HRegion: Could not initialize all stores for the 
> region=cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7.
> 2022-05-18 21:37:29,598 WARN  [RS_OPEN_REGION-regionserver/x1:16020-2] 
> regionserver.HRegion: Failed initialize of region= 
> cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7., 
> starting to roll back memstore
> java.io.IOException: java.io.IOException: java.lang.RuntimeException: Cached 
> block contents differ, which should not have 
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1149)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1092)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:996)
>   at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:946)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7240)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegionFromTableDir(HRegion.java:7199)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7175)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7134)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7090)
>   at 
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler.process(AssignRegionHandler.java:147)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:100)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.lang.RuntimeException: Cached block 
> contents differ, which should not have 
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.openStoreFiles(StoreEngine.java:294)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.initialize(StoreEngine.java:344)
>   at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:294)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:6375)
>   at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1115)
>   at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1112)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.lang.RuntimeException: Cached block contents differ, which 
> should not have 
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at 
> org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.validateBlockAddition(BlockCacheUtil.java:199)
>   at 
> org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.shouldReplaceExistingCacheBlock(BlockCacheUtil.java:231)
>   at 
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.shouldReplaceExistingCacheBlock(BucketCache.java:447)
>   at 
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlockWithWait(BucketCache.java:432)
>   at 
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlock(BucketCache.java:418)
>   at 
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:60)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.lambda$readBlock$2(HFileReaderImpl.java:1319)
>   at java.util.Optional.ifPresent(Optional.java:159)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1317)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readAndUpdateNewBlock(HFileReaderImpl.java:942)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:931)
>   at 
> org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekTo(HalfStoreFileReader.java:171)
>   at 
> org.apache.hadoop.hbase.io.HalfStoreFileReader.getFirstKey(HalfStoreFileReader.java:321)
>   at org.apache.hadoop.hbase.regionserver.HStoreFile.open(HStoreFile.java:477)
>   at 
> org.apache.hadoop.hbase.regionserver.HStoreFile.initReader(HStoreFile.java:490)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.createStoreFileAndReader(StoreEngine.java:231)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.lambda$openStoreFiles$0(StoreEngine.java:272)
>   ... 6 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (HBASE-27053) IOException during caching of uncompressed block to the block cache.

Reply via email to