[ 
https://issues.apache.org/jira/browse/HBASE-27053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564026#comment-17564026
 ] 

Bryan Beaudreault edited comment on HBASE-27053 at 7/8/22 1:49 AM:
-------------------------------------------------------------------

I've been digging into this. Relaying some of my findings from comments on the 
PR:

The problem here is that in HFileBlock#allocateBuffer, we are including space 
for the checksums in the unpacked buffer. But then, when we call 
HFileBlockDecodingContext#prepareEncoding, we pass in 
getUncompressedSizeWithoutHeader() as the uncompressedSize, which does not 
include checksum length. 

When using BucketCache, all HFileBlock buffers are allocated through the 
ByteBuffAllocactor, which pools them for re-use. When a buffer is pulled from 
the pool, clear() is called on the buffer, but that doesn't zero out the bytes. 
It just resets the pointers and limits. The expectation is that if you pull a 
buffer from the pool for a particular capacity, you should write the entire 
capacity of that buffer. Since uncompressedSize is smaller than the buffer 
capacity, we aren't doing that and end up with junk at the end where the 
checksum should be.

This is happening for all HFileBlock buffers, when BucketCache is enabled. But 
this typically doesn't matter, because the checksum is not used after reading 
from disk. In fact, most usages of the HFileBlock buffer get it through 
HFileBlock#getBufferWithoutHeader(), which also excludes the checksum section. 

The reason it matters for this case, is because when splits occur its possible 
for the split key to be in the middle of a block, resulting in that block being 
loaded by both daughter regions on open. This happens in 2 separate threads, so 
a race condition occurs. This race condition was mitigated in HBASE-20447, by 
validating that the to-be-cached block's contents matches the block contents 
that is already in the cache. This is typically the case, but due to the above 
allocation bug, the two blocks each have different junk at the end of the 
buffer, causing the IOException.

I think the best fix here would be to make HFileBlock#allocateBuffer not 
include the checksum bytes in the capacity. The only challenge with that is we 
need to then handle the case in getBufferWithoutHeader() so we properly set the 
limit. Currently getBufferWithoutHeader() always passes {{false}} into 
{{{}getBufferWithoutHeader(boolean withChecksum){}}}. I think this would be a 
matter of changing {{false}} to {{{}!isUnpacked(){}}}. Since packed blocks 
would always have the checksum available, and unpacked would never.

So my suggestion ends up being a two-line change:
 * in HFileBlock#allocateBuffer(), remove {{cksumBytes}} from capacity
 * In HFileBlock#getBufferWithoutHeader(), pass {{!isUnpacked()}} instead of 
{{{}false{}}}.

We can validate this functionality with a new test in TestHFileBlockUnpack:
{code:java}
  @Test
  public void itUnpacksIdenticallyEachTime() throws IOException {
    Path path = new Path(TEST_UTIL.getDataTestDir(), name.getMethodName());
    int totalSize = createTestBlock(path);

    // Allocate a bunch of random buffers, so we can be sure that unpack will 
only have "dirty"
    // buffers to choose from when allocating itself.
    Random random = new Random();
    byte[] temp = new byte[HConstants.DEFAULT_BLOCKSIZE];
    List<ByteBuff> buffs = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      ByteBuff buff = allocator.allocate(HConstants.DEFAULT_BLOCKSIZE);
      random.nextBytes(temp);
      buff.put(temp);
      buffs.add(buff);
    }

    buffs.forEach(ByteBuff::release);

    // read the same block twice. we should expect the underlying buffer below 
to
    // be identical each time
    HFileBlock blockOne = readBlock(path, totalSize);
    HFileBlock blockTwo = readBlock(path, totalSize);

    ByteBuff bufferOne = blockOne.getBufferWithoutHeader(true);
    ByteBuff bufferTwo = blockTwo.getBufferWithoutHeader(true);

    // This assertion should succeed, but it fails on master. It will succeed 
once
    // cksumBytes is removed from HFileBlock#allocateBuffer.
    assertEquals(0, ByteBuff.compareTo(bufferOne, 0, bufferOne.limit(), 
bufferTwo, 0 ,bufferTwo.limit()));
  } {code}
This test basically ensures that if you read the same block twice, they get the 
exact same underlying buffer. I think this is a good invariant to ensure.


was (Author: bbeaudreault):
I've been digging into this. Relaying some of my findings from comments on the 
PR:

The problem here is that in HFileBlock#allocateBuffer, we are including space 
for the checksums in the unpacked buffer. But then, when we call 
HFileBlockDecodingContext#prepareEncoding, we pass in 
getUncompressedSizeWithoutHeader() as the uncompressedSize, which does not 
include checksum length. This mismatch causes all unpacked HFileBlock buffers 
to have a section of un-set bytes at the end of each buffer.

When using BucketCache, all HFileBlock buffers are allocated through the 
ByteBuffAllocactor, which pools them for re-use. When a buffer is pulled from 
the pool, clear() is called on the buffer, but that doesn't zero out the bytes. 
It just resets the pointers and limits. As a result, you pull a buffer from the 
pool which previously had bytes in the section of the buffer that happens to 
hold the un-set checksum, you end up with junk in the checksum portion.

This is happening for all HFileBlock buffers, when BucketCache is enabled. But 
this typically doesn't matter, because the checksum is not used after reading 
from disk. In fact, most usages of the HFileBlock buffer get it through 
HFileBlock#getBufferWithoutHeader(), which excludes the checksum section. 

The reason it matters for this case, is because when splits occur its possible 
for the split key to be in the middle of a block, resulting in that block being 
loaded by both daughter regions on open. This happens in 2 separate threads, so 
a race condition occurs. This wouldn't be a problem if the 2 blocks were 
identical, but in this case they each have different junk at the end of the 
buffer, in the checksum area. This race condition was detected and mitigated in 
HBASE-20447 by validating the contents of the conflicting block. Since the 
checksum contains junk, we are hitting that case where contents differ often 
enough to cause frequent IOExceptions on split.

I think the best fix here would be to make HFileBlock#allocateBuffer not 
include the checksum bytes in the capacity. The only challenge with that is we 
need to then handle the case in getBufferWithoutHeader() so we properly set the 
limit. Currently getBufferWithoutHeader() always passes {{false}} into 
{{{}getBufferWithoutHeader(boolean withChecksum){}}}. I think this would be a 
matter of changing {{false}} to {{{}!isUnpacked(){}}}. Since packed blocks 
would always have the checksum available, and unpacked would never.

So my suggestion ends up being a two-line change:
 * in HFileBlock#allocateBuffer(), remove {{cksumBytes}} from capacity
 * In HFileBlock#getBufferWithoutHeader(), pass {{!isUnpacked()}} instead of 
{{{}false{}}}.

We can validate this functionality with a new test in TestHFileBlockUnpack:
{code:java}
  @Test
  public void itUnpacksIdenticallyEachTime() throws IOException {
    Path path = new Path(TEST_UTIL.getDataTestDir(), name.getMethodName());
    int totalSize = createTestBlock(path);

    // Allocate a bunch of random buffers, so we can be sure that unpack will 
only have "dirty"
    // buffers to choose from when allocating itself.
    Random random = new Random();
    byte[] temp = new byte[HConstants.DEFAULT_BLOCKSIZE];
    List<ByteBuff> buffs = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      ByteBuff buff = allocator.allocate(HConstants.DEFAULT_BLOCKSIZE);
      random.nextBytes(temp);
      buff.put(temp);
      buffs.add(buff);
    }

    buffs.forEach(ByteBuff::release);

    // read the same block twice. we should expect the underlying buffer below 
to
    // be identical each time
    HFileBlock blockOne = readBlock(path, totalSize);
    HFileBlock blockTwo = readBlock(path, totalSize);

    ByteBuff bufferOne = blockOne.getBufferWithoutHeader(true);
    ByteBuff bufferTwo = blockTwo.getBufferWithoutHeader(true);

    // This assertion should succeed, but it fails on master. It will succeed 
once
    // cksumBytes is removed from HFileBlock#allocateBuffer.
    assertEquals(0, ByteBuff.compareTo(bufferOne, 0, bufferOne.limit(), 
bufferTwo, 0 ,bufferTwo.limit()));
  } {code}
This test basically ensures that if you read the same block twice, they get the 
exact same underlying buffer. I think this is a good invariant to ensure.

> IOException during caching of uncompressed block to the block cache.
> --------------------------------------------------------------------
>
>                 Key: HBASE-27053
>                 URL: https://issues.apache.org/jira/browse/HBASE-27053
>             Project: HBase
>          Issue Type: Bug
>          Components: BlockCache
>    Affects Versions: 2.4.12
>            Reporter: Sergey Soldatov
>            Assignee: Sergey Soldatov
>            Priority: Major
>
> When prefetch to block cache is enabled and blocks are compressed sometimes 
> caching fails with the exception:
> {noformat}
> 2022-05-18 21:37:29,597 ERROR [RS_OPEN_REGION-regionserver/x1:16020-2] 
> regionserver.HRegion: Could not initialize all stores for the 
> region=cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7.
> 2022-05-18 21:37:29,598 WARN  [RS_OPEN_REGION-regionserver/x1:16020-2] 
> regionserver.HRegion: Failed initialize of region= 
> cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7., 
> starting to roll back memstore
> java.io.IOException: java.io.IOException: java.lang.RuntimeException: Cached 
> block contents differ, which should not have 
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1149)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1092)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:996)
>   at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:946)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7240)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegionFromTableDir(HRegion.java:7199)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7175)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7134)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7090)
>   at 
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler.process(AssignRegionHandler.java:147)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:100)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.lang.RuntimeException: Cached block 
> contents differ, which should not have 
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.openStoreFiles(StoreEngine.java:294)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.initialize(StoreEngine.java:344)
>   at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:294)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:6375)
>   at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1115)
>   at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1112)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.lang.RuntimeException: Cached block contents differ, which 
> should not have 
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at 
> org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.validateBlockAddition(BlockCacheUtil.java:199)
>   at 
> org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.shouldReplaceExistingCacheBlock(BlockCacheUtil.java:231)
>   at 
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.shouldReplaceExistingCacheBlock(BucketCache.java:447)
>   at 
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlockWithWait(BucketCache.java:432)
>   at 
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlock(BucketCache.java:418)
>   at 
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:60)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.lambda$readBlock$2(HFileReaderImpl.java:1319)
>   at java.util.Optional.ifPresent(Optional.java:159)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1317)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readAndUpdateNewBlock(HFileReaderImpl.java:942)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:931)
>   at 
> org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekTo(HalfStoreFileReader.java:171)
>   at 
> org.apache.hadoop.hbase.io.HalfStoreFileReader.getFirstKey(HalfStoreFileReader.java:321)
>   at org.apache.hadoop.hbase.regionserver.HStoreFile.open(HStoreFile.java:477)
>   at 
> org.apache.hadoop.hbase.regionserver.HStoreFile.initReader(HStoreFile.java:490)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.createStoreFileAndReader(StoreEngine.java:231)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreEngine.lambda$openStoreFiles$0(StoreEngine.java:272)
>   ... 6 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to