[
https://issues.apache.org/jira/browse/HBASE-27053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564026#comment-17564026
]
Bryan Beaudreault edited comment on HBASE-27053 at 7/8/22 1:49 AM:
-------------------------------------------------------------------
I've been digging into this. Relaying some of my findings from comments on the
PR:
The problem here is that in HFileBlock#allocateBuffer, we are including space
for the checksums in the unpacked buffer. But then, when we call
HFileBlockDecodingContext#prepareEncoding, we pass in
getUncompressedSizeWithoutHeader() as the uncompressedSize, which does not
include checksum length.
When using BucketCache, all HFileBlock buffers are allocated through the
ByteBuffAllocactor, which pools them for re-use. When a buffer is pulled from
the pool, clear() is called on the buffer, but that doesn't zero out the bytes.
It just resets the pointers and limits. The expectation is that if you pull a
buffer from the pool for a particular capacity, you should write the entire
capacity of that buffer. Since uncompressedSize is smaller than the buffer
capacity, we aren't doing that and end up with junk at the end where the
checksum should be.
This is happening for all HFileBlock buffers, when BucketCache is enabled. But
this typically doesn't matter, because the checksum is not used after reading
from disk. In fact, most usages of the HFileBlock buffer get it through
HFileBlock#getBufferWithoutHeader(), which also excludes the checksum section.
The reason it matters for this case, is because when splits occur its possible
for the split key to be in the middle of a block, resulting in that block being
loaded by both daughter regions on open. This happens in 2 separate threads, so
a race condition occurs. This race condition was mitigated in HBASE-20447, by
validating that the to-be-cached block's contents matches the block contents
that is already in the cache. This is typically the case, but due to the above
allocation bug, the two blocks each have different junk at the end of the
buffer, causing the IOException.
I think the best fix here would be to make HFileBlock#allocateBuffer not
include the checksum bytes in the capacity. The only challenge with that is we
need to then handle the case in getBufferWithoutHeader() so we properly set the
limit. Currently getBufferWithoutHeader() always passes {{false}} into
{{{}getBufferWithoutHeader(boolean withChecksum){}}}. I think this would be a
matter of changing {{false}} to {{{}!isUnpacked(){}}}. Since packed blocks
would always have the checksum available, and unpacked would never.
So my suggestion ends up being a two-line change:
* in HFileBlock#allocateBuffer(), remove {{cksumBytes}} from capacity
* In HFileBlock#getBufferWithoutHeader(), pass {{!isUnpacked()}} instead of
{{{}false{}}}.
We can validate this functionality with a new test in TestHFileBlockUnpack:
{code:java}
@Test
public void itUnpacksIdenticallyEachTime() throws IOException {
Path path = new Path(TEST_UTIL.getDataTestDir(), name.getMethodName());
int totalSize = createTestBlock(path);
// Allocate a bunch of random buffers, so we can be sure that unpack will
only have "dirty"
// buffers to choose from when allocating itself.
Random random = new Random();
byte[] temp = new byte[HConstants.DEFAULT_BLOCKSIZE];
List<ByteBuff> buffs = new ArrayList<>();
for (int i = 0; i < 10; i++) {
ByteBuff buff = allocator.allocate(HConstants.DEFAULT_BLOCKSIZE);
random.nextBytes(temp);
buff.put(temp);
buffs.add(buff);
}
buffs.forEach(ByteBuff::release);
// read the same block twice. we should expect the underlying buffer below
to
// be identical each time
HFileBlock blockOne = readBlock(path, totalSize);
HFileBlock blockTwo = readBlock(path, totalSize);
ByteBuff bufferOne = blockOne.getBufferWithoutHeader(true);
ByteBuff bufferTwo = blockTwo.getBufferWithoutHeader(true);
// This assertion should succeed, but it fails on master. It will succeed
once
// cksumBytes is removed from HFileBlock#allocateBuffer.
assertEquals(0, ByteBuff.compareTo(bufferOne, 0, bufferOne.limit(),
bufferTwo, 0 ,bufferTwo.limit()));
} {code}
This test basically ensures that if you read the same block twice, they get the
exact same underlying buffer. I think this is a good invariant to ensure.
was (Author: bbeaudreault):
I've been digging into this. Relaying some of my findings from comments on the
PR:
The problem here is that in HFileBlock#allocateBuffer, we are including space
for the checksums in the unpacked buffer. But then, when we call
HFileBlockDecodingContext#prepareEncoding, we pass in
getUncompressedSizeWithoutHeader() as the uncompressedSize, which does not
include checksum length. This mismatch causes all unpacked HFileBlock buffers
to have a section of un-set bytes at the end of each buffer.
When using BucketCache, all HFileBlock buffers are allocated through the
ByteBuffAllocactor, which pools them for re-use. When a buffer is pulled from
the pool, clear() is called on the buffer, but that doesn't zero out the bytes.
It just resets the pointers and limits. As a result, you pull a buffer from the
pool which previously had bytes in the section of the buffer that happens to
hold the un-set checksum, you end up with junk in the checksum portion.
This is happening for all HFileBlock buffers, when BucketCache is enabled. But
this typically doesn't matter, because the checksum is not used after reading
from disk. In fact, most usages of the HFileBlock buffer get it through
HFileBlock#getBufferWithoutHeader(), which excludes the checksum section.
The reason it matters for this case, is because when splits occur its possible
for the split key to be in the middle of a block, resulting in that block being
loaded by both daughter regions on open. This happens in 2 separate threads, so
a race condition occurs. This wouldn't be a problem if the 2 blocks were
identical, but in this case they each have different junk at the end of the
buffer, in the checksum area. This race condition was detected and mitigated in
HBASE-20447 by validating the contents of the conflicting block. Since the
checksum contains junk, we are hitting that case where contents differ often
enough to cause frequent IOExceptions on split.
I think the best fix here would be to make HFileBlock#allocateBuffer not
include the checksum bytes in the capacity. The only challenge with that is we
need to then handle the case in getBufferWithoutHeader() so we properly set the
limit. Currently getBufferWithoutHeader() always passes {{false}} into
{{{}getBufferWithoutHeader(boolean withChecksum){}}}. I think this would be a
matter of changing {{false}} to {{{}!isUnpacked(){}}}. Since packed blocks
would always have the checksum available, and unpacked would never.
So my suggestion ends up being a two-line change:
* in HFileBlock#allocateBuffer(), remove {{cksumBytes}} from capacity
* In HFileBlock#getBufferWithoutHeader(), pass {{!isUnpacked()}} instead of
{{{}false{}}}.
We can validate this functionality with a new test in TestHFileBlockUnpack:
{code:java}
@Test
public void itUnpacksIdenticallyEachTime() throws IOException {
Path path = new Path(TEST_UTIL.getDataTestDir(), name.getMethodName());
int totalSize = createTestBlock(path);
// Allocate a bunch of random buffers, so we can be sure that unpack will
only have "dirty"
// buffers to choose from when allocating itself.
Random random = new Random();
byte[] temp = new byte[HConstants.DEFAULT_BLOCKSIZE];
List<ByteBuff> buffs = new ArrayList<>();
for (int i = 0; i < 10; i++) {
ByteBuff buff = allocator.allocate(HConstants.DEFAULT_BLOCKSIZE);
random.nextBytes(temp);
buff.put(temp);
buffs.add(buff);
}
buffs.forEach(ByteBuff::release);
// read the same block twice. we should expect the underlying buffer below
to
// be identical each time
HFileBlock blockOne = readBlock(path, totalSize);
HFileBlock blockTwo = readBlock(path, totalSize);
ByteBuff bufferOne = blockOne.getBufferWithoutHeader(true);
ByteBuff bufferTwo = blockTwo.getBufferWithoutHeader(true);
// This assertion should succeed, but it fails on master. It will succeed
once
// cksumBytes is removed from HFileBlock#allocateBuffer.
assertEquals(0, ByteBuff.compareTo(bufferOne, 0, bufferOne.limit(),
bufferTwo, 0 ,bufferTwo.limit()));
} {code}
This test basically ensures that if you read the same block twice, they get the
exact same underlying buffer. I think this is a good invariant to ensure.
> IOException during caching of uncompressed block to the block cache.
> --------------------------------------------------------------------
>
> Key: HBASE-27053
> URL: https://issues.apache.org/jira/browse/HBASE-27053
> Project: HBase
> Issue Type: Bug
> Components: BlockCache
> Affects Versions: 2.4.12
> Reporter: Sergey Soldatov
> Assignee: Sergey Soldatov
> Priority: Major
>
> When prefetch to block cache is enabled and blocks are compressed sometimes
> caching fails with the exception:
> {noformat}
> 2022-05-18 21:37:29,597 ERROR [RS_OPEN_REGION-regionserver/x1:16020-2]
> regionserver.HRegion: Could not initialize all stores for the
> region=cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7.
> 2022-05-18 21:37:29,598 WARN [RS_OPEN_REGION-regionserver/x1:16020-2]
> regionserver.HRegion: Failed initialize of region=
> cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7.,
> starting to roll back memstore
> java.io.IOException: java.io.IOException: java.lang.RuntimeException: Cached
> block contents differ, which should not have
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
> at
> org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1149)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1092)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:996)
> at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:946)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7240)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegionFromTableDir(HRegion.java:7199)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7175)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7134)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7090)
> at
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler.process(AssignRegionHandler.java:147)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:100)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.lang.RuntimeException: Cached block
> contents differ, which should not have
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
> at
> org.apache.hadoop.hbase.regionserver.StoreEngine.openStoreFiles(StoreEngine.java:294)
> at
> org.apache.hadoop.hbase.regionserver.StoreEngine.initialize(StoreEngine.java:344)
> at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:294)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:6375)
> at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1115)
> at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1112)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ... 3 more
> Caused by: java.lang.RuntimeException: Cached block contents differ, which
> should not have
> happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
> at
> org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.validateBlockAddition(BlockCacheUtil.java:199)
> at
> org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.shouldReplaceExistingCacheBlock(BlockCacheUtil.java:231)
> at
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.shouldReplaceExistingCacheBlock(BucketCache.java:447)
> at
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlockWithWait(BucketCache.java:432)
> at
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlock(BucketCache.java:418)
> at
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:60)
> at
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.lambda$readBlock$2(HFileReaderImpl.java:1319)
> at java.util.Optional.ifPresent(Optional.java:159)
> at
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1317)
> at
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readAndUpdateNewBlock(HFileReaderImpl.java:942)
> at
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:931)
> at
> org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekTo(HalfStoreFileReader.java:171)
> at
> org.apache.hadoop.hbase.io.HalfStoreFileReader.getFirstKey(HalfStoreFileReader.java:321)
> at org.apache.hadoop.hbase.regionserver.HStoreFile.open(HStoreFile.java:477)
> at
> org.apache.hadoop.hbase.regionserver.HStoreFile.initReader(HStoreFile.java:490)
> at
> org.apache.hadoop.hbase.regionserver.StoreEngine.createStoreFileAndReader(StoreEngine.java:231)
> at
> org.apache.hadoop.hbase.regionserver.StoreEngine.lambda$openStoreFiles$0(StoreEngine.java:272)
> ... 6 more
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)