[
https://issues.apache.org/jira/browse/HBASE-26938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520016#comment-17520016
]
Andrew Kyle Purtell edited comment on HBASE-26938 at 4/9/22 7:01 PM:
---------------------------------------------------------------------
I have a draft PR open with the mentioned changes:
https://github.com/apache/hbase/pull/4334
Of the changes so far in the draft PR, the one worth keeping is the throw of
IllegalStateException when we detect that 'writer' is not null when it should
be. That change reduces the number of errors and improves the performance of
the test scenario quite a bit but does not address the root cause.
I think we are allowing concurrent compactions on the same store, that is the
issue.
If I use a StoreEngine write lock around the entire compaction, everything is
fine. Well, fine in that there are no compaction errors, not fine in that any
flush etc is also locked out for the duration of the compaction.
There is an interesting pattern to the ERRORs.
First, "writer exists when it should not"
store=05b67410742d42c6b2eee4fbcb6d3e79, thread=shortCompactions-9
Then, "Compaction failed"
store=05b67410742d42c6b2eee4fbcb6d3e79, thread=shortCompactions-9,
startTime=1649530250096
Then, another "Compaction failed"
store=05b67410742d42c6b2eee4fbcb6d3e79, thread=longCompactions-1,
startTime=1649530230724
Same store. Different thread names at each round of course, but the pattern of
more than one compaction with different startTime on the same store repeats.
{noformat}
2022-04-09T18:50:50,120 ERROR
[regionserver/ip-172-31-49-8:8120-shortCompactions-9] compactions.Compactor:
Writer exists when it should not: {
hdfs://ip-172-31-58-47.us-west-2.compute.internal:8020/hbase/data/default/IntegrationTestLoadCommonCrawl/05b67410742d42c6b2eee4fbcb6d3e79/.tmp/c/c8f82d828cec43ff80ef4c3d90936a0e
}
2022-04-09T18:50:50,127 ERROR
[regionserver/ip-172-31-49-8:8120-shortCompactions-9]
regionserver.CompactSplit: Compaction failed
region=IntegrationTestLoadCommonCrawl,com.bagf,1649529967159.05b67410742d42c6b2eee4fbcb6d3e79.,
storeName=05b67410742d42c6b2eee4fbcb6d3e79/c, priority=-3,
startTime=1649530250096
2022-04-09T18:51:02,842 ERROR
[regionserver/ip-172-31-49-8:8120-longCompactions-1] regionserver.CompactSplit:
Compaction failed
region=IntegrationTestLoadCommonCrawl,com.bagf,1649529967159.05b67410742d42c6b2eee4fbcb6d3e79.,
storeName=05b67410742d42c6b2eee4fbcb6d3e79/c, priority=-3,
startTime=1649530230724{noformat}
etc.
was (Author: apurtell):
I have a draft PR open with the mentioned changes:
https://github.com/apache/hbase/pull/4334
Of the changes so far in the draft PR, the one worth keeping is the throw of
IllegalStateException when we detect that 'writer' is not null when it should
be. That change reduces the number of errors and improves the performance of
the test scenario quite a bit but does not address the root cause.
Compactor instances are being inappropriately shared or reused. The "writer
exists when it should not" error is logged and that compaction is failed
immediately, but another compaction in another thread also fails, typically.
There is an interesting pattern to the ERRORs.
First, "writer exists when it should not"
store=05b67410742d42c6b2eee4fbcb6d3e79, thread=shortCompactions-9
Then, "Compaction failed"
store=05b67410742d42c6b2eee4fbcb6d3e79, thread=shortCompactions-9,
startTime=1649530250096
Then, another "Compaction failed"
store=05b67410742d42c6b2eee4fbcb6d3e79, thread=longCompactions-1,
startTime=1649530230724
Same store. Different thread names at each round of course, but the pattern of
more than one compaction with different startTime on the same store repeats.
{noformat}
2022-04-09T18:50:50,120 ERROR
[regionserver/ip-172-31-49-8:8120-shortCompactions-9] compactions.Compactor:
Writer exists when it should not: {
hdfs://ip-172-31-58-47.us-west-2.compute.internal:8020/hbase/data/default/IntegrationTestLoadCommonCrawl/05b67410742d42c6b2eee4fbcb6d3e79/.tmp/c/c8f82d828cec43ff80ef4c3d90936a0e
}
2022-04-09T18:50:50,127 ERROR
[regionserver/ip-172-31-49-8:8120-shortCompactions-9]
regionserver.CompactSplit: Compaction failed
region=IntegrationTestLoadCommonCrawl,com.bagf,1649529967159.05b67410742d42c6b2eee4fbcb6d3e79.,
storeName=05b67410742d42c6b2eee4fbcb6d3e79/c, priority=-3,
startTime=1649530250096
2022-04-09T18:51:02,842 ERROR
[regionserver/ip-172-31-49-8:8120-longCompactions-1] regionserver.CompactSplit:
Compaction failed
region=IntegrationTestLoadCommonCrawl,com.bagf,1649529967159.05b67410742d42c6b2eee4fbcb6d3e79.,
storeName=05b67410742d42c6b2eee4fbcb6d3e79/c, priority=-3,
startTime=1649530230724{noformat}
etc.
> Compaction failures after StoreFileTracker integration (branch-2, branch-2.5)
> -----------------------------------------------------------------------------
>
> Key: HBASE-26938
> URL: https://issues.apache.org/jira/browse/HBASE-26938
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.5.0, 2.6.0
> Reporter: Andrew Kyle Purtell
> Priority: Blocker
> Fix For: 2.5.0
>
>
> [ Currently this has only been tested with branch-2.5 and branch-2. Testing
> with master next, will update afterward. ]
> Test cluster of 10 regionservers is configured each RS with 5 flush threads,
> 5 large compaction threads, and 10 small compaction threads.
> Hadoop is 3.3.2. Java is 11. HFiles are on HDFS.
> All the StoreFileTracker implementations, DEFAULT or FILE, exhibit compaction
> time store writer errors in an ingest heavy use case. Unit tests don't seem
> to cover whatever this is. Most compactions succeed, but some do not. Those
> that do not are failing with state or sanity check assertions. Below errors
> are all from DEFAULT. They seem related... store writer instance
> usage/close/locking issues during compactions.
> Warnings like "writer exists when it should not":
> {noformat}
> 2022-04-07T23:13:11,351 WARN
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8]
> compactions.Compactor: Writer exists when it should not: {
>
> hdfs://ip-172-31-58-47.us-west-2.compute.internal:8020/hbase/data/default/IntegrationTestLoadCommonCrawl/b518f72941d4427e7e1923407643df67/.tmp/c/29d7b88c4c214ddcbba4f747514a2cf5
> }
> {noformat}
> Errors like:
> IllegalStateException thrown from
> HFileBlockIndex$BlockIndexWriter.shouldWriteBlock:
> {noformat}
> 2022-04-07T23:13:11,508 ERROR
> [regionserver/ip-172-31-63-83:8120-shortCompactions-6]
> regionserver.CompactSplit: Compaction failed
> region=IntegrationTestLoadCommonCrawl,,1649373172576.b518f72941d4427e7e1923407643df67.,
> storeName=b518f72941d4427e7e1923407643df67/c, priority=10,
> startTime=1649373185476
> java.lang.IllegalStateException: curInlineChunk is null; has shouldWriteBlock
> been called with closing=true and then called again?
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.shouldWriteBlock(HFileBlockIndex.java:1258)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.writeInlineBlocks(HFileWriterImpl.java:523)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:608)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.abortWriter(DefaultCompactor.java:84)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.abortWriter(DefaultCompactor.java:76)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:384)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> {noformat}
> and IllegalStateException thrown from HFileBlock$Writer.expectState:
> {noformat}
> 2022-04-07T23:13:11,559 ERROR
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8]
> regionserver.CompactSplit: Compaction failed
> region=IntegrationTestLoadCommonCrawl,,1649373172576.b518f72941d4427e7e1923407643df67.,
> storeName=b518f72941d4427e7e1923407643df67/c, priority=0,
> startTime=1649373191325
> java.lang.IllegalStateException: Expected state: BLOCK_READY, actual state:
> WRITING
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.expectState(HFileBlock.java:1190)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.getOnDiskSizeWithHeader(HFileBlock.java:1106)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishBlock(HFileWriterImpl.java:346)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.checkBlockBoundary(HFileWriterImpl.java:327)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.append(HFileWriterImpl.java:739)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:301)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:456)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:364)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1138)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2392)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:656)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:702)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> {noformat}
> and the dreaded "added a key not lexically larger than previous":
> {noformat}
> 2022-04-07T23:13:14,715 ERROR
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8]
> regionserver.CompactSplit: Compaction failed
> region=IntegrationTestLoadCommonCrawl,de.bao,1649373172576.083eb1ede8bf8c82174f614b06d4741e.,
> storeName=083eb1ede8bf8c82174f614b06d4741e/c, priority=-1,
> startTime=1649373194643
> java.io.IOException: Added a key not lexically larger than previous.
> ...
> at
> org.apache.hadoop.hbase.util.BloomContext.sanityCheck(BloomContext.java:63)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.util.BloomContext.writeBloom(BloomContext.java:54)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.appendGeneralBloomfilter(StoreFileWriter.java:280)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:299)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:456)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:364)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1138)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2392)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:656)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:702)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> {noformat}
> and another bloomfilter addition sanity check failure:
> {noformat}
> 2022-04-07T23:23:44,001 ERROR
> [regionserver/ip-172-31-49-8:8120-longCompactions-4]
> regionserver.CompactSplit: Compaction failed
> region=IntegrationTestLoadCommonCrawl,com.fc2.blog.idp,1649373775847.9d960338d85e2bb9c68669fad6a89f73.,
> storeName=9d960338d85e2bb9c68669fad6a89f73/c, priority=-1,
> startTime=1649373823967
> java.lang.IllegalStateException: First key in chunk already set:
> com.filmestipo|/film/1335-algures-hoje-a-noite|1649373441720
> at
> org.apache.hadoop.hbase.io.hfile.CompoundBloomFilterWriter.append(CompoundBloomFilterWriter.java:174)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.util.BloomContext.writeBloom(BloomContext.java:55)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.appendGeneralBloomfilter(StoreFileWriter.java:280)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:299)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:456)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:364)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1138)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2392)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:656)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:702)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> {noformat}
> and a NPE in HFileBlock$Writer.getEncodingState:
> {noformat}
> 2022-04-07T23:17:26,373 ERROR
> [regionserver/ip-172-31-63-83:8120-shortCompactions-6]
> regionserver.CompactSplit: Compaction failed
> region=IntegrationTestLoadCommonCrawl,com.fc2.blog.idp,1649373315303.2bd16fac0c088fa135a855f0165f1dba.,
> storeName=2bd16fac0c088fa135a855f0165f1dba/c, priority=10,
> startTime=1649373438988
> java.lang.NullPointerException: null
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.getEncodingState(HFileBlock.java:837)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.beforeShipped(HFileBlock.java:831)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.beforeShipped(HFileWriterImpl.java:770)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.beforeShipped(StoreFileWriter.java:309)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:515)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:364)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1138)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2392)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:656)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:702)
> ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)