Thanks for the response!
On Thu, Feb 15, 2018 at 11:52 PM, Stack <[email protected]> wrote: > On Wed, Feb 14, 2018 at 3:19 PM, Zach York <[email protected]> > wrote: > > > Hello, > > > > I have been trying for some time to figure out an issue where splits fail > > due to trying to cache the same key, but different value. > > > Whats the exception look like Mighty Zach? > 2018-01-12 16:45:17,263 INFO [regionserver/ip-172-31-46-118.us-west-2.compute.internal/172.31.46.118:16020-splits-1515775482373] regionserver.SplitRequest: Running rollback/cleanup of failed split of test,9hp9mum6k2r9j19dbvax2s35f0eaoxzb8nqwz6*024566245231038769,1510040627287.6d1496d3c9ac6ebc09b02b7b959e5c5f.; Failed <ip>,16020,1515773679237-daughterOpener=c5ff63db3b59a7c8ebe7491811a28660 java.io.IOException: Failed <ip>,16020,1515773679237-daughterOpener=c5ff63db3b59a7c8ebe7491811a28660 at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.openDaughters(SplitTransactionImpl.java:504) at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.stepsAfterPONR(SplitTransactionImpl.java:597) at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.execute(SplitTransactionImpl.java:580) at org.apache.hadoop.hbase.regionserver.SplitRequest.doSplitting(SplitRequest.java:82) at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:test=c46e62f66866b1df8d7e18c8551dbec3-3908e0f20ad3410aadf4e4e8809f90f4.6d1496d3c9ac6ebc09b02b7b959e5c5f_131435429 at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:954) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:829) at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:804) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6717) at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.openDaughterRegion(SplitTransactionImpl.java:731) at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl$DaughterOpener.run(SplitTransactionImpl.java:711) ... 1 more Caused by: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:test=c46e62f66866b1df8d7e18c8551dbec3-3908e0f20ad3410aadf4e4e8809f90f4.6d1496d3c9ac6ebc09b02b7b959e5c5f_131435429 at org.apache.hadoop.hbase.regionserver.HStore.openStoreFiles(HStore.java:560) at org.apache.hadoop.hbase.regionserver.HStore.loadStoreFiles(HStore.java:514) at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:277) at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5194) at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:928) at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:925) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 more Caused by: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:test=c46e62f66866b1df8d7e18c8551dbec3-3908e0f20ad3410aadf4e4e8809f90f4.6d1496d3c9ac6ebc09b02b7b959e5c5f_131435429 at org.apache.hadoop.hbase.io.hfile.LruBlockCache.cacheBlock(LruBlockCache.java:368) at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:66) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:462) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:601) at org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekTo(HalfStoreFileReader.java:190) at org.apache.hadoop.hbase.io.HalfStoreFileReader.getFirstKey(HalfStoreFileReader.java:361) at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:511) at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:528) at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:518) at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:667) at org.apache.hadoop.hbase.regionserver.HStore.access$000(HStore.java:119) at org.apache.hadoop.hbase.regionserver.HStore$1.call(HStore.java:534) at org.apache.hadoop.hbase.regionserver.HStore$1.call(HStore.java:531) ... 6 more > > ... > > > > What I know/have observed: > > This occurs when prefetch on open is true > > Both daugher store openers will try to access and cache the first and > last > > values in the store (HalfStoreFile). [1] [2] [3] > > Both daugher store openers will access the same key (the splitkey, I > > believe). > > > > > The lastkey in one daughter will be the firstkey in the other? > > > I'm still trying to figure out the logic, but it looks like in getLastKey() it seeks to the split key and then backs up if it needs to. [1] passes the split key into seekBefore() which then goes into [2] where it tries to seek to that block and then it has logic to get the previous block if necessary. [1] https://github.com/apache/hbase/blob/rel/1.3.1/hbase-server/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java#L307 [2] https://github.com/apache/hbase/blob/rel/1.3.1/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java#L668 > > > > * However they will access two different paths and retrieve two different > > values. > > > > Different values? Or is that the same path (the half store files resolve to > same backing storefile) but different values somehow? > > > It is the same path according to the fsBlockReader. I'm not sure if the half files would have resolved at that time or not. See these two log lines that I printed out (where thread is the thread ID): Note the different paths in the fsBlockReader and different thread ids: 2018-02-20 22:41:27,741 WARN [StoreFileOpenerThread-discrete-1] hfile.HFileReaderV2: concurrent: 0, thread: 3598, dataBlockOffset: 488903844 onDiskBlockSize: 97961, pread: true, fsBlockReader: hfs=org.apache.hadoop.hbase.fs.HFileSystem@d65b38e, path=s3://<bucket>/hbase/data/default/test-1/feade760d02a57f5a99acfb6bf78e0ff/cf/2d0851b27c0143a1901e253c59d44129.b59b89676a973eede358ef5e7897b4d4, fileContext=HFileContext [ usesHBaseChecksum=true checksumType=CRC32C bytesPerChecksum=16384 blocksize=65536 encoding=NONE includesMvcc=true includesTags=false compressAlgo=SNAPPY compressTags=false cryptoContext=[ cipher=NONE keyHash=NONE ] ] 2018-02-20 22:41:27,821 WARN [StoreFileOpenerThread-discrete-1] hfile.HFileReaderV2: concurrent: 0, thread: 3597, dataBlockOffset: 488903844 onDiskBlockSize: 97961, pread: true, fsBlockReader: hfs=org.apache.hadoop.hbase.fs.HFileSystem@d65b38e, path=s3://<bucket>/hbase/data/default/test-1/8a61a1839a150ee520644a054efc41ef/cf/2d0851b27c0143a1901e253c59d44129.b59b89676a973eede358ef5e7897b4d4, fileContext=HFileContext [ usesHBaseChecksum=true checksumType=CRC32C bytesPerChecksum=16384 blocksize=65536 encoding=NONE includesMvcc=true includesTags=false compressAlgo=SNAPPY compressTags=false cryptoContext=[ cipher=NONE keyHash=NONE ] ] > > > Then when they try to cache these values, it fails because the other > thread > > cached it's value and the two values aren't equivalent. [4] > > This is past the point of no return for the region split so it tries to > > roll forward, but fails. These regions are then corrupted and cannot be > > recovered. > > > > One thing I have notices is that these blocks are all leaf index blocks. > I > > guess these index blocks would be different if it was in a different > file. > > This is on HBase 1.3.1. > > > > It is definitely a concurrency bug as both daughter openers need to try > to > > get the block from disk before the other caches the block. Why would it > be > > trying to access two different paths for the same key? Is that normal in > a > > reference file? > > > > > The two paths are probably the two reference file paths, one for each > daughter. That the paths are different should be fine (even if same value); > the key the block cache uses is the based off filename. > > But you seem to be saying the key is the same so ultimately the file and > offset we are getting the value with -- used to make the block cache key -- > are the same but somehow the value gotten differs? > > > > > Does anyone have any idea as to what might be occurring? > > If you need more info, I can show you the code path that it is running > > through. > > > > > Say more Zach. Perhaps we can help out. > > Thanks, > > S > > > I'm continuing to do more investigation and will update more as I go along. Any insight is appreciated Thanks, Zach
