[ https://issues.apache.org/jira/browse/HBASE-21607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721914#comment-16721914 ]
Lars Hofhansl commented on HBASE-21607:
---------------------------------------

This of course is a rare case... Choosing 0 as a valid value read from a datastore is just bad practice, though. Not sure how we can change it. I suppose we could have a new HFile version, have 0 indicate an error, and pick only a non-zero (and non-255) value.

> HBase Region server read fails in case of datanode disk error.
> --------------------------------------------------------------
>
>                 Key: HBASE-21607
>                 URL: https://issues.apache.org/jira/browse/HBASE-21607
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.3.2
>            Reporter: Rushabh S Shah
>            Priority: Major
>
> HBase region server reads failed with the following error.
> {noformat}
> 2018-11-30 16:49:18,760 WARN [,queue=12,port=60020] hdfs.BlockReaderFactory -
> BlockReaderFactory(fileName=<fileName>,
> block=BP-1618467445-<ip>-1516873463430:blk_1090719164_16982933): error
> creating ShortCircuitReplica.
> java.io.IOException: invalid metadata header version 0. Can only handle version 1.
>   at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:129)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:558)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:490)
>   at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:782)
>   at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:716)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
>   at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1145)
>   at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1087)
>   at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1444)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1407)
>   at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:89)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:834)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1530)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1781)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1624)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:455)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1263)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:297)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:189)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:372)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:220)
>   at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2164)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:5916)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5890)
>   at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2739)
>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2719)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7197)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7156)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7149)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2250)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:35068)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2373)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
>
> 2018-11-30 16:49:18,760 WARN [,queue=12,port=60020] shortcircuit.ShortCircuitCache -
> ShortCircuitCache(0x16fd768f): failed to load 1090719164_BP-1618467445-<ip>-1516873463430
> 2018-11-30 16:49:18,761 DEBUG [,queue=12,port=60020] ipc.RpcServer -
> RpcServer.FifoWFPBQ.default.handler=246,queue=12,port=60020: callId: 46940
> service: ClientService methodName: Get size: 443 connection: <ip>:48798
> deadline: 1543596678759
> 2018-11-30 16:49:18,761 DEBUG [,queue=12,port=60020] ipc.RpcServer -
> RpcServer.FifoWFPBQ.default.handler=246,queue=12,port=60020: callId: 46940
> service: ClientService methodName: Get size: 443 connection: <ip>:48798
> deadline: 1543596678759
> java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader
> reader=hdfs://<fileName>, compression=none,
> cacheConf=blockCache=LruBlockCache{blockCount=696858, currentSize=7625195376,
> freeSize=5689203856, maxSize=13314399232, heapSize=7625195376,
> minSize=12648679424, minFactor=0.95, multiSize=6324339712, multiFactor=0.5,
> singleSize=3162169856, singleFactor=0.25}, cacheDataOnRead=true,
> cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false,
> cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false,
> firstKey=<keyName>
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:207)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:372)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:220)
>   at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2164)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:5916)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5890)
>   at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2739)
>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2719)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7197)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7156)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7149)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2250)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:35068)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2373)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
> Caused by: java.io.IOException: Invalid HFile block magic: \x00\x00\x00\x00\x00\x00\x00\x00
>   at org.apache.hadoop.hbase.io.hfile.BlockType.parse(BlockType.java:155)
>   at org.apache.hadoop.hbase.io.hfile.BlockType.read(BlockType.java:167)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock.<init>(HFileBlock.java:345)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1821)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1624)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:455)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1263)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:297)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:189)
>   ... 16 more
> {noformat}
> The "Invalid HFile block magic" exception was thrown all the way back to the client.
> The first stack trace shows that the HDFS short-circuit read failed; the read then fell back to the regular HDFS RPC path, which succeeded in returning the bytes.
> The underlying cause was a write error on one of the datanode's disks. The datanode was receiving and mirroring (to downstream nodes in the pipeline) just fine, but was writing garbage data (all \x00\x00\x00\x00\x00\x00\x00\x00) to disk. The datanode had not yet detected this disk as bad.
> We have HBase checksums turned on and HDFS checksum verification turned off. While reading this erroneous block, the datanode did no checksum validation of its own; it streamed the data as-is to the region server and relied on HBase checksum validation to catch the corruption. But HBase checksum validation has a special case: if the checksumType byte is 0, it is treated as NULL, meaning checksum validation is not needed for this block, so an all-zero block sails right through (see the sketches below). Refer to
> [ChecksumUtil|https://github.com/apache/hbase/blob/branch-1.3/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/ChecksumUtil.java#L95]
> and
> [ChecksumType|https://github.com/apache/hbase/blob/branch-1.3/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ChecksumType.java#L33].
> Thanks [~lhofhansl] and [~karanmehta93] for help debugging this.
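>
> For illustration, a minimal sketch of that special case (a hypothetical simplification of the branch-1.3 ChecksumUtil/ChecksumType logic linked above, not the actual code):
> {code:java}
> // Hypothetical simplification, NOT the real ChecksumUtil: shows why a
> // zeroed-out block passes HBase checksum validation. In ChecksumType,
> // code 0 is NULL, i.e. "no checksum for this block".
> public class ChecksumBypassSketch {
>   static final byte NULL_CHECKSUM_CODE = 0;
>
>   static boolean validateBlock(byte checksumTypeCode, byte[] data, long storedChecksum) {
>     if (checksumTypeCode == NULL_CHECKSUM_CODE) {
>       // Special case: nothing to verify, the block is accepted as-is.
>       return true;
>     }
>     return computeCrc(data) == storedChecksum;
>   }
>
>   static long computeCrc(byte[] data) {
>     java.util.zip.CRC32 crc = new java.util.zip.CRC32();
>     crc.update(data);
>     return crc.getValue();
>   }
>
>   public static void main(String[] args) {
>     byte[] zeroedBlock = new byte[64]; // what the bad disk actually wrote
>     // The zeroed header yields checksumType == 0, so validation is skipped
>     // and the corruption only surfaces later, in BlockType.parse(), as
>     // "Invalid HFile block magic".
>     System.out.println(validateBlock(zeroedBlock[0], zeroedBlock, 0L)); // prints true
>   }
> }
> {code}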
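>
> And a sketch of one possible stricter decoding along the lines suggested in the comment, for a hypothetical new HFile version in which 0 marks an error rather than NULL (the class and method names are illustrative, not an existing HBase API):
> {code:java}
> import java.io.IOException;
>
> // Hypothetical decoder for a new HFile version: reject checksum-type
> // code 0 (what a zeroed disk block looks like) and 255 (all-ones
> // corruption) instead of silently treating 0 as "no checksum".
> public final class StrictChecksumTypeDecoder {
>   public static byte decode(byte code) throws IOException {
>     int unsigned = code & 0xFF;
>     if (unsigned == 0 || unsigned == 255) {
>       throw new IOException("Corrupt or invalid checksum type code: " + unsigned);
>     }
>     return code; // a valid, deliberately chosen non-zero code
>   }
> }
> {code}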