gaozhan ding created HDFS-15732:
-----------------------------------

             Summary: EC client does not retry fetching the block token when the block token expires in a Kerberized cluster
                 Key: HDFS-15732
                 URL: https://issues.apache.org/jira/browse/HDFS-15732
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: dfsclient, ec, erasure-coding
    Affects Versions: 3.1.1
         Environment: Hadoop 3.1.1, Kerberos, EC policy RS-3-2-1024k
            Reporter: gaozhan ding


After enabling an EC policy for HBase, we ran into the following issue. Once a block token has expired on the DataNode side, the client cannot identify the underlying InvalidToken error because it is wrapped inside the SASL negotiation failure. As a result, the EC client does not refetch the token and retry when creating a block reader. Instead, the peer DataNode is added to the DeadNodes set, so every subsequent createBlockReader call against that DataNode in the current DFSStripedInputStream treats it as dead and returns false. The end result is a read failure.
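
For comparison, the non-striped read path recovers from this situation by refetching the block token and retrying. The sketch below is illustrative only: the helpers newBlockReader and refetchBlockWithNewToken are hypothetical stand-ins for DFSClient internals, not actual HDFS APIs. It shows the retry pattern the EC path misses when SASL hides the InvalidToken cause:

import java.io.IOException;
import org.apache.hadoop.hdfs.BlockReader;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException;

// Sketch only: the retry-on-expired-token pattern. Both abstract helpers
// are hypothetical stand-ins for DFSClient internals.
abstract class TokenRetrySketch {

  // Hypothetical: open a remote block reader against one datanode.
  abstract BlockReader newBlockReader(LocatedBlock block, DatanodeInfo dn)
      throws IOException;

  // Hypothetical: re-contact the NameNode for fresh block locations, which
  // come with a newly issued block token.
  abstract LocatedBlock refetchBlockWithNewToken(LocatedBlock block)
      throws IOException;

  BlockReader createBlockReaderWithRetry(LocatedBlock block, DatanodeInfo dn)
      throws IOException {
    boolean tokenRefetched = false;
    while (true) {
      try {
        return newBlockReader(block, dn);
      } catch (InvalidBlockTokenException e) {
        // Without SASL in the way, the client sees the InvalidToken cause
        // directly and can retry once with a refreshed token.
        if (tokenRefetched) {
          throw e;
        }
        block = refetchBlockWithNewToken(block);
        tokenRefetched = true;
      }
      // In a Kerberized cluster the expired token surfaces instead as a
      // generic IOException ("DIGEST-MD5: IO error acquiring password")
      // from the SASL handshake, so this catch never fires and the EC
      // client marks the datanode dead -- the failure reported here.
    }
  }
}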

Some logs:

HBase RegionServer log:

2020-12-17 10:00:24,291 WARN [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: Failed to connect to /10.65.19.41:9866 for block BP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
java.io.IOException: DIGEST-MD5: IO error acquiring password
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
 at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
 at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
 at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
 at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
 at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
 at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
 at org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
 at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
 at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
 at org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
 at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
 at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
 at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
 at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
 at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
 at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
 at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:395)
 at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:250)
 at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2031)
 at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2022)
 at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6408)
 at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:6388)
 at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2926)

DataNode log:

2020-12-17 10:00:24,290 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-btdn0005.eniot.io:9866:DataXceiver error processing unknown operation src: /10.65.19.42:53894 dst: /10.65.19.41:9866
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1608192045256, keyId=1449537289, userId=hbase, blockPoolId=BP-1601568648-10.65.19.12-1550823043026, blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, DS-a1ee0117-7430-4181-b279-3f1b16541567, DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.]
 at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
 at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1608192045256, keyId=1449537289, userId=hbase, blockPoolId=BP-1601568648-10.65.19.12-1550823043026, blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, DS-a1ee0117-7430-4181-b279-3f1b16541567, DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.
 at org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:479)
 at org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:81)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
 at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
 at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
 ... 7 more
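
The DataNode side is behaving as designed: the trace shows BlockTokenSecretManager.retrievePassword throwing InvalidToken because the token's expiryDate has passed. Conceptually that is just the following check (a paraphrase for illustration, not the actual source):

class TokenExpirySketch {
  // Paraphrase of the expiry test behind the "is expired" message above;
  // block tokens carry an absolute expiry timestamp in milliseconds.
  static boolean isExpired(long expiryDateMillis) {
    return expiryDateMillis < System.currentTimeMillis();
  }
}

So the rejection itself is correct; the missing piece is on the client side, which should treat this failure as a token problem and refetch rather than mark the DataNode dead.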

 

Final result: the read fails during compaction. With RS-3-2 a stripe can tolerate at most 2 missing chunks, but here 3 of the 5 are counted as missing (missingChunksNum=3), so the stripe can be neither read directly nor reconstructed:

java.io.IOException: Could not iterate StoreFileScanner[HFileScanner for reader reader=hdfs://azbeta/hbase/data/o15427722038191/RAW_569dfc35-10dc-4fe4-bb8c-5c403368d6c1/ccf146fb5c852d751b6b2f8307f81fa0/d/386bdb6978d348f4b4e168e728762842, compression=snappy, cacheConf=blockCache=LruBlockCache{blockCount=7915, currentSize=489.13 MB, freeSize=2.71 GB, maxSize=3.19 GB, heapSize=489.13 MB, minSize=3.03 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5, singleSize=775.81 MB, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=Optional[\x01\x00\x00\x13\xBD\x00\x00\x00T\x070\x97\xF7/d:arm_xingneng_test_model@AI1/1603445985858/Put/seqid=0], lastKey=Optional[\x01\x00\x00\x16\xF9\x00\x00\x00T\x073-8/d:arm_xingneng_test_model@AI99/1603446144699/Put/seqid=0], avgKeyLen=53, avgValueLen=4, entries=2016557, length=22957603, cur=\x01\x00\x00\x14#\x00\x00\x00T\x072\xCC\xF7/d:arm_xingneng_test_model@AI84/1603446131146/Put/vlen=4/seqid=17230816]
 at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:217)
 at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:120)
 at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
 at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:388)
 at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
 at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
 at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
 at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1410)
 at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2187)
 at org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:596)
 at org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:638)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 3 missing blocks, the stripe is: AlignedStripe(Offset=1048576, length=1048576, fetchedChunksNum=0, missingChunksNum=3); locatedBlocks is: LocatedBlocks{; fileLength=22957603; underConstruction=false; blocks=[LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750; getBlockSize()=22957603; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK], DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK], DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK], DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK], DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]]; indices=[0, 1, 2, 3, 4]}]; lastLocatedBlock=LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750; getBlockSize()=22957603; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK], DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK], DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK], DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK], DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]]; indices=[0, 1, 2, 3, 4]}; isLastBlockComplete=true; ecPolicy=ErasureCodingPolicy=[Name=RS-3-2-1024k, Schema=[ECSchema=[Codec=rs, numDataUnits=3, numParityUnits=2]], CellSize=1048576, Id=2]}
 at org.apache.hadoop.hdfs.StripeReader.checkMissingBlocks(StripeReader.java:177)
 at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:372)
 at org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:324)
 at org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:397)
 at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
 at java.io.DataInputStream.read(DataInputStream.java:149)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:765)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1562)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:931)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.isNextBlock(HFileReaderImpl.java:1064)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.positionForNextBlock(HFileReaderImpl.java:1058)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl._next(HFileReaderImpl.java:1076)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.next(HFileReaderImpl.java:1097)
 at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:208)
 ... 13 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
