gaozhan ding created HDFS-15732:
-----------------------------------
Summary: EC client does not retry fetching the block token when it has
expired in a Kerberized cluster
Key: HDFS-15732
URL: https://issues.apache.org/jira/browse/HDFS-15732
Project: Hadoop HDFS
Issue Type: Bug
Components: dfsclient, ec, erasure-coding
Affects Versions: 3.1.1
Environment: Hadoop 3.1.1
Kerberos
EC policy RS-3-2-1024k
Reporter: gaozhan ding
After enabling an EC policy for HBase we hit the following issue. Once a block
token expires on the datanode side, the client cannot identify the underlying
InvalidToken error because it is hidden behind the SASL negotiation failure. As
a result, the EC client does not retry with a refetched block token when
creating a block reader. Instead, the peer datanode is added to the DeadNodes
set, and every subsequent call to createBlockReader for that datanode in the
current DFSStripedInputStream treats the datanode as dead and returns false.
The end result is a read failure.
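For comparison, the non-striped DFSInputStream path refetches block locations (and with them fresh tokens) when it recognizes an InvalidToken failure, whereas the striped path gives up. Below is a minimal sketch of the kind of single retry the EC client could perform when createBlockReader fails; the names ReaderFactory, refetchBlockToken and looksLikeExpiredToken are hypothetical helpers used only to illustrate the idea, not existing HDFS APIs.

{code:java}
import java.io.IOException;

// Hedged sketch, not the actual HDFS code: shows a retry-with-fresh-token
// path instead of immediately marking the datanode dead.
public class StripedBlockReaderRetrySketch {

    /** Stand-in for the per-datanode reader the client builds. */
    interface BlockReader {}

    /** Hypothetical factory; in HDFS this role is played by BlockReaderFactory#build(). */
    interface ReaderFactory {
        BlockReader build() throws IOException;
    }

    /**
     * Try to create a block reader, retrying once with refreshed block tokens
     * when the failure looks like an expired token rather than a dead datanode.
     */
    static BlockReader createBlockReaderWithRetry(ReaderFactory factory,
                                                  Runnable refetchBlockToken)
            throws IOException {
        try {
            return factory.build();
        } catch (IOException e) {
            // Behind SASL the datanode's InvalidToken only reaches the client as a
            // generic "DIGEST-MD5: IO error acquiring password" IOException, so this
            // check is necessarily heuristic.
            if (looksLikeExpiredToken(e)) {
                refetchBlockToken.run(); // e.g. refetch block locations (and tokens) from the NameNode
                return factory.build();  // single retry with the fresh token
            }
            // Any other failure: let the caller mark the datanode dead as before.
            throw e;
        }
    }

    private static boolean looksLikeExpiredToken(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.contains("error acquiring password");
    }
}
{code}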
Some logs:
HBase RegionServer log:
2020-12-17 10:00:24,291 WARN
[RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient:
Failed to connect to /10.65.19.41:9866 for
blockBP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
java.io.IOException: DIGEST-MD5: IO error acquiring password
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
at
org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
at
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
at
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
at
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
at
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
at
org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
at
org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
at
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
at
org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:395)
at
org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:250)
at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2031)
at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2022)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6408)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:6388)
at
org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2926)
DataNode log:
2020-12-17 10:00:24,290 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
hadoop-btdn0005.eniot.io:9866:DataXceiver error processing unknown operation
src: /10.65.19.42:53894 dst: /10.65.19.41:9866
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block
token with block_token_identifier (expiryDate=1608192045256, keyId=1449537289,
userId=hbase, blockPoolId=BP-1601568648-10.65.19.12-1550823043026,
blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK,
DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd,
DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a,
DS-a1ee0117-7430-4181-b279-3f1b16541567,
DS-e8459904-e1e1-44a6-affa-9ce9fc56f877,
DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.]
at
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
at
com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block
token with block_token_identifier (expiryDate=1608192045256, keyId=1449537289,
userId=hbase, blockPoolId=BP-1601568648-10.65.19.12-1550823043026,
blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK,
DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd,
DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a,
DS-a1ee0117-7430-4181-b279-3f1b16541567,
DS-e8459904-e1e1-44a6-affa-9ce9fc56f877,
DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.
at
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:479)
at
org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:81)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
at
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
at
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
... 7 more
Final result:
java.io.IOException: Could not iterate StoreFileScanner[HFileScanner for reader
reader=hdfs://azbeta/hbase/data/o15427722038191/RAW_569dfc35-10dc-4fe4-bb8c-5c403368d6c1/ccf146fb5c852d751b6b2f8307f81fa0/d/386bdb6978d348f4b4e168e728762842,
compression=snappy, cacheConf=blockCache=LruBlockCache\{blockCount=7915,
currentSize=489.13 MB, freeSize=2.71 GB, maxSize=3.19 GB, heapSize=489.13 MB,
minSize=3.03 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5,
singleSize=775.81 MB, singleFactor=0.25}, cacheDataOnRead=true,
cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false,
cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false,
firstKey=Optional[\x01\x00\x00\x13\xBD\x00\x00\x00T\x070\x97\xF7/d:arm_xingneng_test_model@AI1/1603445985858/Put/seqid=0],
lastKey=Optional[\x01\x00\x00\x16\xF9\x00\x00\x00T\x073-8/d:arm_xingneng_test_model@AI99/1603446144699/Put/seqid=0],
avgKeyLen=53, avgValueLen=4, entries=2016557, length=22957603,
cur=\x01\x00\x00\x14#\x00\x00\x00T\x072\xCC\xF7/d:arm_xingneng_test_model@AI84/1603446131146/Put/vlen=4/seqid=17230816]
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:217)
at
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:120)
at
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
at
org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:388)
at
org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
at
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
at
org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1410)
at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2187)
at
org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:596)
at
org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:638)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 3 missing blocks, the stripe is:
AlignedStripe(Offset=1048576, length=1048576, fetchedChunksNum=0,
missingChunksNum=3); locatedBlocks is: LocatedBlocks\{; fileLength=22957603;
underConstruction=false;
blocks=[LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750;
getBlockSize()=22957603; corrupt=false; offset=0;
locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK],
DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK],
DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK],
DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK],
DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]];
indices=[0, 1, 2, 3, 4]}];
lastLocatedBlock=LocatedStripedBlock\{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750;
getBlockSize()=22957603; corrupt=false; offset=0;
locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK],
DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK],
DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK],
DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK],
DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]];
indices=[0, 1, 2, 3, 4]}; isLastBlockComplete=true;
ecPolicy=ErasureCodingPolicy=[Name=RS-3-2-1024k, Schema=[ECSchema=[Codec=rs,
numDataUnits=3, numParityUnits=2]], CellSize=1048576, Id=2]}
at
org.apache.hadoop.hdfs.StripeReader.checkMissingBlocks(StripeReader.java:177)
at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:372)
at
org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:324)
at
org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:397)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
at java.io.DataInputStream.read(DataInputStream.java:149)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:765)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1562)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:931)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.isNextBlock(HFileReaderImpl.java:1064)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.positionForNextBlock(HFileReaderImpl.java:1058)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl._next(HFileReaderImpl.java:1076)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.next(HFileReaderImpl.java:1097)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:208)
... 13 more