[ 
https://issues.apache.org/jira/browse/HDFS-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251084#comment-17251084
 ] 

Xiaoqiao He commented on HDFS-15732:
------------------------------------

Hi [~lalapala], Thanks for your report.
IMO, I think it is about that DataNode does not update BlockToken keys in time 
because process command `DNA_ACCESSKEYUPDATE` failed or not receive this 
message from Active NameNode. This case is not related with EC feature, it is 
common case for replication cases.
The root cause is that the current mode for BlockToken keys update is PUSH from 
NameNode, rather than PULL from DataNode, And NameNode does not know if ALL 
DataNode update keys successfully when distributed one roll keys. HDFS-13473 
try to improve it to PULL mode, FYI. Thanks.

> EC client will not retry get block token when block token expired  in 
> kerberized cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-15732
>                 URL: https://issues.apache.org/jira/browse/HDFS-15732
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsclient, ec, erasure-coding
>    Affects Versions: 3.1.1
>         Environment: hadoop 3.1.1
> kerberos
> ec RS-3-2-1024k
>            Reporter: gaozhan ding
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When enable ec policy on hbase, we got some issues. Once block token was 
> expired in datanode side, client side will not identify the InvalidToken 
> error because of the SASL negotiation. As a result, ec client will not do 
> retry by refetch token when create blockreader.  Then the peer datanode was 
> added to DeadNodes, and all calls to function createBlockReader aim at this 
> datanode in current DFSStripedInputStream will consider this datanode  was 
> dead and return false. The finally result is a read failure.
> Some logs :
> hbase regionserver:
> 2020-12-17 10:00:24,291 WARN 
> [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: 
> Failed to connect to /10.65.19.41:9866 for 
> blockBP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>  at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
>  at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>  at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
>  at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
>  at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
>  at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
>  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
>  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:395)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:250)
>  at 
> org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2031)
>  at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2022)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6408)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:6388)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2926)
>  
>  
> datanode log:
> 2020-12-17 10:00:24,290 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hadoop-btdn0005.eniot.io:9866:DataXceiver error processing unknown operation 
> src: /10.65.19.42:53894 dst: /10.65.19.41:9866
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block 
> token with block_token_identifier (expiryDate=1608192045256, 
> keyId=1449537289, userId=hbase, 
> blockPoolId=BP-1601568648-10.65.19.12-1550823043026, 
> blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, 
> DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, 
> DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, 
> DS-a1ee0117-7430-4181-b279-3f1b16541567, 
> DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, 
> DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.]
>  at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>  at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block 
> token with block_token_identifier (expiryDate=1608192045256, 
> keyId=1449537289, userId=hbase, 
> blockPoolId=BP-1601568648-10.65.19.12-1550823043026, 
> blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, 
> DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, 
> DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, 
> DS-a1ee0117-7430-4181-b279-3f1b16541567, 
> DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, 
> DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.
>  at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:479)
>  at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:81)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>  at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>  ... 7 more
>  
> final result
>  
>  
> java.io.IOException: Could not iterate StoreFileScanner[HFileScanner for 
> reader 
> reader=hdfs://azbeta/hbase/data/o15427722038191/RAW_569dfc35-10dc-4fe4-bb8c-5c403368d6c1/ccf146fb5c852d751b6b2f8307f81fa0/d/386bdb6978d348f4b4e168e728762842,
>  compression=snappy, cacheConf=blockCache=LruBlockCache\{blockCount=7915, 
> currentSize=489.13 MB, freeSize=2.71 GB, maxSize=3.19 GB, heapSize=489.13 MB, 
> minSize=3.03 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5, 
> singleSize=775.81 MB, singleFactor=0.25}, cacheDataOnRead=true, 
> cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, 
> cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, 
> firstKey=Optional[\x01\x00\x00\x13\xBD\x00\x00\x00T\x070\x97\xF7/d:arm_xingneng_test_model@AI1/1603445985858/Put/seqid=0],
>  
> lastKey=Optional[\x01\x00\x00\x16\xF9\x00\x00\x00T\x073-8/d:arm_xingneng_test_model@AI99/1603446144699/Put/seqid=0],
>  avgKeyLen=53, avgValueLen=4, entries=2016557, length=22957603, 
> cur=\x01\x00\x00\x14#\x00\x00\x00T\x072\xCC\xF7/d:arm_xingneng_test_model@AI84/1603446131146/Put/vlen=4/seqid=17230816]
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:217)
>  at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:120)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
>  at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:388)
>  at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
>  at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
>  at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
>  at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1410)
>  at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2187)
>  at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:596)
>  at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:638)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: 3 missing blocks, the stripe is: 
> AlignedStripe(Offset=1048576, length=1048576, fetchedChunksNum=0, 
> missingChunksNum=3); locatedBlocks is: LocatedBlocks\{; fileLength=22957603; 
> underConstruction=false; 
> blocks=[LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750;
>  getBlockSize()=22957603; corrupt=false; offset=0; 
> locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK],
>  
> DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK],
>  
> DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK],
>  
> DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK],
>  
> DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]];
>  indices=[0, 1, 2, 3, 4]}]; 
> lastLocatedBlock=LocatedStripedBlock\{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750;
>  getBlockSize()=22957603; corrupt=false; offset=0; 
> locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK],
>  
> DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK],
>  
> DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK],
>  
> DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK],
>  
> DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]];
>  indices=[0, 1, 2, 3, 4]}; isLastBlockComplete=true; 
> ecPolicy=ErasureCodingPolicy=[Name=RS-3-2-1024k, Schema=[ECSchema=[Codec=rs, 
> numDataUnits=3, numParityUnits=2]], CellSize=1048576, Id=2]}
>  at 
> org.apache.hadoop.hdfs.StripeReader.checkMissingBlocks(StripeReader.java:177)
>  at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:372)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:324)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:397)
>  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
>  at java.io.DataInputStream.read(DataInputStream.java:149)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:765)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1562)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:931)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.isNextBlock(HFileReaderImpl.java:1064)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.positionForNextBlock(HFileReaderImpl.java:1058)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl._next(HFileReaderImpl.java:1076)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.next(HFileReaderImpl.java:1097)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:208)
>  ... 13 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to