[ https://issues.apache.org/jira/browse/HDFS-15461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219398#comment-17219398 ]
Ahmed Hussein commented on HDFS-15461: -------------------------------------- The JUnit test fails because one of the DataNodes becomes stressed and is rejected as "too busy". The replication factor is set to 3, which is the same as the number of DataNodes, so when DN[0] is shut down the client cannot find a replacement for it. I found a previous Jira, HDFS-13833 (Improve BlockPlacementPolicyDefault's consider load logic), that discusses a corner case in the {{considerLoad}} logic. [~weichiu], do you think the DN being rejected with {{NODE_TOO_BUSY=1}} is related to that old Jira (HDFS-13833)? Nevertheless, I created a hack fix in [PR-2404|https://github.com/apache/hadoop/pull/2404].

{code:bash}
2020-10-16 21:57:45,026 [IPC Server handler 6 on default port 39149] INFO namenode.FSNamesystem (FSNamesystem.java:updatePipeline(5851)) - updatePipeline(blk_1073741832_1031, newGS=1032, newLength=1012, newNodes=[127.0.0.1:38023, 127.0.0.1:43295, 127.0.0.1:39597], client=DFSClient_NONMAPREDUCE_-1562239120_1)
2020-10-16 21:57:45,026 [IPC Server handler 6 on default port 39149] INFO namenode.FSNamesystem (FSNamesystem.java:updatePipeline(5869)) - updatePipeline(blk_1073741832_1031 => blk_1073741832_1032) success
2020-10-16 21:57:45,032 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=LAST_IN_PIPELINE] INFO DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1539)) - src: /127.0.0.1:54924, dest: /127.0.0.1:39597, bytes: 1024, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1562239120_1, offset: 0, srvID: fba1416b-f599-43b9-9c6d-51618f14e51a, blockid: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, duration(ns): 6521496
2020-10-16 21:57:45,032 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=LAST_IN_PIPELINE] INFO datanode.DataNode (BlockReceiver.java:run(1512)) - PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=LAST_IN_PIPELINE terminating
2020-10-16 21:57:45,032 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[127.0.0.1:39597]] INFO DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1539)) - src: /127.0.0.1:41768, dest: /127.0.0.1:43295, bytes: 1024, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1562239120_1, offset: 0, srvID: bf1c5a07-31d5-407b-820e-00c4cf242494, blockid: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, duration(ns): 6943537
2020-10-16 21:57:45,033 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[127.0.0.1:39597]] INFO datanode.DataNode (BlockReceiver.java:run(1512)) - PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[127.0.0.1:39597] terminating
2020-10-16 21:57:45,033 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[127.0.0.1:43295, 127.0.0.1:39597]] INFO DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1539)) - src: /127.0.0.1:56942, dest: /127.0.0.1:38023, bytes: 1024, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1562239120_1, offset: 0, srvID: 62117e32-87c2-4374-9de3-916b9ba9cd24, blockid: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, duration(ns): 7664684
2020-10-16 21:57:45,033 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[127.0.0.1:43295, 127.0.0.1:39597]] INFO datanode.DataNode (BlockReceiver.java:run(1512)) - PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741832_1032, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[127.0.0.1:43295, 127.0.0.1:39597] terminating
2020-10-16 21:57:45,036 [IPC Server handler 9 on default port 39149] INFO blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(891)) - Not enough replicas was chosen. Reason: {NODE_TOO_BUSY=1}
2020-10-16 21:57:45,037 [IPC Server handler 9 on default port 39149] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(471)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-10-16 21:57:45,037 [IPC Server handler 9 on default port 39149] WARN protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) - Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2020-10-16 21:57:45,037 [IPC Server handler 9 on default port 39149] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(471)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2020-10-16 21:57:45,038 [IPC Server handler 9 on default port 39149] INFO hdfs.StateChange (FSDirWriteFileOp.java:logAllocatedBlock(802)) - BLOCK* allocate blk_1073741833_1033, replicas=127.0.0.1:39597, 127.0.0.1:43295 for /bar
2020-10-16 21:57:45,040 [DataXceiver for client DFSClient_NONMAPREDUCE_-1562239120_1 at /127.0.0.1:54928 [Receiving block BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033]] INFO datanode.DataNode (DataXceiver.java:writeBlock(747)) - Receiving BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033 src: /127.0.0.1:54928 dest: /127.0.0.1:39597
2020-10-16 21:57:45,041 [DataXceiver for client DFSClient_NONMAPREDUCE_-1562239120_1 at /127.0.0.1:41778 [Receiving block BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033]] INFO datanode.DataNode (DataXceiver.java:writeBlock(747)) - Receiving BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033 src: /127.0.0.1:41778 dest: /127.0.0.1:43295
2020-10-16 21:57:45,045 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033, type=LAST_IN_PIPELINE] INFO DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1539)) - src: /127.0.0.1:41778, dest: /127.0.0.1:43295, bytes: 243, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1562239120_1, offset: 0, srvID: bf1c5a07-31d5-407b-820e-00c4cf242494, blockid: BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033, duration(ns): 2163102
2020-10-16 21:57:45,045 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033, type=LAST_IN_PIPELINE] INFO datanode.DataNode (BlockReceiver.java:run(1512)) - PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033, type=LAST_IN_PIPELINE terminating
2020-10-16 21:57:45,046 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[127.0.0.1:43295]] INFO DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1539)) - src: /127.0.0.1:54928, dest: /127.0.0.1:39597, bytes: 243, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1562239120_1, offset: 0, srvID: fba1416b-f599-43b9-9c6d-51618f14e51a, blockid: BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033, duration(ns): 2876353
2020-10-16 21:57:45,046 [PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[127.0.0.1:43295]] INFO datanode.DataNode (BlockReceiver.java:run(1512)) - PacketResponder: BP-1142769077-172.17.0.2-1602885461097:blk_1073741833_1033, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[127.0.0.1:43295] terminating
2020-10-16 21:57:45,047 [IPC Server handler 2 on default port 39149] INFO hdfs.StateChange (FSNamesystem.java:completeFile(3164)) - DIR* completeFile: /bar is closed by DFSClient_NONMAPREDUCE_-1562239120_1
2020-10-16 21:57:45,049 [IPC Server handler 3 on default port 39149] INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(8708)) - allowed=true ugi=jenkins (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/bar dst=null perm=null proto=rpc
2020-10-16 21:57:45,056 [IPC Server handler 4 on default port 39149] INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(8708)) - allowed=true ugi=jenkins (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/bar dst=null perm=null proto=rpc
2020-10-16 21:57:45,058 [IPC Server handler 6 on default port 39149] INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(8708)) - allowed=true ugi=jenkins (auth:SIMPLE) ip=/127.0.0.1 cmd=append src=/bar dst=null perm=null proto=rpc
2020-10-16 21:57:45,068 [IPC Server handler 5 on default port 39149] INFO blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(891)) - Not enough replicas was chosen. Reason: {NODE_TOO_BUSY=1}
2020-10-16 21:57:45,068 [IPC Server handler 5 on default port 39149] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(471)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-10-16 21:57:45,069 [IPC Server handler 5 on default port 39149] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(471)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-10-16 21:57:45,069 [IPC Server handler 5 on default port 39149] WARN protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) - Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2020-10-16 21:57:45,069 [IPC Server handler 5 on default port 39149] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(471)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) All required storage types are unavailable: unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2020-10-16 21:57:45,074 [Thread-851] WARN hdfs.DataStreamer (DataStreamer.java:run(825)) - DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:43295,DS-375c3e68-86ca-46bd-b95c-a6ae4665fb64,DISK], DatanodeInfoWithStorage[127.0.0.1:39597,DS-67dbab12-9ea4-490f-89a1-7e6d1bb2fcca,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:43295,DS-375c3e68-86ca-46bd-b95c-a6ae4665fb64,DISK], DatanodeInfoWithStorage[127.0.0.1:39597,DS-67dbab12-9ea4-490f-89a1-7e6d1bb2fcca,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
    at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
    at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:719)
2020-10-16 21:57:45,076 [Listener at localhost/33255] INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(2091)) - Shutting down the Mini HDFS Cluster
2020-10-16 21:57:45,077 [Listener at localhost/33255] INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdownDataNode(2139)) - Shutting down DataNode 2
2020-10-16 21:57:45,077 [Listener at localhost/33255] INFO datanode.DirectoryScanner (DirectoryScanner.java:shutdown(392)) - Shutdown has been called
2020-10-16 21:57:45,077 [org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@692fd26] INFO datanode.DataNode (DataXceiverServer.java:closeAllPeers(396)) - Closing all peers.
2020-10-16 21:57:45,078 [VolumeScannerThread(/home/jenkins/jenkins-agent/workspace/PreCommit-HDFS-Build/sourcedir/hadoop-hdfs-project/hadoop-hdfs/target/test/data/1/dfs/data/data6)] INFO datanode.VolumeScanner (VolumeScanner.java:run(672)) - VolumeScanner(/home/jenkins/jenkins-agent/workspace/PreCommit-HDFS-Build/sourcedir/hadoop-hdfs-project/hadoop-hdfs/target/test/data/1/dfs/data/data6, DS-e527bc67-d8b3-4acd-adfa-20fe97d691ff) exiting.
2020-10-16 21:57:45,078 [VolumeScannerThread(/home/jenkins/jenkins-agent/workspace/PreCommit-HDFS-Build/sourcedir/hadoop-hdfs-project/hadoop-hdfs/target/test/data/1/dfs/data/data5)] INFO datanode.VolumeScanner (VolumeScanner.java:run(672)) - VolumeScanner(/home/jenkins/jenkins-agent/workspace/PreCommit-HDFS-Build/sourcedir/hadoop-hdfs-project/hadoop-hdfs/target/test/data/1/dfs/data/data5, DS-375c3e68-86ca-46bd-b95c-a6ae4665fb64) exiting.
2020-10-16 21:57:45,100 [Listener at localhost/33255] INFO handler.ContextHandler (ContextHandler.java:doStop(1007)) - Stopped o.e.j.w.WebAppContext@1d61c6dc{datanode,/,null,UNAVAILABLE}{file:/home/jenkins/jenkins-agent/workspace/PreCommit-HDFS-Build/sourcedir/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/datanode}
2020-10-16 21:57:45,101 [Listener at localhost/33255] INFO server.AbstractConnector (AbstractConnector.java:doStop(343)) - Stopped ServerConnector@4e31c3ec{HTTP/1.1,[http/1.1]}{localhost:0}
2020-10-16 21:57:45,101 [Listener at localhost/33255] INFO server.session (HouseKeeper.java:stopScavenging(158)) - node0 Stopped scavenging
2020-10-16 21:57:45,102 [Listener at localhost/33255] INFO handler.ContextHandler (ContextHandler.java:doStop(1007)) - Stopped o.e.j.s.ServletContextHandler@627d8516{static,/static,file:///home/jenkins/jenkins-agent/workspace/PreCommit-HDFS-Build/sourcedir/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/,UNAVAILABLE}
2020-10-16 21:57:45,102 [Listener at localhost/33255] INFO handler.ContextHandler (ContextHandler.java:doStop(1007)) - Stopped o.e.j.s.ServletContextHandler@26722665{logs,/logs,file:///home/jenkins/jenkins-agent/workspace/PreCommit-HDFS-Build/sourcedir/hadoop-hdfs-project/hadoop-hdfs/target/log/,UNAVAILABLE}
2020-10-16 21:57:45,103 [Listener at localhost/33255] INFO datanode.DataNode (DataNode.java:shutdown(2104)) - Waiting up to 30 seconds for transfer threads to complete
2020-10-16 21:57:45,103 [Listener at localhost/33255] INFO ipc.Server (Server.java:stop(3400)) - Stopping server on 33255
2020-10-16 21:57:45,107 [IPC Server listener on 0] INFO ipc.Server (Server.java:run(1361)) - Stopping IPC Server listener on 0
2020-10-16 21:57:45,107 [IPC Server Responder] INFO ipc.Server (Server.java:run(1497)) - Stopping IPC Server Responder
2020-10-16 21:57:45,107 [BP-1142769077-172.17.0.2-1602885461097 heartbeating to localhost/127.0.0.1:39149] WARN datanode.IncrementalBlockReportManager (IncrementalBlockReportManager.java:waitTillNextIBR(160)) - IncrementalBlockReportManager interrupted
2020-10-16 21:57:45,107 [Command processor] ERROR datanode.DataNode (BPServiceActor.java:processQueue(1329)) - Command processor encountered interrupt and exit.
{code}

> TestDFSClientRetries#testGetFileChecksum fails intermittently
> -------------------------------------------------------------
>
>                 Key: HDFS-15461
>                 URL: https://issues.apache.org/jira/browse/HDFS-15461
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Ahmed Hussein
>            Assignee: Ahmed Hussein
>            Priority: Major
>              Labels: pull-request-available, test
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{TestDFSClientRetries.testGetFileChecksum}} fails intermittently on hadoop trunk
> {code:bash}
> [INFO] Running org.apache.hadoop.hdfs.TestGetFileChecksum
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 10.491 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestGetFileChecksum
> [ERROR] testGetFileChecksum(org.apache.hadoop.hdfs.TestGetFileChecksum) Time elapsed: 4.248 s <<< ERROR!
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:52468,DS-e35b6720-8ac2-4e5e-98df-306985da6924,DISK], DatanodeInfoWithStorage[127.0.0.1:52472,DS-91ec34d5-3f0a-494e-aed6-b01fa0131d8a,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:52472,DS-91ec34d5-3f0a-494e-aed6-b01fa0131d8a,DISK], DatanodeInfoWithStorage[127.0.0.1:52468,DS-e35b6720-8ac2-4e5e-98df-306985da6924,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
> at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
> at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
> at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
> at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
> at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:719)
> [INFO]
> [INFO] Results:
> [INFO]
> [ERROR] Errors:
> [ERROR] TestGetFileChecksum.testGetFileChecksum » IO Failed to replace a bad datanode ...
> [INFO]
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0
> [INFO]
> [ERROR] There are test failures.
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
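For reference, the failure involves two independent knobs: the client-side replace-datanode-on-failure policy named in the exception, and the NameNode's consider-load check that produces {{NODE_TOO_BUSY}}. As a hedged illustration only (these are the stock property names from hdfs-default.xml in Hadoop 3.x; the values shown are one possible tuning for a small test cluster, not the fix in PR-2404), a test configuration could relax both:

{code:xml}
<!-- Client side: when a pipeline DataNode fails and no replacement can be
     found, keep writing with the remaining nodes instead of throwing
     "Failed to replace a bad datanode". -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>

<!-- NameNode side: do not exclude DataNodes whose xceiver load is above the
     cluster average. In a 3-DN MiniDFSCluster with replication=3, a single
     busy node is enough to make placement fail with NODE_TOO_BUSY=1. -->
<property>
  <name>dfs.namenode.redundancy.considerLoad</name>
  <value>false</value>
</property>
{code}

In a JUnit test the same settings would typically be applied programmatically on the {{Configuration}} object before starting the {{MiniDFSCluster}}; another common approach is simply to start the cluster with more DataNodes than the replication factor so a replacement candidate always exists.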