[ https://issues.apache.org/jira/browse/HDFS-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jack Yang updated HDFS-17608:
-----------------------------
Description:

The blocks on the decommissioning datanode are all EC striped blocks. The decommissioning process hangs forever and keeps printing these logs:

2024-08-26 10:31:14,748 WARN datanode.DataNode (DataNode.java:run(2927)) - DatanodeRegistration(10.18.130.251:1019, datanodeUuid=a9e27f77-eb6e-46df-ad4c-b5daf2bf9508, infoPort=1022, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=CID-75a4da17-d28b-4820-b781-7c9f8dced67f;nsid=2079136093;c=1692354715862):Failed to transfer BP-184818459-10.18.130.160-1692354715862:blk_-9223372036501683307_35436379 to x.x.x.x:1019 got
java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

2024-08-26 10:31:14,758 WARN datanode.DataNode (BlockSender.java:readChecksum(693)) - Could not read or failed to verify checksum for data at offset 10878976 for block BP-184818459-x.x.x.x-1692354715862:blk_-9223372036827280880_3990731
java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

The namenode outputs:

2024-08-26 10:39:13,404 INFO BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(373)) - Block: blk_-9223372036823520640_4252147, Expected Replicas: 9, live replicas: 8, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.18.130.68:1019 10.18.130.52:1019 10.18.129.137:1019 10.18.130.65:1019 10.18.129.150:1019 10.18.130.58:1019 10.18.137.12:1019 10.18.130.251:1019 10.18.129.171:1019 , Current Datanode: 10.18.130.251:1019, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2024-08-26 10:39:13,404 INFO blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:check(305)) - Node 10.18.130.251:1019 still has 3 blocks to replicate before it is a candidate to finish Decommission In Progress.
2024-08-26 10:39:13,404 INFO blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:run(188)) - Checked 3 blocks and 1 nodes this tick. 1 nodes are now in maintenance or transitioning state. 0 nodes pending.

The block (blk_-9223372036501683307_35436379) that the datanode is trying to access is on a disk that has a media error. dmesg keeps reporting:

[Mon Aug 26 10:41:28 2024] blk_update_request: I/O error, dev sdk, sector 12816298864 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#489 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#491 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Sense Key : Medium Error [current]
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Add. Sense: No additional sense information
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 CDB: Read(16) 88 00 00 00 00 03 06 09 e3 b0 00 00 00 08 00 00

When I try to cp this block file, I get an error:

cp blk_-9223372036827280880_3990731.meta /opt
cp: error reading 'blk_-9223372036827280880_3990731.meta': Input/output error
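The same Input/output error can be reproduced outside of HDFS with a few lines of plain Java (only a diagnostic sketch, not HDFS code; the file name below is just the example meta file from above, pass the real path as an argument):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Diagnostic sketch: sequentially read a suspect replica/meta file, the same way
// cp does, to confirm the disk returns EIO on the bad sectors.
public class ReadBlockFile {
    public static void main(String[] args) {
        // Example path; pass the real block or meta file path as the first argument.
        String path = args.length > 0 ? args[0]
                : "blk_-9223372036827280880_3990731.meta";
        byte[] buf = new byte[4096];
        long total = 0;
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            System.out.println("read OK, " + total + " bytes");
        } catch (IOException e) {
            // On the failing disk this ends with "Input/output error" (EIO), the same
            // exception the DataNode's BlockSender hits in the stack traces above.
            System.err.println("read failed after " + total + " bytes: " + e);
        }
    }
}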
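My reading of the logs (I may be wrong here) is that the DataTransfer thread hits the IOException, logs the WARN shown above and gives up on that transfer, while the NameNode still counts the replica on 10.18.130.251 as live, so DatanodeAdminDefaultMonitor keeps reporting the same "still has 3 blocks to replicate" line every tick. The toy model below only illustrates that loop; it is plain Java, not Hadoop code, and every name in it is made up:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Queue;

// Toy model of the observed loop; NOT Hadoop code, all names here are hypothetical.
public class DecommissionHangModel {

    // Stands in for the DataNode transfer: reading the replica always fails with EIO.
    static void transferBlock(String blockId) throws IOException {
        throw new IOException("Input/output error"); // disk media error
    }

    public static void main(String[] args) {
        Queue<String> pendingBlocks = new ArrayDeque<>();
        pendingBlocks.add("blk_-9223372036501683307_35436379");

        // Stands in for the admin monitor: each tick it re-checks the pending blocks.
        // Five ticks are enough to show the shape; the real monitor never stops.
        for (int tick = 1; tick <= 5; tick++) {
            Iterator<String> it = pendingBlocks.iterator();
            while (it.hasNext()) {
                String blk = it.next();
                try {
                    transferBlock(blk);
                    it.remove(); // transfer succeeded, block no longer pending
                } catch (IOException e) {
                    // In the real DataNode this is only a WARN; the block stays pending.
                }
            }
            System.out.println("tick " + tick + ": node still has "
                    + pendingBlocks.size() + " blocks to replicate");
        }
        // The count never drops, matching the repeating NameNode message above,
        // so the node never becomes a candidate to finish decommissioning.
    }
}

If the unreadable replica were reported as bad, or another source were chosen for the copy, the pending count could drop; that is what I would have expected instead of the hang.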
> Datanodes Decommissioning hang forever if the node under decommissioning has disk media error
> ---------------------------------------------------------------------------------------------
>
> Key: HDFS-17608
> URL: https://issues.apache.org/jira/browse/HDFS-17608
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.3.6
> Environment: Redhat 8.7, Hadoop 3.3.6
> Reporter: Jack Yang
> Priority: Major
> Attachments: image-2024-08-26-10-37-27-359.png