[ 
https://issues.apache.org/jira/browse/HDFS-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Yang updated HDFS-17608:
-----------------------------
    Description: 
The blocks on the decommissioning datanode are all EC striped blocks. The
decommissioning process hangs forever and keeps outputting these logs:

 

2024-08-26 10:31:14,748 WARN  datanode.DataNode (DataNode.java:run(2927)) -
DatanodeRegistration(10.18.130.251:1019, datanodeUuid=a9e27f77-eb6e-46df-ad4c-b5daf2bf9508,
infoPort=1022, infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=CID-75a4da17-d28b-4820-b781-7c9f8dced67f;nsid=2079136093;c=1692354715862):
Failed to transfer BP-184818459-10.18.130.160-1692354715862:blk_-9223372036501683307_35436379
to x.x.x.x:1019 got
java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
2024-08-26 10:31:14,758 WARN  datanode.DataNode (BlockSender.java:readChecksum(693)) -
Could not read or failed to verify checksum for data at offset 10878976 for block
BP-184818459-x.x.x.x-1692354715862:blk_-9223372036827280880_3990731
java.io.IOException: Input/output error
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:881)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:215)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:691)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:578)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:816)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:763)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2900)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

The namenode outputs:

2024-08-26 10:39:13,404 INFO  BlockStateChange 
(DatanodeAdminManager.java:logBlockReplicationInfo(373)) - Block: 
blk_-9223372036823520640_4252147, Expected Replicas: 9, live replicas: 8, 
corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
maintenance replicas: 0, live entering maintenance replicas: 0, replicas on 
stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, 
Datanodes having this block: 10.18.130.68:1019 10.18.130.52:1019 
10.18.129.137:1019 10.18.130.65:1019 10.18.129.150:1019 10.18.130.58:1019 
10.18.137.12:1019 10.18.130.251:1019 10.18.129.171:1019 , Current Datanode: 
10.18.130.251:1019, Is current datanode decommissioning: true, Is current 
datanode entering maintenance: false
2024-08-26 10:39:13,404 INFO  blockmanagement.DatanodeAdminDefaultMonitor 
(DatanodeAdminDefaultMonitor.java:check(305)) - Node 10.18.130.251:1019 still 
has 3 blocks to replicate before it is a candidate to finish Decommission In 
Progress.
2024-08-26 10:39:13,404 INFO  blockmanagement.DatanodeAdminDefaultMonitor 
(DatanodeAdminDefaultMonitor.java:run(188)) - Checked 3 blocks and 1 nodes this 
tick. 1 nodes are now in maintenance or transitioning state. 0 nodes pending.
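
For triage it can help to map the blocks the admin monitor is still waiting on back to
their files and to the node's overall decommission state. A minimal sketch, assuming a
reasonably recent fsck (the -blockId option takes the block id without the generation
stamp), using the ids from the logs above:

  # Ask the NameNode which file owns the stuck block and report its replica health
  hdfs fsck -blockId blk_-9223372036823520640

  # Show the decommissioning nodes and how many under-replicated blocks remain on them
  hdfs dfsadmin -report -decommissioning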

The block (blk_-9223372036501683307_35436379) that the datanode is trying to access
is on a disk that has a media error. dmesg keeps reporting:

[Mon Aug 26 10:41:28 2024] blk_update_request: I/O error, dev sdk, sector 
12816298864 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#489 BRCM Debug mfi stat 0x2d, 
data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#491 BRCM Debug mfi stat 0x2d, 
data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 BRCM Debug mfi stat 0x2d, 
data len requested/completed 0x1000/0x0
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 FAILED Result: 
hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Sense Key : Medium Error 
[current] 
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 Add. Sense: No additional 
sense information
[Mon Aug 26 10:41:28 2024] sd 0:2:10:0: [sdk] tag#493 CDB: Read(16) 88 00 00 00 
00 03 06 09 e3 b0 00 00 00 08 00 00
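
To confirm that the drive itself is reporting media errors (and not just the filesystem
layer), the SMART status can be checked as well; a sketch, assuming smartmontools is
installed and /dev/sdk is the device from the dmesg output. Since the disk sits behind a
MegaRAID-style controller here, smartctl may need the physical drive number via
-d megaraid,<N>:

  smartctl -H /dev/sdk         # overall health self-assessment
  smartctl -l error /dev/sdk   # drive error log; medium errors show up here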

 

When I try to cp this block's meta file, I get an error:

  cp  blk_-9223372036827280880_3990731.meta /opt
  cp: error reading 'blk_-9223372036827280880_3990731.meta': Input/output error
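
The unreadable replica can also be confirmed through the HDFS debug tooling rather than
cp; a sketch, assuming the block and meta files sit together under the DataNode data
directory (the <dataDir>/... path below is a placeholder, not the actual layout on this
node). On this disk it should fail with the same Input/output error:

  hdfs debug verifyMeta -meta <dataDir>/.../blk_-9223372036827280880_3990731.meta \
      -block <dataDir>/.../blk_-9223372036827280880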


> Datanode decommissioning hangs forever if the node under decommissioning has
> a disk media error
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17608
>                 URL: https://issues.apache.org/jira/browse/HDFS-17608
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.6
>         Environment: Redhat 8.7, Hadoop 3.3.6
>            Reporter: Jack Yang
>            Priority: Major
>         Attachments: image-2024-08-26-10-37-27-359.png
>


