[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-17 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827839#comment-17827839
 ] 

Rushabh Shah commented on HDFS-17299:
-

Merged to branch-2.10 and branch-3.3 also. Thank you [~gargrite] for the patch.

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.10.3, 3.3.9, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> a

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-17 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Fix Version/s: 2.10.3

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.10.3, 3.3.9, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.h

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-14 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Fix Version/s: 3.3.7

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.7, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdf

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-14 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Fix Version/s: 3.3.9
   (was: 3.3.7)

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.9, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824197#comment-17824197
 ] 

Rushabh Shah commented on HDFS-17299:
-

[~slfan1989] By mistake, I cherry-picked this change to branch-3.4.0. I was not 
aware that you were trying to do an RC from that branch.
This is the commit: 
https://github.com/apache/hadoop/commit/f62f116b75dc999c2d4a1824cf099c34e64e74aa

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus

[jira] [Resolved] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HDFS-17299.
-
Fix Version/s: 3.4.1
   3.5.0
   Resolution: Fixed

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824194#comment-17824194
 ] 

Rushabh Shah commented on HDFS-17299:
-

Thank you [~gargrite] for the PR.

The patch applied cleanly to trunk, branch-3.4, and branch-3.4.0.
There are some conflicts applying it to branch-3.3, branch-3.2, and branch-2.10. 
Can you please create PRs for those branches?
Thank you.

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
>  

[jira] [Created] (HDFS-17409) TestDFSAdmin has couple of flaky tests.

2024-03-05 Thread Rushabh Shah (Jira)
Rushabh Shah created HDFS-17409:
---

 Summary: TestDFSAdmin has couple of flaky tests.
 Key: HDFS-17409
 URL: https://issues.apache.org/jira/browse/HDFS-17409
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.5.0
Reporter: Rushabh Shah


TestDFSAdmin has a couple of flaky tests:
TestDFSAdmin#testAllDatanodesReconfig
TestDFSAdmin#testDecommissionDataNodesReconfig

We are seeing failures in a couple of separate PRs:
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6583/7/#showFailuresLink
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/

testDecommissionDataNodesReconfig is failing locally with the following error:

{noformat}
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertTrue(Assert.java:53)
at 
org.apache.hadoop.hdfs.tools.TestDFSAdmin.testDecommissionDataNodesReconfig(TestDFSAdmin.java:1356)
{noformat}
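
The assertion above fails with a bare java.lang.AssertionError, so the CI log does not show what state the test actually observed. Purely as an illustration (not the actual TestDFSAdmin code; the helper and the "SUCCESS" condition below are made up), attaching a message to the assertion makes such flaky failures diagnosable from the log:

{code:java}
import static org.junit.Assert.assertTrue;
import java.util.List;

public class AssertMessageExample {
  // Hypothetical helper; "outs" stands in for whatever output the real test captures.
  static void checkReconfigOutput(List<String> outs) {
    boolean found = outs.stream().anyMatch(line -> line.contains("SUCCESS"));
    // assertTrue(found);  // fails with only "java.lang.AssertionError", as in the trace above
    assertTrue("expected a SUCCESS line in the reconfig output, got: " + outs,
        found);            // failure message now includes the captured output
  }
}
{code}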







[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas

2024-01-11 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805723#comment-17805723
 ] 

Rushabh Shah commented on HDFS-16064:
-

[~KevinWikant] By any chance, do you have a draft PR for branch-2.10 saved?

> Determine when to invalidate corrupt replicas based on number of usable 
> replicas
> 
>
> Key: HDFS-16064
> URL: https://issues.apache.org/jira/browse/HDFS-16064
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 3.2.1
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.5
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
> non-issue under the assumption that if the namenode & a datanode get into an 
> inconsistent state for a given block pipeline, there should be another 
> datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have 
> encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
> order to satisfy their minimum replication factor of 2
>  * during this replication process 
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes 
> the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
> dst: /DN3:9866; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum 
> replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it 
> because of HDFS-721
>  * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
> fails, this continues indefinitely
>  * therefore DN4 is the only live datanode with the block & the minimum 
> replication factor of 2 cannot be satisfied
>  * because the minimum replication factor cannot be satisfied for the 
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never be 
> completed 
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of 
> DataNode decommissioning
> A few potential solutions:
>  * Address the root cause of the problem which is an inconsistent state 
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
>  * Detect when datanode decommissioning is stuck due to lack of available 
> datanodes for satisfying the minimum replication factor, then recover by 
> re-enabling the datanodes being decommissioned
>  






[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas

2024-01-11 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805704#comment-17805704
 ] 

Rushabh Shah commented on HDFS-16064:
-

Thank you [~KevinWikant] for your reply. We also hit this bug recently in our 
production cluster. One of my colleagues is trying to backport this patch to 
branch-2.10. Let us see if we have any luck backporting this change.

> Determine when to invalidate corrupt replicas based on number of usable 
> replicas
> 
>
> Key: HDFS-16064
> URL: https://issues.apache.org/jira/browse/HDFS-16064
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 3.2.1
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.5
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
> non-issue under the assumption that if the namenode & a datanode get into an 
> inconsistent state for a given block pipeline, there should be another 
> datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have 
> encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
> order to satisfy their minimum replication factor of 2
>  * during this replication process 
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes 
> the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
> dst: /DN3:9866; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum 
> replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it 
> because of HDFS-721
>  * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
> fails, this continues indefinitely
>  * therefore DN4 is the only live datanode with the block & the minimum 
> replication factor of 2 cannot be satisfied
>  * because the minimum replication factor cannot be satisfied for the 
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never be 
> completed 
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of 
> DataNode decommissioning
> A few potential solutions:
>  * Address the root cause of the problem which is an inconsistent state 
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
>  * Detect when datanode decommissioning is stuck due to lack of available 
> datanodes for satisfying the minimum replication factor, then recover by 
> re-enabling the datanodes being decommissioned
>  




[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas

2024-01-10 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805302#comment-17805302
 ] 

Rushabh Shah commented on HDFS-16064:
-

[~KevinWikant]  [~aajisaka]  Any reason why we haven't backported this fix to 
branch-2.10? 

> Determine when to invalidate corrupt replicas based on number of usable 
> replicas
> 
>
> Key: HDFS-16064
> URL: https://issues.apache.org/jira/browse/HDFS-16064
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 3.2.1
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.5
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
> non-issue under the assumption that if the namenode & a datanode get into an 
> inconsistent state for a given block pipeline, there should be another 
> datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have 
> encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
> order to satisfy their minimum replication factor of 2
>  * during this replication process 
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes 
> the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
> dst: /DN3:9866; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum 
> replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it 
> because of HDFS-721
>  * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
> fails, this continues indefinitely
>  * therefore DN4 is the only live datanode with the block & the minimum 
> replication factor of 2 cannot be satisfied
>  * because the minimum replication factor cannot be satisfied for the 
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never be 
> completed 
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of 
> DataNode decommissioning
> A few potential solutions:
>  * Address the root cause of the problem which is an inconsistent state 
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
>  * Detect when datanode decommissioning is stuck due to lack of available 
> datanodes for satisfying the minimum replication factor, then recover by 
> re-enabling the datanodes being decommissioned
>  






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-01-02 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801889#comment-17801889
 ] 

Rushabh Shah commented on HDFS-17299:
-

Thank you [~ayushtkn] 

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.jav

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-01-02 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801867#comment-17801867
 ] 

Rushabh Shah commented on HDFS-17299:
-

[~gargrite] is interested in working on this Jira. [~ayushtkn] [~hexiaoqiao] Can 
one of you please add him to the Contributors list so that I can assign the 
Jira to him? Thank you!

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-27 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Description: 
Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
Our configuration:
1. We use 3 Availability Zones (AZs) for fault tolerance.
2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
3. We use the following configuration parameters: 
dfs.namenode.heartbeat.recheck-interval: 600000 
dfs.heartbeat.interval: 3 
So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
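For reference, the 20.5 minutes follows from the usual HDFS dead-node heuristic 
(2 x recheck interval + 10 x heartbeat interval); a quick sanity check of the 
arithmetic with the values above (illustrative snippet only):
{code:java}
public class DeadNodeIntervalCheck {
  public static void main(String[] args) {
    long recheckIntervalMs = 600_000L; // dfs.namenode.heartbeat.recheck-interval
    long heartbeatIntervalSec = 3L;    // dfs.heartbeat.interval
    // Namenode treats a datanode as dead after 2 * recheck-interval + 10 * heartbeat-interval.
    long deadIntervalMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalSec * 1000;
    System.out.println(deadIntervalMs + " ms = " + (deadIntervalMs / 60000.0) + " mins"); // 1230000 ms = 20.5 mins
  }
}
{code}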

 

Steps to reproduce:
 # Bring down 1 AZ.
 # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush 
on the newly created file.
 # DataStreamer is not able to find blocks locations that satisfies the rack 
placement policy (one copy in each rack which essentially means one copy in 
each AZ)
 # Since all the datanodes in that AZ are down but still alive to namenode, the 
client gets different datanodes but still all of them are in the same AZ. See 
logs below.
 # HBase is not able to create a WAL file and it aborts the region server.

 

Relevant logs from hdfs client and namenode

 
{noformat}
2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
allowed=true  ugi=hbase/ (auth:KERBEROS) ip=  
cmd=create  src=/hbase/WALs/  dst=null

2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652565_140946716, replicas=:50010, 
:50010, :50010 for /hbase/WALs/

2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874--1594838129323:blk_1214652565_140946716
2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]


2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652580_140946764, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]


2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652594_140946796, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,712 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796
2023-12-16 17:17:44,732 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK]


2023-12-16 17:17:44,855 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652607_140946850, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,867 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
a

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-27 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800840#comment-17800840
 ] 

Rushabh Shah commented on HDFS-17299:
-

Thank you [~ayushtkn]  [~hexiaoqiao]  [~tasanuma]  for the healthy discussion.

 

>  Excluding a rack in the streamer is quite tricky, we don't know the BPP 
>neither the Cluster Rack configuration during the {{DataStreamer}} setup.

I don't think this is tricky. Just like we pass excluded nodes, we would have to 
provide an excludedRacks argument to the namenode. On the namenode side, we would 
have to set the scope parameter with a leading "~": the negation sign at the start 
of the scope is important because it excludes the whole rack. We can add some extra 
protection so that if the excluded rack is the only rack available, the parameter 
is ignored.
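To make the intent concrete, here is a minimal, hypothetical sketch of what a 
"~<rack>" exclusion scope could mean when picking targets; the class, method and 
rack names are illustrative only and this is not the actual BlockPlacementPolicy 
or NetworkTopology code:
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ExclusionScopeSketch {

  /**
   * Keeps only the candidate datanodes allowed by the scope. A scope starting
   * with "~" means "everything EXCEPT this rack", i.e. the negation semantics
   * described above. If the exclusion would leave no candidates (the excluded
   * rack is the only rack available), the exclusion is ignored as a safety net.
   */
  static Map<String, String> applyScope(Map<String, String> nodeToRack, String scope) {
    if (scope == null || !scope.startsWith("~")) {
      return nodeToRack; // no exclusion requested
    }
    String excludedRack = scope.substring(1);
    Map<String, String> remaining = nodeToRack.entrySet().stream()
        .filter(e -> !e.getValue().equals(excludedRack))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
            (a, b) -> a, LinkedHashMap::new));
    return remaining.isEmpty() ? nodeToRack : remaining;
  }

  public static void main(String[] args) {
    Map<String, String> cluster = new LinkedHashMap<>();
    cluster.put("dn-a", "/az-1"); // the AZ that keeps failing the pipeline setup
    cluster.put("dn-b", "/az-2");
    cluster.put("dn-c", "/az-3");
    System.out.println(applyScope(cluster, "~/az-1")); // {dn-b=/az-2, dn-c=/az-3}
  }
}
{code}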

 

Since everyone is of the opinion that this should be handled on the client side, 
I am fine with that too. I don't see any problem with writing and committing the 
file with 2 replicas and letting the namenode handle the replication later. Let me 
create the patch for review soon. Thank you again!

 

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStre

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-24 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800217#comment-17800217
 ] 

Rushabh Shah commented on HDFS-17299:
-

> Maybe if you would have put 
> dfs.client.block.write.replace-datanode-on-failure.enable as false, it 
> wouldn't have tried to replace the DN itself & went ahead with 2 DN from 
> other AZ?

It is entirely possible that I am not reading the code right; I am a little bit 
out of sync with the DataStreamer codebase.
But I don't see the config property 
dfs.client.block.write.replace-datanode-on-failure.enable being used anywhere 
in the PIPELINE_SETUP_CREATE phase.
I am looking at branch-2.10. This is the code flow: 
[DataStreamer#run()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L708-L711]
 --> 
[nextBlockOutputStream()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1655]
 --> 
[createBlockOutputStream()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1751]
 

There is a retry within nextBlockOutputStream, governed by 
dfs.client.block.write.retries, but it does not take 
dfs.client.block.write.replace-datanode-on-failure.enable into consideration 
(see the simplified sketch below).
Cc [~ayushtkn] [~hexiaoqiao]
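A highly simplified sketch of that retry loop, to illustrate the behaviour 
described above; this is not the actual branch-2.10 DataStreamer code, and the 
interface and method names are made up:
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the PIPELINE_SETUP_CREATE retry loop. */
public class SetupCreateRetrySketch {

  interface Namenode {
    // Stand-in for addBlock(): returns pipeline targets, honouring excluded *nodes* only.
    List<String> allocateBlock(List<String> excludedNodes);
  }

  interface Pipeline {
    // Stand-in for createBlockOutputStream(): true if the pipeline connected.
    boolean connect(List<String> targets);
  }

  static boolean nextBlockOutputStream(Namenode nn, Pipeline pipe, int blockWriteRetries) {
    List<String> excludedNodes = new ArrayList<>();
    // Attempts are bounded by dfs.client.block.write.retries;
    // dfs.client.block.write.replace-datanode-on-failure.* is never consulted here.
    for (int attempt = 0; attempt <= blockWriteRetries; attempt++) {
      List<String> targets = nn.allocateBlock(excludedNodes);
      if (pipe.connect(targets)) {
        return true;
      }
      // Only the individual bad datanode is excluded (simplified: assume the
      // first target failed); its rack is not, so the namenode is free to hand
      // out another node from the same unreachable AZ on the next attempt.
      excludedNodes.add(targets.get(0));
    }
    return false; // the streamer gives up and the create ultimately fails
  }
}
{code}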

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-22 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799891#comment-17799891
 ] 

Rushabh Shah commented on HDFS-17299:
-

From the file creation (at 2023-12-16 17:17:43,818) to the DataStreamer giving 
up (at 2023-12-16 17:17:44,988), it took just a little more than 1 second. I 
doubt keeping the stale datanode interval at 9 seconds will help. I still think my 
proposed fix will help in such cases. [~ayushtkn] WDYT? 

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBl

[jira] [Comment Edited] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-22 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799871#comment-17799871
 ] 

Rushabh Shah edited comment on HDFS-17299 at 12/22/23 4:28 PM:
---

bq.  You should find the reason why the nodes were chosen in 1 AZ only, why no 
node was chosen in other 2 AZ, it will fallback to choosing nodes in 1 AZ(rack) 
only when it fails to spread them to different racks 
 
It is choosing 1 node from each rack. We have replication factor 3 and we have 
3 AZ (i.e. 3 racks) and we use BlockPlacementPolicyRackFaultTolerant as our 
block placement policy. While creating a write pipeline, it will choose 1 node 
from each AZ. But since 1 AZ is down, the datanode chosen in the bad AZ will 
always fail.
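As a toy illustration of the above (the rack and node names are made up): with 
one replica per rack and one AZ unreachable, every pipeline the namenode hands 
back contains a node that cannot be reached.
{code:java}
import java.util.List;
import java.util.Map;

public class OneNodePerRackSketch {
  public static void main(String[] args) {
    // Toy topology: one candidate per AZ; az-1 is unreachable but not yet marked dead.
    Map<String, String> nodeToRack = Map.of(
        "dn-a", "/az-1",
        "dn-b", "/az-2",
        "dn-c", "/az-3");
    List<String> unreachableRacks = List.of("/az-1");

    // With replication factor 3, 3 racks and a rack-fault-tolerant policy,
    // every pipeline has exactly one node per rack.
    List<String> pipeline = List.of("dn-a", "dn-b", "dn-c");

    boolean alwaysBroken = pipeline.stream()
        .anyMatch(dn -> unreachableRacks.contains(nodeToRack.get(dn)));
    System.out.println("Pipeline contains an unreachable node: " + alwaysBroken); // true
  }
}
{code}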


bq. Did you explore dfs.namenode.avoid.write.stale.datanode and 
dfs.namenode.stale.datanode.interval. I believe that can bring down your value 
to few seconds, 30 by default but you can get that down to 3*Heartbeat interval 
IIRC.

Haven't explored these configs yet. Thank you for pointing this out. Will take 
a look today. 




was (Author: shahrs87):
bq.  You should find the reason why the nodes were chosen in 1 AZ only, why no 
node was chosen in other 2 AZ, it will fallback to choosing nodes in 1 AZ(rack) 
only when it fails to spread them to different racks 
 
It is choosing 1 node from each rack. We have replication factor 3 and we have 
3 AZ (i.e. 3 racks) and we use BlockPlacementPolicyRackFaultTolerant as our 
block placement policy. While creating a write pipeline, it will choose 1 node 
from each AZ. But since 1 AZ is down, the datanode chosen in the bad AZ will 
always fail.


bq. Did you explore dfs.namenode.avoid.write.stale.datanode and 
dfs.namenode.stale.datanode.interval. I believe that can bring down your value 
to few seconds, 30 by default but you can get that down to 3*Heartbeat interval 
IIRC.

Haven't explored this configs yet. Thank you for pointing this out. Will take a 
look today. 



> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-22 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799871#comment-17799871
 ] 

Rushabh Shah commented on HDFS-17299:
-

bq.  You should find the reason why the nodes were chosen in 1 AZ only, why no 
node was chosen in other 2 AZ, it will fallback to choosing nodes in 1 AZ(rack) 
only when it fails to spread them to different racks 
 
It is choosing 1 node from each rack. We have replication factor 3 and we have 
3 AZ (i.e. 3 racks) and we use BlockPlacementPolicyRackFaultTolerant as our 
block placement policy. While creating a write pipeline, it will choose 1 node 
from each AZ. But since 1 AZ is down, the datanode chosen in the bad AZ will 
always fail.


bq. Did you explore dfs.namenode.avoid.write.stale.datanode and 
dfs.namenode.stale.datanode.interval. I believe that can bring down your value 
to few seconds, 30 by default but you can get that down to 3*Heartbeat interval 
IIRC.

Haven't explored these configs yet. Thank you for pointing this out. Will take a 
look today. 
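For reference, a minimal sketch of how those two settings could be applied, 
assuming the 3 x heartbeat (9 s) figure discussed above; in practice they would 
go into the namenode's hdfs-site.xml rather than be set in code, and the value 
here is illustrative, not a recommendation:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class StaleDatanodeConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    // Skip datanodes with missed heartbeats when choosing write targets.
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
    // Mark a datanode stale after 3 heartbeat intervals (3 x 3 s = 9 s).
    conf.setLong("dfs.namenode.stale.datanode.interval", 9000L);
    System.out.println(conf.get("dfs.namenode.stale.datanode.interval"));
  }
}
{code}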



> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-21 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Description: 
Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
Our configuration:
1. We use 3 Availability Zones (AZs) for fault tolerance.
2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
3. We use the following configuration parameters: 
dfs.namenode.heartbeat.recheck-interval: 600000 
dfs.heartbeat.interval: 3 
So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.

 

Steps to reproduce:
 # Bring down 1 AZ.
 # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush 
on the newly created file.
 # DataStreamer is not able to find blocks locations that satisfies the rack 
placement policy (one copy in each rack which essentially means one copy in 
each AZ)
 # Since all the datanodes in that AZ are down but still alive to namenode, the 
client gets different datanodes but still all of them are in the same AZ. See 
logs below.
 # HBase is not able to create a WAL file and it aborts the region server.

 

Relevant logs from hdfs client and namenode

 
{noformat}
2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
allowed=true  ugi=hbase/ (auth:KERBEROS) ip=  
cmd=create  src=/hbase/WALs/  dst=null

2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652565_140946716, replicas=:50010, 
:50010, :50010 for /hbase/WALs/

2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874--1594838129323:blk_1214652565_140946716
2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]


2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652580_140946764, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]


2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652594_140946796, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,712 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796
2023-12-16 17:17:44,732 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK]


2023-12-16 17:17:44,855 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652607_140946850, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,867 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
a

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-21 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Description: 
Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
Our configuration:
1. We use 3 Availability Zones (AZs) for fault tolerance.
2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
3. We use the following configuration parameters: 
dfs.namenode.heartbeat.recheck-interval: 600000 
dfs.heartbeat.interval: 3 
So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.

 

Steps to reproduce:
 # Bring down 1 AZ.
 # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush 
on the newly created file.
 # DataStreamer is not able to find blocks locations that satisfies the rack 
placement policy (one copy in each rack which essentially means one copy in 
each AZ)
 # Since all the datanodes in that AZ are down but still alive to namenode, the 
client gets different datanodes but still all of them are in the same AZ. See 
logs below.
 # HBase is not able to create a WAL file and it aborts the region server.

 

Relevant logs from hdfs client and namenode

 
{noformat}
2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
allowed=true  ugi=hbase/ (auth:KERBEROS) ip=  
cmd=create  src=/hbase/WALs/  dst=null

2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652565_140946716, replicas=:50010, 
:50010, :50010 for /hbase/WALs/

2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874--1594838129323:blk_1214652565_140946716
2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]


2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652580_140946764, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]


2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652594_140946796, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,712 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796
2023-12-16 17:17:44,732 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK]


2023-12-16 17:17:44,855 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652607_140946850, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,867 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
a

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-21 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Description: 
Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
Our configuration:
1. We use 3 Availability Zones (AZs) for fault tolerance.
2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
3. We use the following configuration parameters: 
dfs.namenode.heartbeat.recheck-interval: 600000 
dfs.heartbeat.interval: 3 
So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.

 

Steps to reproduce:
 # Bring down 1 AZ.
 # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush 
on the newly created file.
 # DataStreamer is not able to find blocks locations that satisfies the rack 
placement policy (one copy in each rack which essentially means one copy in 
each AZ)
 # Since all the datanodes in that AZ are down but still alive to namenode, the 
client gets different datanodes but still all of them are in the same AZ. See 
logs below.
 # HBase is not able to create a WAL file and it aborts the region server.

 

Relevant logs from hdfs client and namenode

 
{noformat}
2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
allowed=true  ugi=hbase/ (auth:KERBEROS) ip=  
cmd=create  src=/hbase/WALs/  dst=null

2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652565_140946716, replicas=:50010, 
:50010, :50010 for /hbase/WALs/

2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874--1594838129323:blk_1214652565_140946716
2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]


2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652580_140946764, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]


2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652594_140946796, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,712 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796
2023-12-16 17:17:44,732 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
datanode 
DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK]


2023-12-16 17:17:44,855 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652607_140946850, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,867 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
a

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-21 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Description: 
Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
Our configuration:
1. We use 3 Availability Zones (AZs) for fault tolerance.
2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
3. We use the following configuration parameters: 
dfs.namenode.heartbeat.recheck-interval: 600000 
dfs.heartbeat.interval: 3 
So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.

 

Steps to reproduce:
 # Bring down 1 AZ.
 # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush 
on the newly created file.
 # DataStreamer is not able to find blocks locations that satisfies the rack 
placement policy (one copy in each rack which essentially means one copy in 
each AZ)
 # Since all the datanodes in that AZ are down but still alive to namenode, the 
client gets different datanodes but still all of them are in the same AZ. See 
logs below.
 # HBase is not able to create a WAL file and it aborts the region server.

 

Relevant logs from hdfs client and namenode

 
{noformat}
2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
allowed=true  ugi=hbase/ (auth:KERBEROS) ip=  
cmd=create  src=/hbase/WALs/  dst=null

2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652565_140946716, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874--1594838129323:blk_1214652565_140946716
2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]

2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652580_140946764, replicas=:50010, :50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]

2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652594_140946796, replicas=:50010, :50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,712 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796
2023-12-16 17:17:44,732 WARN  [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK]

2023-12-16 17:17:44,855 INFO  [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652607_140946850, replicas=:50010, :50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,867 INFO  [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-21 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17299:

Priority: Critical  (was: Major)

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {code:java}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStreamjava.io.IOException: Got error, status=ERROR, status 
> message , ack with firstBadLink as :50010        at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) 
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - 
> Abandoning 
> BP-179318874--1594838129323:blk_1214652565_1409467162023-12-16 
> 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/2023-12-16 
> 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStreamjava.io.IOException: Got error, status=ERROR, status 
> message , ack with firstBadLink as :50010        at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)2023-12-16 
> 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_1409467642023-12-16 
> 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/2023-12-16 
> 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStreamjava.io.IOException: Got error, status=ERROR, status 
> message , ack with firstBadLink as :50010        at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)2023-12-16 
> 17:17:44,712 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-15948

[jira] [Created] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-21 Thread Rushabh Shah (Jira)
Rushabh Shah created HDFS-17299:
---

 Summary: HDFS is not rack failure tolerant while creating a new 
file.
 Key: HDFS-17299
 URL: https://issues.apache.org/jira/browse/HDFS-17299
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.10.1
Reporter: Rushabh Shah


Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
Our configuration:
1. We use 3 Availability Zones (AZs) for fault tolerance.
2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
3. We use the following configuration parameters: 
dfs.namenode.heartbeat.recheck-interval: 600000 
dfs.heartbeat.interval: 3 
So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.

 

Steps to reproduce:
 # Bring down 1 AZ.
 # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush 
on the newly created file.
 # DataStreamer is not able to find block locations that satisfy the rack 
placement policy (one copy in each rack, which essentially means one copy in 
each AZ).
 # Since all the datanodes in that AZ are down but still considered alive by the 
namenode, the client keeps getting different datanodes, but all of them are in 
the same AZ. See logs below.
 # HBase is not able to create a WAL file and it aborts the region server.

 

Relevant logs from hdfs client and namenode

 
{code:java}
2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
allowed=true  ugi=hbase/ (auth:KERBEROS) ip=  
cmd=create  src=/hbase/WALs/  dst=null

2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652565_140946716, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
        at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
        at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874--1594838129323:blk_1214652565_140946716
2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding datanode 
DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]

2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652580_140946764, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
        at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
        at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding datanode 
DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]

2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652594_140946796, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with 
firstBadLink as :50010
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
        at 
org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
        at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,712 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796
2023-12-16 17:17:44,732 WARN  [Thread-39087] hdfs.DataStreamer - Excluding datanode 
DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK]

2023-12-16 17:17:44,855 INFO  [on default port 9000] hdfs.StateChange - BLOCK* 
allocate blk_1214652607_140946850, replicas=:50010, 
:50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,867 INFO  

[jira] [Commented] (HDFS-16540) Data locality is lost when DataNode pod restarts in kubernetes

2023-09-29 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770545#comment-17770545
 ] 

Rushabh Shah commented on HDFS-16540:
-

> I'll let folks ask for the backport before doing it for branch-3.3.

Since the patch here fixes the bug that caused data loss in our production 
environment, I would prefer we backport it to all active branches. [~stack] 
WDYT? I can create a patch if you are busy.

> Data locality is lost when DataNode pod restarts in kubernetes 
> ---
>
> Key: HDFS-16540
> URL: https://issues.apache.org/jira/browse/HDFS-16540
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.2
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> We have HBase RegionServer and Hdfs DataNode running in one pod. When the pod 
> restarts, we found that data locality is lost after we do a major compaction 
> of hbase regions. After some debugging, we found that upon pod restarts, its 
> ip changes. In DatanodeManager, maps like networktopology are updated with 
> the new info. host2DatanodeMap is not updated accordingly. When hdfs client 
> with the new ip tries to find a local DataNode, it fails. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17188) Data loss in our production clusters due to missing HDFS-16540

2023-09-13 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17188:

Description: 
Recently we saw missing blocks in our production clusters running on dynamic 
environments like AWS. We are running some version of hadoop-2.10 code line.

Events that led to data loss:
 #  We have pool of available IP address and whenever datanode restarts we use 
any available IP address from that pool.
 #  We have seen during the lifetime of namenode process, multiple datanodes 
were restarted and the same datanode has used different IP address.
 # One case that I was debugging was very interesting. 
DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> 
ip-address-3
DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> 
ip-address-1 
Observe the last IP address change for DN-UUID-2. It is ip-address-1 which is 
the first ip address of DN-UUID-1
 #  There was some bug in our operational script which led to all datanodes 
getting restarted at the same time.

Just after the restart, we see the following log lines.
{noformat}
2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.1:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.3:50010
2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.4:50010
{noformat}
This line is logged 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1184].

Snippet below:
{code:java}
      DatanodeDescriptor nodeS = getDatanode(nodeReg.getDatanodeUuid());
      DatanodeDescriptor nodeN = host2DatanodeMap.getDatanodeByXferAddr(
          nodeReg.getIpAddr(), nodeReg.getXferPort());
        
      if (nodeN != null && nodeN != nodeS) {
        NameNode.LOG.info("BLOCK* registerDatanode: " + nodeN);
        // nodeN previously served a different data storage, 
        // which is not served by anybody anymore.
        removeDatanode(nodeN);
        // physically remove node from datanodeMap
        wipeDatanode(nodeN);
        nodeN = null;
      } {code}
 

This happens when the DatanodeDescriptor is not the same in datanodeMap and 
host2DatanodeMap. HDFS-16540 fixed this bug for lost data locality and not data 
loss. :)

By filing this jira, I want to discuss following things:
 # Do we really want to call removeDatanode method from namenode whenever any 
such discrepancy in maps is spotted? or  Can we rely on the first full block 
report or the periodic full block report from the datanode to fix the metadata? 
 # Improve logging in the blockmanagement code to debug these issues faster.
 # Add a test case with the exact events that occured in our env and still make 
sure that datanodeMap and host2DatanodeMap are consistent.

  was:
Recently we saw missing blocks in our production clusters running on dynamic 
environments like AWS. We are running some version of hadoop-2.10 code line.

Events that led to data loss:
 #  We have pool of available IP address and whenever datanode restarts we use 
any available IP address from that pool.
 #  We have seen during the lifetime of namenode process, multiple datanodes 
were restarted and the same datanode has used different IP address.
 # One case that I was debugging was very interesting. 
DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> 
ip-address-3
DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> 
ip-address-1 
Observe the last IP address change for DN-UUID-2. It is ip-address-1 which is 
the first ip address of DN-UUID-1
 #  There was some bug in our operational script which led to all datanodes 
getting restarted at the same time.

Just after the restart, we see the following log lines.
{noformat}
2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.1:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.3:50010
2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.4:50010
{noformat}
This line is logged 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/Datanode

[jira] [Updated] (HDFS-17188) Data loss in our production clusters due to missing HDFS-16540

2023-09-13 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17188:

Description: 
Recently we saw missing blocks in our production clusters running on dynamic 
environments like AWS. We are running some version of hadoop-2.10 code line.

Events that led to data loss:
 #  We have pool of available IP address and whenever datanode restarts we use 
any available IP address from that pool.
 #  We have seen during the lifetime of namenode process, multiple datanodes 
were restarted and the same datanode has used different IP address.
 # One case that I was debugging was very interesting. 
DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> 
ip-address-3
DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> 
ip-address-1 
Observe the last IP address change for DN-UUID-2. It is ip-address-1 which is 
the first ip address of DN-UUID-1
 #  There was some bug in our operational script which led to all datanodes 
getting restarted at the same time.

Just after the restart, we see the following log lines.
{noformat}
2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.1:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.3:50010
2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.4:50010
{noformat}
This line is logged 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1184].

Snippet below:
{code:java}
      DatanodeDescriptor nodeS = getDatanode(nodeReg.getDatanodeUuid());
      DatanodeDescriptor nodeN = host2DatanodeMap.getDatanodeByXferAddr(
          nodeReg.getIpAddr(), nodeReg.getXferPort());
        
      if (nodeN != null && nodeN != nodeS) {
        NameNode.LOG.info("BLOCK* registerDatanode: " + nodeN);
        // nodeN previously served a different data storage, 
        // which is not served by anybody anymore.
        removeDatanode(nodeN);
        // physically remove node from datanodeMap
        wipeDatanode(nodeN);
        nodeN = null;
      } {code}
 

This happens when the DatanodeDescriptor is not the same in datanodeMap and 
host2DatanodeMap. HDFS-16540 fixed this bug for lost data locality and not data 
loss. :)

By filing this jira, I want to discuss following things:
 # Do we really want to call removeDatanode method from namenode whenever any 
such discrepancy in maps is spotted? Can we rely on the first full block report 
or the periodic full block report from the datanode to fix the metadata? 
 # Improve logging in the blockmanagement code to debug these issues faster.
 # Add a test case with the exact events that occured in our env and still make 
sure that datanodeMap and host2DatanodeMap are consistent.

  was:
Recently we saw missing blocks in our production clusters running on dynamic 
environments like AWS. We are running some version of hadoop-2.10 code line.

Events that led to data loss:
 #  We have pool of available IP address and whenever datanode restarts we use 
any available IP address from that pool.
 #  We have seen during the lifetime of namenode process, multiple datanodes 
were restarted and the same datanode has used different IP address.
 # One case that I was debugging was very interesting. 
DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> 
ip-address-3
DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> 
ip-address-1 
Observe the last IP address change for DN-UUID-2. It is ip-address-1 which is 
the first ip address of DN-UUID-1
 #  There was some bug in our operational script which led to all datanodes 
getting restarted at the same time.

Just after the restart, we see the following log lines.
{noformat}
2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.1:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.3:50010
2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.4:50010
{noformat}
This line is logged 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeMana

[jira] [Updated] (HDFS-17188) Data loss in our production clusters due to missing HDFS-16540

2023-09-13 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-17188:

Description: 
Recently we saw missing blocks in our production clusters running on dynamic 
environments like AWS. We are running some version of hadoop-2.10 code line.

Events that led to data loss:
 #  We have pool of available IP address and whenever datanode restarts we use 
any available IP address from that pool.
 #  We have seen during the lifetime of namenode process, multiple datanodes 
were restarted and the same datanode has used different IP address.
 # One case that I was debugging was very interesting. 
DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> 
ip-address-3
DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> 
ip-address-1 
Observe the last IP address change for DN-UUID-2. It is ip-address-1 which is 
the first ip address of DN-UUID-1
 #  There was some bug in our operational script which led to all datanodes 
getting restarted at the same time.

Just after the restart, we see the following log lines.
{noformat}
2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.1:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.3:50010
2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.4:50010
{noformat}
This line is logged 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1184].

Snippet below:
{code:java}
      DatanodeDescriptor nodeS = getDatanode(nodeReg.getDatanodeUuid());
      DatanodeDescriptor nodeN = host2DatanodeMap.getDatanodeByXferAddr(
          nodeReg.getIpAddr(), nodeReg.getXferPort());
        
      if (nodeN != null && nodeN != nodeS) {
        NameNode.LOG.info("BLOCK* registerDatanode: " + nodeN);
        // nodeN previously served a different data storage, 
        // which is not served by anybody anymore.
        removeDatanode(nodeN);
        // physically remove node from datanodeMap
        wipeDatanode(nodeN);
        nodeN = null;
      } {code}
 

This happens when the DatanodeDescriptor is not the same in datanodeMap and 
host2DatanodeMap. HDFS-16540 fixed this bug for lost data locality and not data 
loss. :)

 

By filing this jira, I want to discuss following things:
 # Do we really want to call removeDatanode method from namenode whenever any 
such discrepancy in maps is spotted? Can we rely on the first full block report 
or the periodic full block report from the datanode to fix the metadata? 
 # Improve logging in the blockmanagement code to debug these issues faster.
 # Add a test case with the exact events that occured in our env and still make 
sure that datanodeMap and host2DatanodeMap are consistent.

  was:
Recently we saw missing blocks in our production clusters running on dynamic 
environments like AWS. We are running some version of hadoop-2.10 code line.

Events that led to data loss:
 #  We have pool of available IP address and whenever datanode restarts we use 
any available IP address from that pool.
 #  We have seen during the lifetime of namenode process, multiple datanodes 
were restarted and the same datanode has used different IP address.
 # One case that I was debugging was very interesting. 
DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> 
ip-address-3
DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> 
ip-address-1 
Observe the last IP address change for DN-UUID-2. It is ip-address-1 which is 
the first ip address of DN-UUID-1
 #  There was some bug in our operational script which led to all datanodes 
getting restarted at the same time.

Just after the restart, we see the following log lines.
{noformat}
2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.1:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.3:50010
2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.4:50010
{noformat}
This line is logged [here|#L1184].]

Snippet below:
{code:java}
      DatanodeDescriptor nodeS = getDatanode(nodeReg.getDatanodeUuid());
      DatanodeDescriptor nodeN = ho

[jira] [Created] (HDFS-17188) Data loss in our production clusters due to missing HDFS-16540

2023-09-13 Thread Rushabh Shah (Jira)
Rushabh Shah created HDFS-17188:
---

 Summary: Data loss in our production clusters due to missing 
HDFS-16540 
 Key: HDFS-17188
 URL: https://issues.apache.org/jira/browse/HDFS-17188
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.10.1
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Recently we saw missing blocks in our production clusters running on dynamic 
environments like AWS. We are running some version of hadoop-2.10 code line.

Events that led to data loss:
 #  We have a pool of available IP addresses, and whenever a datanode restarts it 
picks any available IP address from that pool.
 #  We have seen that, during the lifetime of the namenode process, multiple 
datanodes were restarted and the same datanode has used different IP addresses.
 # One case that I was debugging was very interesting. 
DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> 
ip-address-3
DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> 
ip-address-1 
Observe the last IP address change for DN-UUID-2: it is ip-address-1, which was 
the first IP address of DN-UUID-1.
 #  There was some bug in our operational script which led to all datanodes 
getting restarted at the same time.

Just after the restart, we see the following log lines.
{noformat}
2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.1:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.3:50010
2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* 
registerDatanode: 10.x.x.4:50010
{noformat}
This line is logged [here|#L1184].

Snippet below:
{code:java}
      DatanodeDescriptor nodeS = getDatanode(nodeReg.getDatanodeUuid());
      DatanodeDescriptor nodeN = host2DatanodeMap.getDatanodeByXferAddr(
          nodeReg.getIpAddr(), nodeReg.getXferPort());
        
      if (nodeN != null && nodeN != nodeS) {
        NameNode.LOG.info("BLOCK* registerDatanode: " + nodeN);
        // nodeN previously served a different data storage, 
        // which is not served by anybody anymore.
        removeDatanode(nodeN);
        // physically remove node from datanodeMap
        wipeDatanode(nodeN);
        nodeN = null;
      } {code}
 

This happens when the DatanodeDescriptor is not the same object in datanodeMap 
and host2DatanodeMap. HDFS-16540 fixed this bug in the context of lost data 
locality, not data loss. :)
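
To make the inconsistency concrete, here is a small, self-contained illustration. 
The class and map names are hypothetical stand-ins (not the real DatanodeManager 
code) for DatanodeDescriptor, datanodeMap and host2DatanodeMap, showing how a 
UUID-keyed map and a stale ip:port-keyed map can disagree once an IP address is 
reused, which is exactly the nodeN != nodeS case in the snippet above.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class MapDriftSketch {
  static final class Descriptor {
    final String uuid;
    String xferAddr;
    Descriptor(String uuid, String xferAddr) { this.uuid = uuid; this.xferAddr = xferAddr; }
  }

  public static void main(String[] args) {
    Map<String, Descriptor> byUuid = new HashMap<>();      // keyed by datanode UUID
    Map<String, Descriptor> byXferAddr = new HashMap<>();  // keyed by ip:port

    // DN-UUID-1 registers from ip-address-1.
    Descriptor dn1 = new Descriptor("DN-UUID-1", "ip-address-1:50010");
    byUuid.put(dn1.uuid, dn1);
    byXferAddr.put(dn1.xferAddr, dn1);

    // DN-UUID-1 later re-registers from ip-address-3, but the stale
    // ip-address-1 entry is never removed from the ip:port map.
    dn1.xferAddr = "ip-address-3:50010";
    byXferAddr.put(dn1.xferAddr, dn1);

    // Now DN-UUID-2 registers from the recycled ip-address-1.
    Descriptor nodeS = byUuid.get("DN-UUID-2");              // null: UUID never seen before
    Descriptor nodeN = byXferAddr.get("ip-address-1:50010"); // stale descriptor of DN-UUID-1

    // registerDatanode() would see nodeN != nodeS and remove/wipe nodeN,
    // i.e. a datanode that is still alive and serving blocks at ip-address-3.
    System.out.println("nodeN != nodeS ? " + (nodeN != nodeS)); // prints true
  }
}
{code}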

 

By filing this jira, I want to discuss the following things:
 # Do we really want to call the removeDatanode method from the namenode whenever 
any such discrepancy in the maps is spotted? Can we rely on the first full block 
report or the periodic full block report from the datanode to fix the metadata?
 # Improve logging in the blockmanagement code to debug these issues faster.
 # Add a test case with the exact events that occurred in our env and still make 
sure that datanodeMap and host2DatanodeMap are consistent.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-04 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629131#comment-17629131
 ] 

Rushabh Shah commented on HDFS-16829:
-

>  If I catch correct if you had that patch in, Then write would have failed 
> and you would be still safe from the data loss atleast



If we had that patch from HDFS-16146 in, then it would have added a new 
datanode to the pipeline and the write might have succeeded. I don't fully 
understand what state nodeA was in, because when we add a new node to the 
pipeline, the existing node has to copy the block to the new node. If the first 
node was actually sick and restarting, I don't know how that datanode would 
have handled this scenario.

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This are the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (This physical host was also running 
> region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in 
> datanode process or regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile:  is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -  Logs from nodeB -
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> 

[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-04 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629109#comment-17629109
 ] 

Rushabh Shah commented on HDFS-16829:
-

> That is my point like in your case the pipeline landed up having one 1 
> datanode and that datanode had some issues.

We use the default value, i.e. 0, so it will always throw an IOException if it 
is unable to find a datanode. In our case, it didn't even attempt to get an 
additional datanode, since the pipeline was in the PIPELINE_CLOSE or 
PIPELINE_CLOSE_RECOVERY stage and we didn't have HDFS-16146 in our production 
environment, so it bailed out before even trying. [~ayushtkn] 

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This are the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (This physical host was also running 
> region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in 
> datanode process or regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile:  is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -  Logs from nodeB -
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
>  
> {noformat}
>

[jira] [Comment Edited] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-04 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628011#comment-17628011
 ] 

Rushabh Shah edited comment on HDFS-16829 at 11/4/22 3:48 PM:
--

{quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if 
it doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.
{quote}
-I think changing this value will work.- This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492].
 But before that there used to be a check (in the 2.10 branch) which would not 
add a node to the pipeline if the pipeline stage was PIPELINE_CLOSE or 
PIPELINE_CLOSE_RECOVERY. That check was removed by HDFS-16146. We don't have 
HDFS-16146 in our production environment.


was (Author: shahrs87):
{quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if 
it doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.
{quote}
I think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492].
 But before that there used to be a check (in 2.10 branch) which will not add a 
node to pipeline if pipeline stage is PIPELINE_CLOSE or 
PIPELINE_CLOSE_RECOVERY. This check was removed by HDFS-16146   We don't have 
HDFS-16146 in our production environment.

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This are the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (This physical host was also running 
> region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in 
> datanode process or regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> ne

[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-04 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629057#comment-17629057
 ] 

Rushabh Shah commented on HDFS-16829:
-

{quote}

dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it 
doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.

I think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]

{quote}

Actually this config *won't* work. This config property is _only_ used if we 
are unable to find any good replacement datanode. If we still have at least 
dfs.client.block.write.replace-datanode-on-failure.min-replication datanodes in 
the pipeline, the client will go ahead and continue writing the data; otherwise 
it will throw an exception. So I think this Jira will still provide value, at 
the cost of some extra disk space. 
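
A minimal sketch of the fallback semantics described above (assumed behaviour, 
not the actual DFSClient code): the min-replication value only matters once no 
replacement datanode could be found.
{code:java}
import java.io.IOException;

// Sketch only: models the decision the client makes after it has FAILED to find
// a replacement datanode. minReplication corresponds to
// dfs.client.block.write.replace-datanode-on-failure.min-replication.
public class ReplaceDatanodeFallbackSketch {
  static void onReplacementFailure(int remainingNodes, int minReplication) throws IOException {
    if (minReplication > 0 && remainingNodes >= minReplication) {
      // Enough replicas left in the pipeline: keep writing with fewer datanodes.
      return;
    }
    throw new IOException("Failed to replace a bad datanode; only "
        + remainingNodes + " node(s) left in the pipeline");
  }

  public static void main(String[] args) {
    try {
      onReplacementFailure(2, 1);   // min-replication=1, 2 nodes left: write continues
      System.out.println("continued writing with a shorter pipeline");
      onReplacementFailure(2, 0);   // default min-replication=0: always fails here
    } catch (IOException e) {
      System.out.println("write aborted: " + e.getMessage());
    }
  }
}
{code}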

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This are the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (This physical host was also running 
> region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in 
> datanode process or regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile:  is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -  Logs from nodeB -
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_D

[jira] [Comment Edited] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-04 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629057#comment-17629057
 ] 

Rushabh Shah edited comment on HDFS-16829 at 11/4/22 3:47 PM:
--

{quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if 
it doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.

I think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]
{quote}
Actually this config *won't* work. This config property is _only_ used if we 
are unable to find any good datanode. If we still have 
dfs.client.block.write.replace-datanode-on-failure.min-replication number of 
datanodes in the pipeline, the client will go ahead and continue writing the 
data otherwise it will throw an exception. So I think this Jira will still 
provide value with some extra disk space. [~ayushtkn] 


was (Author: shahrs87):
{quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if 
it doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.

I think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]
{quote}
Actually this config *won't* work. This config property is _only_ used if we 
are unable to find any good datanode. If we still have 
dfs.client.block.write.replace-datanode-on-failure.min-replication number of 
datanodes in the pipeline, the client will go ahead and continue writing the 
data otherwise it will throw an exception. So I think this Jira will still 
provide value with some extra disk space. 

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This are the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (This physical host was also running 
> region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in 
> datanode process or regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for

[jira] [Comment Edited] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-04 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629057#comment-17629057
 ] 

Rushabh Shah edited comment on HDFS-16829 at 11/4/22 3:47 PM:
--

{quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if 
it doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.

I think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]
{quote}
Actually this config *won't* work. This config property is _only_ used if we 
are unable to find any good datanode. If we still have 
dfs.client.block.write.replace-datanode-on-failure.min-replication number of 
datanodes in the pipeline, the client will go ahead and continue writing the 
data otherwise it will throw an exception. So I think this Jira will still 
provide value with some extra disk space. 


was (Author: shahrs87):
{quote}

dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it 
doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.

I think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]

{quote}

Actually this config *won't* work. This config property is _only_ used if we 
are unable to find any good datanode. If we still have 
dfs.client.block.write.replace-datanode-on-failure.min-replication number of 
datanodes in the pipeline, the client will go ahead and continue writing the 
data otherwise it will throw an exception. So I think this Jira will still 
provide value with some extra disk space. 

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This are the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (This physical host was also running 
> region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in 
> datanode process or regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022

[jira] [Comment Edited] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-02 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628011#comment-17628011
 ] 

Rushabh Shah edited comment on HDFS-16829 at 11/2/22 11:07 PM:
---

{quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if 
it doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.
{quote}
I think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492].
 But before that there used to be a check (in 2.10 branch) which will not add a 
node to pipeline if pipeline stage is PIPELINE_CLOSE or 
PIPELINE_CLOSE_RECOVERY. This check was removed by HDFS-16146   We don't have 
HDFS-16146 in our production environment.


was (Author: shahrs87):
{quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if 
it doesn't behave that way, good to have a new config with such behaviour 
semantics
Thank you for pointing out this config. Let me check this out.
{quote}
I don't think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492].
 But before that there is a check (in 2.10 branch) which will not add a node to 
pipeline if pipeline stage is PIPELINE_CLOSE or PIPELINE_CLOSE_RECOVERY. This 
check was removed by HDFS-16146    We don't have HDFS-16146 in our production 
environment.

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length 
> 59158528, but according to the namenode (via fsck) the length of this block is 
> 64414655.
> This is the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (this physical host was also running the 
> region server process, which was the hdfs client) was restarting at the same 
> time. Unfortunately this node was sick and it didn't log anything in either the 
> datanode process or the regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> new

[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-02 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628011#comment-17628011
 ] 

Rushabh Shah commented on HDFS-16829:
-

{quote}dfs.client.block.write.replace-datanode-on-failure.min-replication, if 
it doesn't behave that way, good to have a new config with such behaviour 
semantics.
Thank you for pointing out this config. Let me check this out.
{quote}
I don't think changing this value will work. This config property is used 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492].
But before that there is a check (in the 2.10 branch) that will not add a node to 
the pipeline if the pipeline stage is PIPELINE_CLOSE or PIPELINE_CLOSE_RECOVERY. That 
check was removed by HDFS-16146. We don't have HDFS-16146 in our production 
environment.

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length 
> 59158528, but according to the namenode (via fsck) the length of this block is 
> 64414655.
> This is the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (this physical host was also running the 
> region server process, which was the hdfs client) was restarting at the same 
> time. Unfortunately this node was sick and it didn't log anything in either the 
> datanode process or the regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile:  is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -  Logs from nodeB -
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
> impl.FsDatasetAsyncDiskService - D

[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-01 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627294#comment-17627294
 ] 

Rushabh Shah commented on HDFS-16829:
-

Thank you [~ayushsaxena] for the response.

> What if the client still pushed more data to the single datanode? The data 
> contained by the other datanodes won't be complete right? 
The proposal is to keep the blocks on disk so that an operator can manually 
restore them. We don't want to automatically make the replica with the old 
genstamp the primary replica.

> Moreover with this in genuine cases also you would be occupying the memory 
> with blocks with old genstamps, if the cluster is really unstable and has a 
> lot of updatePipelines or things like that, it may be a issue in that case
IMHO this shouldn't be an issue. We have the ReplicationMonitor thread, which works 
pretty well. Even in the case of many update pipelines, ReplicationMonitor is fast 
enough to replicate the blocks and maintain the replication factor. The proposal is 
just to delay deleting the older-genstamp blocks until ReplicationMonitor 
replicates the block. If you are concerned about this behavior, we can introduce a 
config key that defaults to the current behavior and, when enabled, delays deletion 
until the block is fully replicated (see the sketch below).
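
To make the proposed behaviour concrete, here is a small self-contained sketch of 
the gating idea (the class, the config flag and the method names are invented for 
illustration only; this is not existing namenode code):

{noformat}
// Illustration of the proposed gate (invented names, not Hadoop code): a replica
// carrying an older generation stamp is only scheduled for deletion once the block
// with the newer genstamp has reached its target replication.
import java.util.ArrayDeque;
import java.util.Queue;

class StaleReplicaGate {
  private final boolean delayDeletionUntilFullyReplicated;  // proposed config, default false
  private final Queue<Long> postponedBlockIds = new ArrayDeque<>();

  StaleReplicaGate(boolean delayDeletionUntilFullyReplicated) {
    this.delayDeletionUntilFullyReplicated = delayDeletionUntilFullyReplicated;
  }

  /** Returns true if the old-genstamp replica may be deleted now. */
  boolean mayDelete(long blockId, int liveReplicas, int targetReplication) {
    if (delayDeletionUntilFullyReplicated && liveReplicas < targetReplication) {
      postponedBlockIds.add(blockId);  // re-examined after ReplicationMonitor catches up
      return false;                    // keep the stale replica for now
    }
    return true;                       // current behavior: delete immediately
  }

  public static void main(String[] args) {
    StaleReplicaGate gate = new StaleReplicaGate(true);
    // Only one live replica of the new genstamp exists but three are required:
    System.out.println(gate.mayDelete(3317546151L, 1, 3));  // false -> keep old copy
    System.out.println(gate.mayDelete(3317546151L, 3, 3));  // true  -> safe to delete
  }
}
{noformat}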

> You can get something if you figure out really quick before the configured 
> time after which the block gets deleted. 
In this case, the replica with the old genstamp was deleted within 3 seconds after 
the file was closed. One would need to be superhuman to catch this in production. :)

> dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it 
> doesn't behave that way, good to have a new config with such behaviour 
> semantics
Thank you for pointing out this config. Let me check this out.

> I think if there is one node only, we can give it to try that we have 
>syncBlock always true in such cases, might help or  may be not if the Disk has 
>some issues in some cases...

Given that I don't have logs from either the datanode or the hdfs client, I am not 
able to understand what exactly happened. :(

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length 
> 59158528, but according to the namenode (via fsck) the length of this block is 
> 64414655.
> This is the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (this physical host was also running the 
> region server process, which was the hdfs client) was restarting at the same 
> time. Unfortunately this node was sick and it didn't log anything in either the 
> datanode process or the regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
>

[jira] [Updated] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-10-31 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-16829:

Priority: Critical  (was: Major)

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length 
> 59158528, but according to the namenode (via fsck) the length of this block is 
> 64414655.
> This is the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (this physical host was also running the 
> region server process, which was the hdfs client) was restarting at the same 
> time. Unfortunately this node was sick and it didn't log anything in either the 
> datanode process or the regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile:  is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -  Logs from nodeB -
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
>  
> {noformat}
> -  Logs from nodeC -
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeB:56486
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=LAST_IN_PIPELINE terminating
> 2022-10-23 12:36:38,891 INFO  [/data-1/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk

[jira] [Updated] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-10-31 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-16829:

Description: 
We encountered this data loss issue in one of our production clusters, which runs 
the HBase service. We received a missing block alert in this cluster. The following 
error was logged by the datanode holding the block.
{noformat}
2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
/nodeA:31722 dst: 
java.io.IOException:  Offset 64410559 and length 4096 don't match block 
BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
at java.lang.Thread.run(Thread.java:750)
{noformat}

The node +nodeA+ has this block blk_3317546151_2244173222 with file length 
59158528, but according to the namenode (via fsck) the length of this block is 
64414655.

This is the sequence of events for this block.

 
1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
genstamp: 2244173147. 
2. The first datanode in the pipeline (this physical host was also running the 
region server process, which was the hdfs client) was restarting at the same time. 
Unfortunately this node was sick and it didn't log anything in either the datanode 
process or the regionserver process during the time of block creation.
3. Namenode updated the pipeline just with the first node.
4. Namenode logged updatePipeline success with just 1st node nodeA with block 
size: 64414655 and new generation stamp: 2244173222
5. Namenode asked nodeB and nodeC to delete the block since it has old 
generation stamp.
6. All the reads (client reads and data transfer reads) from nodeA are failing 
with the above stack trace.
See logs below from namenode and nodeB and nodeC.


{noformat}
 Logs from namenode  -
2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - BLOCK* 
allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 , 
nodeC:51010 for 
2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
updatePipeline(blk_3317546151_2244173147, newGS=2244173222, newLength=64414655, 
newNodes=[nodeA:51010], client=DFSClient_NONMAPREDUCE_1038417265_1)
2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
completeFile:  is closed by DFSClient_NONMAPREDUCE_1038417265_1
{noformat}

{noformat}
-  Logs from nodeB -
2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
from nodeA:30302
2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
blk_3317546151_2244173147 file 
/data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}
 
{noformat}
-  Logs from nodeC -
2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
from nodeB:56486
2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
type=LAST_IN_PIPELINE terminating
2022-10-23 12:36:38,891 INFO  [/data-1/hdfs/current] 
impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
blk_3317546151_2244173147 file 
/data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}

My observation, but I can't confirm anything since I don't have logs from the first 
node in the pipeline (nodeA) or the hdfs client:
Due to disk write errors or a bug in BlockReceiver, nodeA didn't flush/sync the 
last 5MB (64414655-59158528) of data to disk. I assume it was buffered in memory, 
since the namenode got an acknowledgement from the client that updatePipeline 
succeeded.
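
For context on why acknowledged data can still be missing from disk: hflush() only 
guarantees the bytes have reached all datanodes in the pipeline, while hsync() 
additionally asks the datanodes to sync them to disk. A minimal illustration (the 
file name and payload are made up):

{noformat}
// Difference between hflush() and hsync() on an HDFS output stream.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushVsSync {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataOutputStream out = fs.create(new Path("/tmp/flush-vs-sync"))) {
      out.writeBytes("some WAL edits");
      out.hflush(); // data is acked by every datanode in the pipeline and visible to
                    // new readers, but may still sit in datanode/OS buffers
      out.hsync();  // additionally asks each datanode to fsync the block data to disk
    }
  }
}
{noformat}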


*Improvement:*
One suggested improvement is to delay deleting the blocks with old generation 
stamp until the block is replicated sufficiently whi

[jira] [Created] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-10-31 Thread Rushabh Shah (Jira)
Rushabh Shah created HDFS-16829:
---

 Summary: Delay deleting blocks with older generation stamp until 
the block is fully replicated.
 Key: HDFS-16829
 URL: https://issues.apache.org/jira/browse/HDFS-16829
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, namenode
Affects Versions: 2.10.1
Reporter: Rushabh Shah


We encountered this data loss issue in one of our production clusters, which runs 
the HBase service. We received a missing block alert in this cluster. The following 
error was logged by the datanode holding the block.
{noformat}
2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
/nodeA:31722 dst: 
java.io.IOException:  Offset 64410559 and length 4096 don't match block 
BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
at java.lang.Thread.run(Thread.java:750)
{noformat}

The node +nodeA+ has this block blk_3317546151_2244173222 with file length 
59158528, but according to the namenode (via fsck) the length of this block is 
64414655.

This is the sequence of events for this block.

 
1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
genstamp: 2244173147. 
2. The first datanode in the pipeline (this physical host was also running the 
region server process, which was the hdfs client) was restarting at the same time. 
Unfortunately this node was sick and it didn't log anything in either the datanode 
process or the regionserver process during the time of block creation.
3. Namenode updated the pipeline just with the first node.
4. Namenode logged updatePipeline success with just 1st node nodeA with block 
size: 64414655 and new generation stamp: 2244173222
5. Namenode asked nodeB and nodeC to delete the block since it has old 
generation stamp.
6. All the reads (client reads and data transfer reads) from nodeA are failing 
with the above stack trace.
See logs below from namenode and nodeB and nodeC.


{noformat}
 Logs from namenode  -
2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - BLOCK* 
allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 , 
nodeC:51010 for 
2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
updatePipeline(blk_3317546151_2244173147, newGS=2244173222, newLength=64414655, 
newNodes=[nodeA:51010], client=DFSClient_NONMAPREDUCE_1038417265_1)
2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
completeFile:  is closed by DFSClient_NONMAPREDUCE_1038417265_1
{noformat}

{noformat}
-  Logs from nodeB -
2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
from nodeA:30302
2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
blk_3317546151_2244173147 file 
/data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}
 
{noformat}
-  Logs from nodeC -
2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
from nodeB:56486
2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
type=LAST_IN_PIPELINE terminating
2022-10-23 12:36:38,891 INFO  [/data-1/hdfs/current] 
impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
blk_3317546151_2244173147 file 
/data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}

My observation, but I can't confirm anything since I don't have logs from the first 
node in the pipeline (nodeA) or the hdfs client:
Due to disk write errors or a bug in BlockReceiver, nodeA didn't flush/sync the 
last 5MB (64414655-59158528) of data to disk. I assume it buffered in memory