[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827839#comment-17827839 ] Rushabh Shah commented on HDFS-17299:
-
Merged to branch-2.10 and branch-3.3 also. Thank you [~gargrite] for the patch.

> HDFS is not rack failure tolerant while creating a new file.
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.10.1
> Reporter: Rushabh Shah
> Assignee: Ritesh
> Priority: Critical
> Labels: pull-request-available
> Fix For: 2.10.3, 3.3.9, 3.4.1, 3.5.0
> Attachments: repro.patch
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters:
> dfs.namenode.heartbeat.recheck-interval: 600000
> dfs.heartbeat.interval: 3
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>
> Steps to reproduce:
> # Bring down 1 AZ.
> # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush on the newly created file.
> # DataStreamer is not able to find block locations that satisfy the rack placement policy (one copy in each rack, which essentially means one copy in each AZ).
> # Since all the datanodes in that AZ are down but still considered alive by the namenode, the client gets different datanodes, but all of them are in the same AZ. See logs below.
> # HBase is not able to create a WAL file and it aborts the region server.
>
> Relevant logs from hdfs client and namenode:
> {noformat}
> 2023-12-16 17:17:43,818 INFO [on default port 9000] FSNamesystem.audit - allowed=true ugi=hbase/ (auth:KERBEROS) ip= cmd=create src=/hbase/WALs/ dst=null
> 2023-12-16 17:17:43,978 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652565_140946716, replicas=:50010, :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
>     at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
>     at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652580_140946764, replicas=:50010, :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
>     at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
>     at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652594_140946796, replicas=:50010, :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
>     at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
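The 20.5-minute figure quoted in the description comes from the NameNode's heartbeat-expiry formula, 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval. A minimal sketch of that arithmetic follows; the property names are the real HDFS configuration keys, but the class and method here are illustrative, not Hadoop source:

```java
// Sketch: how long the NameNode takes to declare a datanode dead,
// given the two settings from the issue description. Not Hadoop code.
public class DeadNodeInterval {

    // 2 * recheck-interval (ms) + 10 * heartbeat interval (seconds)
    static long deadIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * heartbeatIntervalSec * 1000;
    }

    public static void main(String[] args) {
        // Values reported in the issue: recheck-interval 600000 ms, heartbeat 3 s.
        long ms = deadIntervalMs(600_000L, 3L);
        System.out.println(ms + " ms = " + (ms / 60_000.0) + " mins"); // 1230000 ms = 20.5 mins
    }
}
```

Until that interval elapses, every datanode in the downed AZ still looks alive to the NameNode, which is why it keeps handing the client targets from the surviving AZs only.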
[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-17299:
Fix Version/s: 2.10.3
[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-17299:
Fix Version/s: 3.3.7
[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-17299:
Fix Version/s: 3.3.9 (was: 3.3.7)
[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824197#comment-17824197 ] Rushabh Shah commented on HDFS-17299:
-
[~slfan1989] By mistake, I cherry-picked this change to branch-3.4.0. I was not aware that you are trying to do an RC from this branch. This is the commit: https://github.com/apache/hadoop/commit/f62f116b75dc999c2d4a1824cf099c34e64e74aa
[jira] [Resolved] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HDFS-17299.
-
Fix Version/s: 3.4.1, 3.5.0
Resolution: Fixed
[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824194#comment-17824194 ] Rushabh Shah commented on HDFS-17299:
-
Thank you [~gargrite] for the PR. The patch applied cleanly to trunk, branch-3.4 and branch-3.4.0. There are some conflicts applying it to branch-3.3, branch-3.2 and branch-2.10. Can you please create PRs for these branches? Thank you.
[jira] [Created] (HDFS-17409) TestDFSAdmin has a couple of flaky tests.
Rushabh Shah created HDFS-17409: --- Summary: TestDFSAdmin has a couple of flaky tests. Key: HDFS-17409 URL: https://issues.apache.org/jira/browse/HDFS-17409 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.5.0 Reporter: Rushabh Shah TestDFSAdmin has a couple of flaky tests. TestDFSAdmin#testAllDatanodesReconfig TestDFSAdmin#testDecommissionDataNodesReconfig Seeing failures in a couple of separate PRs: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6583/7/#showFailuresLink https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/ testDecommissionDataNodesReconfig is failing locally with the following error: {noformat} java.lang.AssertionError at org.junit.Assert.fail(Assert.java:87) at org.junit.Assert.assertTrue(Assert.java:42) at org.junit.Assert.assertTrue(Assert.java:53) at org.apache.hadoop.hdfs.tools.TestDFSAdmin.testDecommissionDataNodesReconfig(TestDFSAdmin.java:1356) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805723#comment-17805723 ] Rushabh Shah commented on HDFS-16064: - [~KevinWikant] By any chance do you have a draft PR for branch-2.10 saved? > Determine when to invalidate corrupt replicas based on number of usable > replicas > > > Key: HDFS-16064 > URL: https://issues.apache.org/jira/browse/HDFS-16064 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 3.2.1 >Reporter: Kevin Wikant >Assignee: Kevin Wikant >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.5 > > Time Spent: 2h > Remaining Estimate: 0h > > Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a > non-issue under the assumption that if the namenode & a datanode get into an > inconsistent state for a given block pipeline, there should be another > datanode available to replicate the block to > While testing datanode decommissioning using "dfs.exclude.hosts", I have > encountered a scenario where the decommissioning gets stuck indefinitely > Below is the progression of events: > * there are initially 4 datanodes DN1, DN2, DN3, DN4 > * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" > * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in > order to satisfy their minimum replication factor of 2 > * during this replication process > https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes > the following inconsistent state: > ** DN3 thinks it has the block pipeline in FINALIZED state > ** the namenode does not think DN3 has the block pipeline > {code:java} > 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode > (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): > DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654 > dst: /DN3:9866; > 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block > BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created. > {code} > * the replication is attempted again, but: > ** DN4 has the block > ** DN1 and/or DN2 have the block, but don't count towards the minimum > replication factor because they are being decommissioned > ** DN3 does not have the block & cannot have the block replicated to it > because of HDFS-721 > * the namenode repeatedly tries to replicate the block to DN3 & repeatedly > fails, this continues indefinitely > * therefore DN4 is the only live datanode with the block & the minimum > replication factor of 2 cannot be satisfied > * because the minimum replication factor cannot be satisfied for the > block(s) being moved off DN1 & DN2, the datanode decommissioning can never be > completed > {code:java} > 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): > Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, > decommissioned replicas: 0, decommissioning replicas: 2, maintenance > replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is > Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , > Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is > current datanode entering maintenance: false > ... 
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): > Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, > decommissioned replicas: 0, decommissioning replicas: 2, maintenance > replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is > Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , > Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is > current datanode entering maintenance: false > {code} > Being stuck in decommissioning state forever is not an intended behavior of > DataNode decommissioning > A few potential solutions: > * Address the root cause of the problem which is an inconsistent state > between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721 > * Detect when datanode decommissioning is stuck due to lack of available > datanodes for satisfying the minimum replication factor, then recover by > re-enabling the datanodes being decommissioned
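The stuck state reported in the BlockStateChange lines above can be sketched in a few lines (illustrative names, not actual NameNode code): only live, non-decommissioning replicas count toward the expected replication factor, and since DN3 cannot accept the block because of HDFS-721, the live count can never grow.

```python
# Illustrative replica accounting for the stuck block described above
# (simplified; not actual NameNode code). Only live, non-decommissioning
# replicas count toward the expected replication factor.
def usable_replicas(replica_states):
    return sum(1 for state in replica_states.values() if state == "LIVE")

# State from the BlockStateChange log: Expected Replicas: 2, live replicas: 1,
# decommissioning replicas: 2. DN3 cannot accept the block (HDFS-721), so the
# live count never grows and decommissioning of DN1/DN2 never completes.
states = {"DN1": "DECOMMISSIONING", "DN2": "DECOMMISSIONING", "DN4": "LIVE"}
expected_replicas = 2
stuck = usable_replicas(states) < expected_replicas
print(stuck)  # True
```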
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805704#comment-17805704 ] Rushabh Shah commented on HDFS-16064: - Thank you [~KevinWikant] for your reply. We also hit this bug recently in our production cluster. One of my colleagues is trying to backport this patch to branch-2.10. Let us see if we have any luck backporting this change. > Determine when to invalidate corrupt replicas based on number of usable > replicas
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805302#comment-17805302 ] Rushabh Shah commented on HDFS-16064: - [~KevinWikant] [~aajisaka] Any reason why we haven't backported this fix to branch-2.10? > Determine when to invalidate corrupt replicas based on number of usable > replicas
[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801889#comment-17801889 ] Rushabh Shah commented on HDFS-17299: - Thank you [~ayushtkn] > HDFS is not rack failure tolerant while creating a new file.
[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801867#comment-17801867 ] Rushabh Shah commented on HDFS-17299: - [~gargrite] is interested in working on this jira. [~ayushtkn] [~hexiaoqiao] Can one of you please add him to the Contributors list so that I can assign the Jira to him? Thank you! > HDFS is not rack failure tolerant while creating a new file.
[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-17299: Description: Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ. Our configuration: 1. We use 3 Availability Zones (AZs) for fault tolerance. 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy. 3. We use the following configuration parameters: dfs.namenode.heartbeat.recheck-interval: 600000 dfs.heartbeat.interval: 3 So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead. Steps to reproduce: # Bring down 1 AZ. # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush on the newly created file. # DataStreamer is not able to find block locations that satisfy the rack placement policy (one copy in each rack, which essentially means one copy in each AZ). # Since all the datanodes in that AZ are down but still considered alive by the namenode, the client gets different datanodes, but still all of them are in the same AZ. See logs below. # HBase is not able to create a WAL file and it aborts the region server. 
Relevant logs from hdfs client and namenode {noformat} 2023-12-16 17:17:43,818 INFO [on default port 9000] FSNamesystem.audit - allowed=true ugi=hbase/ (auth:KERBEROS) ip= cmd=create src=/hbase/WALs/ dst=null 2023-12-16 17:17:43,978 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652565_140946716, replicas=:50010, :50010, :50010 for /hbase/WALs/ 2023-12-16 17:17:44,061 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) 2023-12-16 17:17:44,061 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874--1594838129323:blk_1214652565_140946716 2023-12-16 17:17:44,179 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK] 2023-12-16 17:17:44,339 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652580_140946764, replicas=:50010, :50010, :50010 for /hbase/WALs/ 2023-12-16 17:17:44,369 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) 2023-12-16 17:17:44,369 WARN [Thread-39087] hdfs.DataStreamer - 
Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764 2023-12-16 17:17:44,454 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK] 2023-12-16 17:17:44,522 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652594_140946796, replicas=:50010, :50010, :50010 for /hbase/WALs/ 2023-12-16 17:17:44,712 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) 2023-12-16 17:17:44,712 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796 2023-12-16 17:17:44,732 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK] 2023-12-16 17:17:44,855 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652607_140946850, replicas=:50010, :50010, :50010 for /hbase/WALs/ 2023-12-16 17:17:44,867 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 a
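The 20.5-minute detection figure in the description above follows the namenode's standard dead-node heuristic, 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval. A quick sanity check, assuming the recheck interval is configured in milliseconds and the heartbeat interval in seconds:

```python
# NameNode dead-datanode detection heuristic:
#   2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
# recheck-interval is assumed to be in milliseconds, heartbeat in seconds.
recheck_ms = 600_000   # assumed value: 10 minutes
heartbeat_s = 3
timeout_ms = 2 * recheck_ms + 10 * heartbeat_s * 1_000
print(timeout_ms, timeout_ms / 60_000)  # 1230000 ms, 20.5 minutes
```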
[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800840#comment-17800840 ] Rushabh Shah commented on HDFS-17299: - Thank you [~ayushtkn] [~hexiaoqiao] [~tasanuma] for the healthy discussion. > Excluding a rack in the streamer is quite tricky, we don't know the BPP >neither the Cluster Rack configuration during the {{DataStreamer}} setup. I don't think this is tricky. Like we pass excluded nodes, we have to provide excludedRacks argument to the namenode. On the namenode side, in the scope parameter we will have to set the scope as "{*}~"{*} The negation sign at the start of the argument is important as it will excluded the whole rack. We can add some extra protection for if the excluded rack is the only rack that is available then ignore the parameter. Since everyone is of the opinion of handling this on the client side, I am fine with that too. I don't see any problem with writing and committing the file with 2 replicas and let namenode handle the replication later. Let me create the patch for review soon. Thank you again ! > HDFS is not rack failure tolerant while creating a new file. > > > Key: HDFS-17299 > URL: https://issues.apache.org/jira/browse/HDFS-17299 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1 >Reporter: Rushabh Shah >Priority: Critical > Attachments: repro.patch > > > Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ. > Our configuration: > 1. We use 3 Availability Zones (AZs) for fault tolerance. > 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy. > 3. We use the following configuration parameters: > dfs.namenode.heartbeat.recheck-interval: 60 > dfs.heartbeat.interval: 3 > So it will take 123 ms (20.5mins) to detect that datanode is dead. > > Steps to reproduce: > # Bring down 1 AZ. 
> # HBase (HDFS client) tries to create a file (WAL file) and then calls > hflush on the newly created file. > # DataStreamer is not able to find blocks locations that satisfies the rack > placement policy (one copy in each rack which essentially means one copy in > each AZ) > # Since all the datanodes in that AZ are down but still alive to namenode, > the client gets different datanodes but still all of them are in the same AZ. > See logs below. > # HBase is not able to create a WAL file and it aborts the region server. > > Relevant logs from hdfs client and namenode > > {noformat} > 2023-12-16 17:17:43,818 INFO [on default port 9000] FSNamesystem.audit - > allowed=trueugi=hbase/ (auth:KERBEROS) ip= > cmd=create src=/hbase/WALs/ dst=null > 2023-12-16 17:17:43,978 INFO [on default port 9000] hdfs.StateChange - > BLOCK* allocate blk_1214652565_140946716, replicas=:50010, > :50010, :50010 for /hbase/WALs/ > 2023-12-16 17:17:44,061 INFO [Thread-39087] hdfs.DataStreamer - Exception in > createBlockOutputStream > java.io.IOException: Got error, status=ERROR, status message , ack with > firstBadLink as :50010 > at > org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) > at > org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) > 2023-12-16 17:17:44,061 WARN [Thread-39087] hdfs.DataStreamer - Abandoning > BP-179318874--1594838129323:blk_1214652565_140946716 > 2023-12-16 17:17:44,179 WARN [Thread-39087] hdfs.DataStreamer - Excluding > datanode > DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK] > 2023-12-16 17:17:44,339 INFO [on default port 9000] hdfs.StateChange - > BLOCK* allocate blk_1214652580_140946764, replicas=:50010, > :50010, :50010 for /hbase/WALs/ > 2023-12-16 17:17:44,369 INFO 
[Thread-39087] hdfs.DataStreamer - Exception in > createBlockOutputStream > java.io.IOException: Got error, status=ERROR, status message , ack with > firstBadLink as :50010 > at > org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) > at > org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) > 2023-12-16 17:17:44,369 WARN [Thread-39087] hdfs.DataStreamer - Abandoning > BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764 > 2023-12-16 17:17:44,454 WARN [Thread-39087] hdfs.DataStre
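The excludedRacks idea proposed in the comment above can be sketched with a toy model. This is illustrative only: `pickOutsideRack` and the node-to-rack map are hypothetical stand-ins, not the namenode's actual NetworkTopology API; it only shows the "~" negation semantics (exclude the whole rack) plus the suggested safety valve when the excluded rack is the only one available.

```java
import java.util.*;

// Toy model of the "~<rack>" negation scope: given a node-to-rack map and
// an excluded rack, candidate nodes are those outside that rack. If the
// excluded rack is the only rack, the exclusion is ignored, matching the
// extra protection suggested in the comment above.
public class ExcludedRackSketch {
    static List<String> pickOutsideRack(Map<String, String> nodeToRack, String excludedRack) {
        Set<String> racks = new HashSet<>(nodeToRack.values());
        // Safety valve: ignore the exclusion if it would leave no racks at all.
        if (racks.size() == 1 && racks.contains(excludedRack)) {
            return new ArrayList<>(nodeToRack.keySet());
        }
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
            if (!e.getValue().equals(excludedRack)) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("dn1", "/az1");
        cluster.put("dn2", "/az2");
        cluster.put("dn3", "/az3");
        // Excluding the unreachable AZ leaves only healthy candidates.
        System.out.println(pickOutsideRack(cluster, "/az1")); // [dn2, dn3]
    }
}
```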
[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800217#comment-17800217 ] Rushabh Shah commented on HDFS-17299: - > Maybe if you would have put > dfs.client.block.write.replace-datanode-on-failure.enable as false, it > wouldn't have tried to replace the DN itself & went ahead with 2 DN from > other AZ? It is entirely possible that I am not reading the code right. I am a little bit out of sync with the DataStreamer codebase. But I don't see the config property dfs.client.block.write.replace-datanode-on-failure.enable being used anywhere in the PIPELINE_SETUP_CREATE phase. I am looking at branch-2.10. This is the code flow: [DataStreamer#run()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L708-L711] --> [nextBlockOutputStream()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1655] --> [createBlockOutputStream()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1751] There is a retry within nextBlockOutputStream via dfs.client.block.write.retries, but it does not take dfs.client.block.write.replace-datanode-on-failure.enable into consideration. Cc [~ayushtkn] [~hexiaoqiao]
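The retry behavior described in the comment above can be sketched as a loop. This is a simplified sketch, not the real DataStreamer code: `NameNode`, `Pipeline`, and `setupPipeline` are illustrative stand-ins. It models how pipeline setup abandons the block, excludes the first bad datanode, and asks the namenode again up to dfs.client.block.write.retries times, with the replace-datanode-on-failure policy playing no part in this path.

```java
import java.util.*;

// Simplified sketch of the PIPELINE_SETUP_CREATE retry loop: on failure the
// client abandons the block, adds the first bad datanode to the excluded
// list, and requests a fresh block from the namenode, up to
// dfs.client.block.write.retries extra attempts.
public class PipelineSetupSketch {
    interface NameNode { List<String> allocate(Set<String> excluded); }
    interface Pipeline { String firstBadLink(List<String> nodes); } // null if all nodes connect

    static List<String> setupPipeline(NameNode nn, Pipeline pipe, int retries) {
        Set<String> excluded = new HashSet<>();
        for (int attempt = 0; attempt <= retries; attempt++) {
            List<String> nodes = nn.allocate(excluded);
            String bad = pipe.firstBadLink(nodes);
            if (bad == null) {
                return nodes; // pipeline established
            }
            excluded.add(bad); // abandon block, exclude the bad node, retry
        }
        return null; // give up: the client surfaces the failure
    }

    public static void main(String[] args) {
        // Demo: the namenode keeps offering a node from the downed AZ until
        // it is excluded; with enough retries the pipeline finally succeeds.
        List<String> pool = List.of("az1-dn", "az2-dn", "az3-dn");
        NameNode nn = ex -> {
            List<String> r = new ArrayList<>();
            for (String n : pool) if (!ex.contains(n)) r.add(n);
            return r;
        };
        Pipeline pipe = nodes -> nodes.contains("az1-dn") ? "az1-dn" : null;
        System.out.println(setupPipeline(nn, pipe, 3)); // [az2-dn, az3-dn]
    }
}
```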
[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799891#comment-17799891 ] Rushabh Shah commented on HDFS-17299: - From the file creation (at 2023-12-16 17:17:43,818) to the DataStreamer giving up (at 2023-12-16 17:17:44,988), it took little more than 1 second. I doubt keeping the stale datanode interval at 9 seconds will help. I still think my proposed fix will help in such cases. [~ayushtkn] WDYT?
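The "little more than 1 second" window above can be checked directly from the two log timestamps quoted in the comment (file creation at 17:17:43,818 and the streamer giving up at 17:17:44,988):

```java
import java.time.Duration;
import java.time.LocalTime;

// Elapsed time between the create audit log and the DataStreamer giving up,
// taken from the timestamps in this thread's logs.
public class OutageWindow {
    public static void main(String[] args) {
        LocalTime create = LocalTime.of(17, 17, 43, 818_000_000);
        LocalTime giveUp = LocalTime.of(17, 17, 44, 988_000_000);
        long ms = Duration.between(create, giveUp).toMillis();
        System.out.println(ms + " ms"); // 1170 ms
    }
}
```

This is why a stale-node interval of several seconds cannot kick in before the client exhausts its retries.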
[jira] [Comment Edited] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799871#comment-17799871 ] Rushabh Shah edited comment on HDFS-17299 at 12/22/23 4:28 PM: --- bq. You should find the reason why the nodes were chosen in 1 AZ only, why no node was chosen in other 2 AZ, it will fallback to choosing nodes in 1 AZ(rack) only when it fails to spread them to different racks It is choosing 1 node from each rack. We have replication factor 3, we have 3 AZs (i.e. 3 racks), and we use BlockPlacementPolicyRackFaultTolerant as our block placement policy. While creating a write pipeline, it will choose 1 node from each AZ. But since 1 AZ is down, the datanode chosen in the bad AZ will always fail. bq. Did you explore dfs.namenode.avoid.write.stale.datanode and dfs.namenode.stale.datanode.interval. I believe that can bring down your value to few seconds, 30 by default but you can get that down to 3*Heartbeat interval IIRC. Haven't explored these configs yet. Thank you for pointing this out. Will take a look today. 
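The two properties pointed out above are set in hdfs-site.xml. A sketch with assumed values: the 9000 ms interval matches the 9-second figure discussed elsewhere in this thread (3 × a 3 s dfs.heartbeat.interval), while the shipped defaults are `false` and 30000 ms.

```xml
<!-- hdfs-site.xml: example values only; defaults are
     avoid.write.stale.datanode=false and stale.datanode.interval=30000 ms. -->
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <!-- 3 * dfs.heartbeat.interval of 3 s, expressed in milliseconds -->
  <value>9000</value>
</property>
```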
[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-17299: Description: Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ. Our configuration: 1. We use 3 Availability Zones (AZs) for fault tolerance. 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy. 3. We use the following configuration parameters: dfs.namenode.heartbeat.recheck-interval: 600000 dfs.heartbeat.interval: 3 So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead. Steps to reproduce: # Bring down 1 AZ. # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush on the newly created file. # DataStreamer is not able to find block locations that satisfy the rack placement policy (one copy in each rack, which essentially means one copy in each AZ). # Since all the datanodes in that AZ are down but still considered alive by the namenode, the client gets different datanodes, but all of them are still in the same AZ. See logs below. # HBase is not able to create a WAL file and it aborts the region server. 
Relevant logs from hdfs client and namenode {noformat} 2023-12-16 17:17:43,818 INFO [on default port 9000] FSNamesystem.audit - allowed=true ugi=hbase/ (auth:KERBEROS) ip= cmd=create src=/hbase/WALs/ dst=null 2023-12-16 17:17:43,978 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652565_140946716, replicas=:50010, :50010, :50010 for /hbase/WALs/ 2023-12-16 17:17:44,061 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) 2023-12-16 17:17:44,061 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874--1594838129323:blk_1214652565_140946716 2023-12-16 17:17:44,179 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK] 2023-12-16 17:17:44,339 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652580_140946764, replicas=:50010, :50010, :50010 for /hbase/WALs/ 2023-12-16 17:17:44,369 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) 2023-12-16 17:17:44,369 WARN [Thread-39087] hdfs.DataStreamer - 
Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764 2023-12-16 17:17:44,454 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK] 2023-12-16 17:17:44,522 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652594_140946796, replicas=:50010, :50010, :50010 for /hbase/WALs/ 2023-12-16 17:17:44,712 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) 2023-12-16 17:17:44,712 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796 2023-12-16 17:17:44,732 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK] 2023-12-16 17:17:44,855 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652607_140946850, replicas=:50010, :50010, :50010 for /hbase/WALs/ 2023-12-16 17:17:44,867 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 a
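The 20.5-minute figure in the description comes from the standard HDFS dead-node formula, 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval. A quick check, assuming the commonly used 600000 ms recheck interval and 3 s heartbeat that yield that figure:

```java
// Dead-node detection interval used in the description:
// 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval.
public class DeadNodeInterval {
    public static void main(String[] args) {
        long recheckMs = 600_000;    // dfs.namenode.heartbeat.recheck-interval (10 min)
        long heartbeatMs = 3 * 1000; // dfs.heartbeat.interval (3 s)
        long deadMs = 2 * recheckMs + 10 * heartbeatMs;
        System.out.println(deadMs + " ms = " + deadMs / 60000.0 + " min"); // 1230000 ms = 20.5 min
    }
}
```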
[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-17299: Description: Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ. Our configuration: 1. We use 3 Availability Zones (AZs) for fault tolerance. 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy. 3. We use the following configuration parameters: dfs.namenode.heartbeat.recheck-interval: 60 dfs.heartbeat.interval: 3 So it will take 123 ms (20.5mins) to detect that datanode is dead. Steps to reproduce: # Bring down 1 AZ. # HBase (HDFS client) tries to create a file (WAL file) and then calls hflush on the newly created file. # DataStreamer is not able to find blocks locations that satisfies the rack placement policy (one copy in each rack which essentially means one copy in each AZ) # Since all the datanodes in that AZ are down but still alive to namenode, the client gets different datanodes but still all of them are in the same AZ. See logs below. # HBase is not able to create a WAL file and it aborts the region server. 
Relevant logs from hdfs client and namenode:
{noformat}
2023-12-16 17:17:43,818 INFO [on default port 9000] FSNamesystem.audit - allowed=true ugi=hbase/ (auth:KERBEROS) ip= cmd=create src=/hbase/WALs/ dst=null
2023-12-16 17:17:43,978 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652565_140946716, replicas=:50010, :50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,061 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
    at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
    at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,061 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874--1594838129323:blk_1214652565_140946716
2023-12-16 17:17:44,179 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
2023-12-16 17:17:44,339 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652580_140946764, replicas=:50010, :50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,369 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
    at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
    at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,369 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
2023-12-16 17:17:44,454 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
2023-12-16 17:17:44,522 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652594_140946796, replicas=:50010, :50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,712 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010
    at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
    at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
2023-12-16 17:17:44,712 WARN [Thread-39087] hdfs.DataStreamer - Abandoning BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796
2023-12-16 17:17:44,732 WARN [Thread-39087] hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK]
2023-12-16 17:17:44,855 INFO [on default port 9000] hdfs.StateChange - BLOCK* allocate blk_1214652607_140946850, replicas=:50010, :50010, :50010 for /hbase/WALs/
2023-12-16 17:17:44,867 INFO [Thread-39087] hdfs.DataStreamer - Exception in createBlockOutputStream
java.io.IOException: Got error, status=ERROR, status message , ack with firstBadLink as :50010 a
{noformat}
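As a sanity check on the 20.5-minute detection time quoted in the description: the NameNode declares a datanode dead only after twice the heartbeat recheck interval plus ten heartbeat intervals have elapsed. The exact recheck value is truncated in the report, so the 600-second figure below is an assumption chosen to match the stated 20.5 minutes; a minimal sketch:

```java
public class DeadNodeExpiry {
    public static void main(String[] args) {
        // Assumed values: the report's config values are truncated, so the
        // recheck interval here is back-solved from the 20.5-minute figure.
        long recheckMs = 600_000L;  // dfs.namenode.heartbeat.recheck-interval (assumed 600 s)
        long heartbeatMs = 3_000L;  // dfs.heartbeat.interval: 3 s
        // NameNode dead-node heuristic: 2 * recheck + 10 * heartbeat
        long expiryMs = 2 * recheckMs + 10 * heartbeatMs;
        System.out.println(expiryMs / 1000 + " s = " + (expiryMs / 60000.0) + " min");
    }
}
```

With these values the expiry works out to 1230 seconds, i.e. 20.5 minutes, during which the namenode keeps handing out datanodes from the downed AZ.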
[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-17299: Priority: Critical (was: Major)
[jira] [Created] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.
Rushabh Shah created HDFS-17299: --- Summary: HDFS is not rack failure tolerant while creating a new file. Key: HDFS-17299 URL: https://issues.apache.org/jira/browse/HDFS-17299 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.10.1 Reporter: Rushabh Shah
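The client behavior visible in the log excerpts earlier (abandon the block, exclude the first bad datanode, request a fresh pipeline from the namenode, fail again) can be sketched as below. This is a simplified illustration with made-up node names, not the real DataStreamer code; the actual retry budget is governed by client settings such as dfs.client.block.write.retries.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineRetrySketch {
    static final int MAX_RETRIES = 3; // stand-in for the client's retry budget

    public static void main(String[] args) {
        List<String> excluded = new ArrayList<>();
        // Every replica set the namenode can offer lives in the downed AZ
        // (those nodes are still "alive" to the namenode), so each pipeline
        // setup fails at its first datanode. Node names are hypothetical.
        String[][] allocations = {
            {"az2-dn1", "az2-dn2", "az2-dn3"},
            {"az2-dn4", "az2-dn5", "az2-dn6"},
            {"az2-dn7", "az2-dn8", "az2-dn9"},
        };
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            String firstBadLink = allocations[attempt][0]; // pipeline setup fails here
            System.out.println("Abandoning block, excluding " + firstBadLink);
            excluded.add(firstBadLink);
        }
        // Retry budget exhausted: the create fails, and in HBase's case the
        // region server aborts because it cannot roll a WAL.
        System.out.println("excluded=" + excluded);
    }
}
```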
[jira] [Commented] (HDFS-16540) Data locality is lost when DataNode pod restarts in kubernetes
[ https://issues.apache.org/jira/browse/HDFS-16540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770545#comment-17770545 ] Rushabh Shah commented on HDFS-16540: - > I'll let folks ask for the backport before doing it for branch-3.3. Since the patch here fixes the incident that caused data loss in our production env, I would prefer we backport the patch to all active branches. [~stack] WDYT? I can create a patch if you are busy. > Data locality is lost when DataNode pod restarts in kubernetes > --- > > Key: HDFS-16540 > URL: https://issues.apache.org/jira/browse/HDFS-16540 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.3.2 >Reporter: Huaxiang Sun >Assignee: Huaxiang Sun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 8h 20m > Remaining Estimate: 0h > > We have an HBase RegionServer and an HDFS DataNode running in one pod. When the pod > restarts, we found that data locality is lost after we do a major compaction > of hbase regions. After some debugging, we found that upon pod restart, its IP changes. > In DatanodeManager, maps like networktopology are updated with > the new info, but host2DatanodeMap is not updated accordingly. When an HDFS client > with the new IP tries to find a local DataNode, it fails. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
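The HDFS-16540 symptom described above can be illustrated with plain maps (this is a hypothetical sketch, not the NameNode's actual classes): the NameNode keeps both a UUID-keyed index and an IP-keyed index over the same descriptors, and only one of them was re-keyed when the restarted pod came back with a new IP.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleHostMapSketch {
    public static void main(String[] args) {
        Map<String, String> datanodeMap = new HashMap<>();      // keyed by datanode UUID
        Map<String, String> host2DatanodeMap = new HashMap<>(); // keyed by IP address

        // First registration of the datanode pod (IPs are illustrative).
        datanodeMap.put("DN-UUID-1", "descriptor");
        host2DatanodeMap.put("10.0.0.1", "descriptor");

        // Pod restarts with a new IP. The UUID index still resolves, but per
        // the bug the IP index keeps its entry under the old address.
        datanodeMap.put("DN-UUID-1", "descriptor");

        // A client colocated with the pod looks itself up by its new IP and
        // gets nothing, so reads and writes lose locality.
        System.out.println(host2DatanodeMap.get("10.0.0.2")); // null: locality lost
        System.out.println(host2DatanodeMap.get("10.0.0.1")); // stale hit under old IP
    }
}
```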
[jira] [Updated] (HDFS-17188) Data loss in our production clusters due to missing HDFS-16540
[ https://issues.apache.org/jira/browse/HDFS-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-17188:
Description:
Recently we saw missing blocks in our production clusters running on dynamic environments like AWS. We are running some version of the hadoop-2.10 code line.
Events that led to data loss:
# We have a pool of available IP addresses, and whenever a datanode restarts it uses any available IP address from that pool.
# We have seen, during the lifetime of the namenode process, multiple datanodes restart, with the same datanode using different IP addresses.
# One case that I was debugging was very interesting. DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> ip-address-3. DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> ip-address-1. Observe the last IP address change for DN-UUID-2: it is ip-address-1, which is the first IP address of DN-UUID-1.
# There was some bug in our operational script which led to all datanodes getting restarted at the same time. Just after the restart, we see the following log lines.
{noformat}
2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.1:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.2:50010
2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.3:50010
2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.4:50010
{noformat}
This line is logged [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1184]. Snippet below:
{code:java}
DatanodeDescriptor nodeS = getDatanode(nodeReg.getDatanodeUuid());
DatanodeDescriptor nodeN = host2DatanodeMap.getDatanodeByXferAddr(
    nodeReg.getIpAddr(), nodeReg.getXferPort());

if (nodeN != null && nodeN != nodeS) {
  NameNode.LOG.info("BLOCK* registerDatanode: " + nodeN);
  // nodeN previously served a different data storage,
  // which is not served by anybody anymore.
  removeDatanode(nodeN); // physically remove node from datanodeMap
  wipeDatanode(nodeN);
  nodeN = null;
}
{code}
This happens when the DatanodeDescriptor is not the same in datanodeMap and host2DatanodeMap. HDFS-16540 fixed this bug for lost data locality but not for data loss. :)
By filing this jira, I want to discuss the following things:
# Do we really want to call the removeDatanode method from the namenode whenever any such discrepancy in the maps is spotted, or can we rely on the first full block report or the periodic full block report from the datanode to fix the metadata?
# Improve logging in the blockmanagement code to debug these issues faster.
# Add a test case with the exact events that occurred in our env and still make sure that datanodeMap and host2DatanodeMap are consistent.
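The IP-reuse sequence in the description can be walked through with plain maps (a hypothetical sketch, not HDFS classes): because host2DatanodeMap still maps ip-address-1 to DN-UUID-1's descriptor, DN-UUID-2's registration at ip-address-1 makes the IP lookup (nodeN) disagree with the UUID lookup (nodeS), and the registration path then removes a datanode that is actually alive, along with the block metadata it reported.

```java
import java.util.HashMap;
import java.util.Map;

public class IpReuseSketch {
    public static void main(String[] args) {
        Map<String, String> datanodeMap = new HashMap<>();      // UUID -> descriptor
        Map<String, String> host2DatanodeMap = new HashMap<>(); // IP   -> descriptor

        datanodeMap.put("DN-UUID-1", "desc-1");
        // Stale: DN-UUID-1 has since moved to a different IP, but the
        // IP-keyed index still points its old address at desc-1.
        host2DatanodeMap.put("ip-address-1", "desc-1");

        // DN-UUID-2 re-registers using the recycled ip-address-1.
        String nodeS = datanodeMap.get("DN-UUID-2");         // null: unknown UUID
        String nodeN = host2DatanodeMap.get("ip-address-1"); // desc-1: the WRONG node

        if (nodeN != null && !nodeN.equals(nodeS)) {
            // Mirrors removeDatanode(nodeN)/wipeDatanode(nodeN): desc-1 and the
            // replicas it reported are dropped, although DN-UUID-1 is still live.
            datanodeMap.remove("DN-UUID-1");
            host2DatanodeMap.remove("ip-address-1");
        }
        System.out.println("datanodeMap=" + datanodeMap); // DN-UUID-1 is gone
    }
}
```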
[jira] [Created] (HDFS-17188) Data loss in our production clusters due to missing HDFS-16540
Rushabh Shah created HDFS-17188: --- Summary: Data loss in our production clusters due to missing HDFS-16540 Key: HDFS-17188 URL: https://issues.apache.org/jira/browse/HDFS-17188 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.10.1 Reporter: Rushabh Shah Assignee: Rushabh Shah Recently we saw missing blocks in our production clusters running on dynamic environments like AWS. We are running some version of the hadoop-2.10 code line. Events that led to data loss: # We have a pool of available IP addresses, and whenever a datanode restarts it uses any available IP address from that pool. # We have seen, during the lifetime of a namenode process, multiple datanodes restart, with the same datanode using different IP addresses. # One case that I was debugging was very interesting. DN with datanode UUID DN-UUID-1 moved from ip-address-1 --> ip-address-2 --> ip-address-3. DN with datanode UUID DN-UUID-2 moved from ip-address-4 --> ip-address-5 --> ip-address-1. Observe the last IP address change for DN-UUID-2: it is ip-address-1, which is the first IP address of DN-UUID-1. # There was a bug in our operational script which led to all datanodes getting restarted at the same time. Just after the restart, we see the following log lines. {noformat} 2023-08-26 04:04:41,964 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.1:50010 2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.2:50010 2023-08-26 04:04:45,720 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.2:50010 2023-08-26 04:04:51,680 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.3:50010 2023-08-26 04:04:55,328 INFO [on default port 9000] namenode.NameNode - BLOCK* registerDatanode: 10.x.x.4:50010 {noformat} This line is logged [here|#L1184].
Snippet below: {code:java} DatanodeDescriptor nodeS = getDatanode(nodeReg.getDatanodeUuid()); DatanodeDescriptor nodeN = host2DatanodeMap.getDatanodeByXferAddr( nodeReg.getIpAddr(), nodeReg.getXferPort()); if (nodeN != null && nodeN != nodeS) { NameNode.LOG.info("BLOCK* registerDatanode: " + nodeN); // nodeN previously served a different data storage, // which is not served by anybody anymore. removeDatanode(nodeN); // physically remove node from datanodeMap wipeDatanode(nodeN); nodeN = null; } {code} This happens when the DatanodeDescriptor is not the same in datanodeMap and host2DatanodeMap. HDFS-16540 fixed this bug for lost data locality, not data loss. :) By filing this jira, I want to discuss the following: # Do we really want to call the removeDatanode method from the namenode whenever any such discrepancy between the maps is spotted? Can we rely on the first full block report or the periodic full block report from the datanode to fix the metadata? # Improve logging in the blockmanagement code to debug these issues faster. # Add a test case with the exact events that occurred in our env and still make sure that datanodeMap and host2DatanodeMap are consistent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
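To make the failure mode concrete, here is a toy model of the two registration maps. This is a hypothetical, simplified sketch (the class, map types, and string "descriptor" values are invented for illustration; they are not the real DatanodeManager/DatanodeDescriptor code): datanodeMap is keyed by datanode UUID, host2DatanodeMap by transfer address, and when a restarted datanode re-registers with a recycled IP, the address lookup can return a different, still-live descriptor, which the register path then removes along with all block replicas it was known to hold.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (NOT the real NameNode classes) of the two registration maps.
// Values stand in for DatanodeDescriptor; here a descriptor is just its UUID.
public class RegistrationMaps {
    final Map<String, String> datanodeMap = new HashMap<>();      // uuid -> descriptor
    final Map<String, String> host2DatanodeMap = new HashMap<>(); // ip:port -> descriptor

    void register(String uuid, String addr) {
        String nodeS = datanodeMap.get(uuid);      // lookup by datanode UUID
        String nodeN = host2DatanodeMap.get(addr); // lookup by transfer address
        if (nodeN != null && !nodeN.equals(nodeS)) {
            // Mirrors the snippet above: the descriptor that used to own this
            // address is removed, together with every replica it reported.
            System.out.println("would removeDatanode: " + nodeN);
            datanodeMap.remove(nodeN);
            host2DatanodeMap.values().removeIf(v -> v.equals(nodeN));
        }
        datanodeMap.put(uuid, uuid);
        host2DatanodeMap.put(addr, uuid);
        // Note the modeled bug: re-registering the SAME uuid under a NEW
        // address leaves the old ip:port -> descriptor entry in place.
    }

    public static void main(String[] args) {
        RegistrationMaps nn = new RegistrationMaps();
        nn.register("DN-UUID-1", "ip-1:50010"); // DN-UUID-1 starts on ip-1
        nn.register("DN-UUID-1", "ip-2:50010"); // restarts on ip-2; stale ip-1 entry lingers
        nn.register("DN-UUID-2", "ip-1:50010"); // DN-UUID-2 reuses ip-1: live DN-UUID-1 is removed
        System.out.println("datanodeMap after reuse: " + nn.datanodeMap.keySet());
    }
}
```

Running this prints that DN-UUID-1 (a live node that merely changed address) "would be removed" when DN-UUID-2 reuses its first IP, which is the map discrepancy the jira proposes to reconcile via full block reports instead.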
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629131#comment-17629131 ] Rushabh Shah commented on HDFS-16829: - > If I catch correct if you had that patch in, Then write would have failed > and you would be still safe from the data loss atleast If we had that patch from HDFS-16146 in, then it would have added a new datanode to the pipeline and the write might have succeeded. I don't fully understand what state nodeA was in, because when we add a new node to the pipeline, an existing node has to copy the block to the new node. If the first node was actually sick and restarting, I don't know how that datanode would have handled this scenario. > Delay deleting blocks with older generation stamp until the block is fully > replicated. > -- > > Key: HDFS-16829 > URL: https://issues.apache.org/jira/browse/HDFS-16829 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.10.1 >Reporter: Rushabh Shah >Priority: Critical > > We encountered this data loss issue in one of our production clusters which > runs the hbase service. We received a missing block alert in this cluster. This > error was logged in the datanode holding the block. 
> {noformat} > 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - > nodeA:51010:DataXceiver error processing READ_BLOCK operation src: > /nodeA:31722 dst: > java.io.IOException: Offset 64410559 and length 4096 don't match block > BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 ) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298) > at java.lang.Thread.run(Thread.java:750) > {noformat} > The node +nodeA+ has this block blk_3317546151_2244173222 with file length > 59158528, but the length of this block according to the namenode is 64414655 > (according to fsck). > This is the sequence of events for this block. > > 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and > genstamp: 2244173147. > 2. The first datanode in the pipeline (this physical host was also running the > region server process, which was the hdfs client) was restarting at the same time. > Unfortunately this node was sick and it didn't log anything in either the > datanode process or the regionserver process during the time of block creation. > 3. Namenode updated the pipeline with just the first node. > 4. Namenode logged updatePipeline success with just the 1st node nodeA with block > size: 64414655 and new generation stamp: 2244173222 > 5. Namenode asked nodeB and nodeC to delete the block since it has an old > generation stamp. > 6. All the reads (client reads and data transfer reads) from nodeA are > failing with the above stack trace. > See logs below from namenode and nodeB and nodeC. 
> {noformat} > Logs from namenode - > 2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange - > BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 > , nodeC:51010 for > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147, newGS=2244173222, > newLength=64414655, newNodes=[nodeA:51010], > client=DFSClient_NONMAPREDUCE_1038417265_1) > 2022-10-23 12:36:35,004 INFO [on default port 8020] hdfs.StateChange - DIR* > completeFile: is closed by DFSClient_NONMAPREDUCE_1038417265_1 > {noformat} > {noformat} > - Logs from nodeB - > 2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode - > Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 > from nodeA:30302 > 2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode - > PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, > type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating > 2022-10-23 12:36:39,738 INFO [/data-2/hdfs/current] > impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 > blk_3317546151_2244173147 file > /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151 >
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629109#comment-17629109 ] Rushabh Shah commented on HDFS-16829: - > That is my point like in your case the pipeline landed up having one 1 > datanode and that datanode had some issues. We use the default value, i.e. 0, so it will always throw an IOException if it is unable to find a datanode. In our case, it didn't even attempt to get an additional datanode, since the pipeline was in the PIPELINE_CLOSE or PIPELINE_CLOSE_RECOVERY stage and we didn't have HDFS-16146 in our production environment. So it bailed out before even attempting to get an additional datanode. [~ayushtkn]
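For readers following the config discussion in this thread: the replace-datanode-on-failure knobs are set in hdfs-site.xml on the client side. Below is a sketch showing the defaults as documented in hdfs-default.xml (verify the values against your branch before relying on them):

```xml
<!-- Client-side write-pipeline recovery settings (defaults shown). -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.min-replication</name>
  <!-- 0 (the default, per the comment above): throw an exception when no
       replacement datanode can be found. A value > 0 lets the client keep
       writing as long as at least that many datanodes remain in the pipeline. -->
  <value>0</value>
</property>
```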
[jira] [Comment Edited] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628011#comment-17628011 ] Rushabh Shah edited comment on HDFS-16829 at 11/4/22 3:48 PM: -- {quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. {quote} -I think changing this value will work.- This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]. But before that there used to be a check (in the 2.10 branch) which would not add a node to the pipeline if the pipeline stage was PIPELINE_CLOSE or PIPELINE_CLOSE_RECOVERY. This check was removed by HDFS-16146. We don't have HDFS-16146 in our production environment. was (Author: shahrs87): {quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. {quote} I think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]. But before that there used to be a check (in the 2.10 branch) which would not add a node to the pipeline if the pipeline stage was PIPELINE_CLOSE or PIPELINE_CLOSE_RECOVERY. This check was removed by HDFS-16146. We don't have HDFS-16146 in our production environment.
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629057#comment-17629057 ] Rushabh Shah commented on HDFS-16829: - {quote} dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. I think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492] {quote} Actually this config *won't* work. This config property is _only_ used if we are unable to find any good datanode. If we still have dfs.client.block.write.replace-datanode-on-failure.min-replication number of datanodes in the pipeline, the client will go ahead and continue writing the data; otherwise it will throw an exception. So I think this Jira will still provide value at the cost of some extra disk space.
[jira] [Comment Edited] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629057#comment-17629057 ] Rushabh Shah edited comment on HDFS-16829 at 11/4/22 3:47 PM: -- {quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. I think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492] {quote} Actually this config *won't* work. This config property is _only_ used if we are unable to find any good datanode. If we still have dfs.client.block.write.replace-datanode-on-failure.min-replication number of datanodes in the pipeline, the client will go ahead and continue writing the data otherwise it will throw an exception. So I think this Jira will still provide value with some extra disk space. [~ayushtkn] was (Author: shahrs87): {quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. I think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492] {quote} Actually this config *won't* work. This config property is _only_ used if we are unable to find any good datanode. If we still have dfs.client.block.write.replace-datanode-on-failure.min-replication number of datanodes in the pipeline, the client will go ahead and continue writing the data otherwise it will throw an exception. So I think this Jira will still provide value with some extra disk space. 
[jira] [Comment Edited] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629057#comment-17629057 ] Rushabh Shah edited comment on HDFS-16829 at 11/4/22 3:47 PM: -- {quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. I think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492] {quote} Actually this config *won't* work. This config property is _only_ used if we are unable to find any good datanode. If we still have dfs.client.block.write.replace-datanode-on-failure.min-replication number of datanodes in the pipeline, the client will go ahead and continue writing the data otherwise it will throw an exception. So I think this Jira will still provide value with some extra disk space. was (Author: shahrs87): {quote} dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. I think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492] {quote} Actually this config *won't* work. This config property is _only_ used if we are unable to find any good datanode. If we still have dfs.client.block.write.replace-datanode-on-failure.min-replication number of datanodes in the pipeline, the client will go ahead and continue writing the data otherwise it will throw an exception. So I think this Jira will still provide value with some extra disk space. 
[jira] [Comment Edited] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628011#comment-17628011 ] Rushabh Shah edited comment on HDFS-16829 at 11/2/22 11:07 PM: --- {quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. {quote} I think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]. But before that there used to be a check (in the 2.10 branch) which would not add a node to the pipeline if the pipeline stage was PIPELINE_CLOSE or PIPELINE_CLOSE_RECOVERY. This check was removed by HDFS-16146. We don't have HDFS-16146 in our production environment. was (Author: shahrs87): {quote}dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics Thank you for pointing out this config. Let me check this out. {quote} I don't think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]. But before that there is a check (in the 2.10 branch) which will not add a node to the pipeline if the pipeline stage is PIPELINE_CLOSE or PIPELINE_CLOSE_RECOVERY. This check was removed by HDFS-16146. We don't have HDFS-16146 in our production environment.
> -- > > Key: HDFS-16829 > URL: https://issues.apache.org/jira/browse/HDFS-16829 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.10.1 >Reporter: Rushabh Shah >Priority: Critical > > We encountered this data loss issue in one of our production clusters which > runs hbase service. We received a missing block alert in this cluster. This > error was logged in the datanode holding the block. > {noformat} > 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - > nodeA:51010:DataXceiver error processing READ_BLOCK operation src: > /nodeA:31722 dst: > java.io.IOException: Offset 64410559 and length 4096 don't match block > BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 ) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298) > at java.lang.Thread.run(Thread.java:750) > {noformat} > The node +nodeA+ has this block blk_3317546151_2244173222 with file length: > 59158528 but the length of this block according to namenode is 64414655 > (according to fsck) > This are the sequence of events for this block. > > 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and > genstamp: 2244173147. > 2. The first datanode in the pipeline (This physical host was also running > region server process which was hdfs client) was restarting at the same time. > Unfortunately this node was sick and it didn't log anything neither in > datanode process or regionserver process during the time of block creation. > 3. Namenode updated the pipeline just with the first node. > 4. 
Namenode logged updatePipeline success with just 1st node nodeA with block > size: 64414655 and new generation stamp: 2244173222 > 5. Namenode asked nodeB and nodeC to delete the block since it has old > generation stamp. > 6. All the reads (client reads and data transfer reads) from nodeA are > failing with the above stack trace. > See logs below from namenode and nodeB and nodeC. > {noformat} > Logs from namenode - > 2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange - > BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 > , nodeC:51010 for > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147, newGS=2244173222, > new
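The pipeline-recovery guard discussed in the comment above can be sketched roughly as follows. This is an illustrative, hypothetical model only, not Hadoop's actual DataStreamer code: the class and method names are invented, and the "replace while below the configured floor" trigger is a simplifying assumption standing in for the real replace-datanode-on-failure policy.

```java
// Hypothetical sketch, NOT Hadoop source. Models the 2.10-era guard this
// thread discusses: before HDFS-16146, no replacement datanode was added
// to a write pipeline while the pipeline was closing.
public class PipelineReplacementSketch {
    enum PipelineStage { DATA_STREAMING, PIPELINE_CLOSE, PIPELINE_CLOSE_RECOVERY }

    // minReplication stands in for
    // dfs.client.block.write.replace-datanode-on-failure.min-replication.
    static boolean shouldReplaceDatanode(PipelineStage stage, int liveNodes,
                                         int minReplication) {
        // Pre-HDFS-16146 guard: never swap in a new node mid-close.
        if (stage == PipelineStage.PIPELINE_CLOSE
                || stage == PipelineStage.PIPELINE_CLOSE_RECOVERY) {
            return false;
        }
        // Simplified trigger (assumption): replace only while the pipeline
        // has fewer live nodes than the configured floor.
        return liveNodes < minReplication;
    }

    public static void main(String[] args) {
        System.out.println(shouldReplaceDatanode(PipelineStage.PIPELINE_CLOSE, 1, 2));  // false
        System.out.println(shouldReplaceDatanode(PipelineStage.DATA_STREAMING, 1, 2));  // true
    }
}
```

With HDFS-16146 the stage guard is gone, so on the comment's reading the min-replication floor alone governs whether a replacement node is requested during close.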
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628011#comment-17628011 ] Rushabh Shah commented on HDFS-16829: -
{quote}``dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics{quote}
Thank you for pointing out this config. Let me check this out. I don't think changing this value will work. This config property is used [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1492]. But before that there is a check (in the 2.10 branch) which will not add a node to the pipeline if the pipeline stage is PIPELINE_CLOSE or PIPELINE_CLOSE_RECOVERY. This check was removed by HDFS-16146. We don't have HDFS-16146 in our production environment.
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627294#comment-17627294 ] Rushabh Shah commented on HDFS-16829: - Thank you [~ayushsaxena] for the response.
> What if the client still pushed more data to the single datanode? The data
> contained by the other datanodes won't be complete right?
The proposal is to keep the blocks on disk so some operator can manually restore them. We don't want to automatically make the replica with the old genstamp the primary replica.
> Moreover with this in genuine cases also you would be occupying the memory
> with blocks with old genstamps, if the cluster is really unstable and has a
> lot of updatePipelines or things like that, it may be a issue in that case
IMHO this shouldn't be an issue. We have the ReplicationMonitor thread, which works pretty well. Even in the case of many update pipelines, ReplicationMonitor is fast enough to replicate the blocks to maintain the replication factor. The proposal is just to delay deleting the older genstamp blocks until ReplicationMonitor replicates that block. If you are concerned about this behavior then we can introduce a config key which defaults to the current behavior and, if enabled, delays deletion until the block is fully replicated.
> You can get something if you figure out really quick before the configured
> time after which the block gets deleted.
In this case, the replica with the old genstamp was deleted within 3 seconds after the file was closed. One needs to be superhuman to catch this in production. :)
> dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it
> doesn't behave that way, good to have a new config with such behaviour
> semantics
Thank you for pointing out this config. Let me check this out.
> I think if there is one node only, we can give it to try that we have
> syncBlock always true in such cases, might help or may be not if the Disk has
> some issues in some cases...
Given that I don't have logs either from the datanode or the hdfs client, I am not able to understand what exactly happened. :(
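The proposal in the comment above (keep old-genstamp replicas until ReplicationMonitor restores the block, optionally behind a new config key) could look roughly like the decision function below. Everything here, including the config gating and all names, is a hypothetical sketch of the idea, not existing HDFS code; no such config key exists in Hadoop.

```java
// Hypothetical sketch of the proposal in this thread: gate deletion of a
// replica carrying an outdated generation stamp on how many replicas of the
// current-genstamp block are live. Not existing HDFS behaviour or code.
public class StaleReplicaRetentionSketch {
    static boolean mayDeleteStaleReplica(boolean delayDeletion,
                                         int liveCurrentReplicas,
                                         int replicationFactor) {
        if (!delayDeletion) {
            // Current HDFS behaviour: the stale replica is deleted right away.
            return true;
        }
        // Proposed (config-gated) behaviour: keep the stale replica on disk
        // until ReplicationMonitor has restored full replication, so an
        // operator can still recover data from it if the surviving replica
        // turns out to be corrupt.
        return liveCurrentReplicas >= replicationFactor;
    }

    public static void main(String[] args) {
        // The incident in this issue: one (bad) replica live, factor 3.
        System.out.println(mayDeleteStaleReplica(true, 1, 3));  // false: keep nodeB/nodeC copies
        System.out.println(mayDeleteStaleReplica(true, 3, 3));  // true: safe to delete now
    }
}
```

In the incident described here, such a gate would have kept the intact replicas on nodeB and nodeC until the truncated nodeA replica had been re-replicated, instead of deleting them within seconds of file close.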
[jira] [Updated] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-16829: Priority: Critical (was: Major)
[jira] [Updated] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah updated HDFS-16829:
Description:
We encountered this data loss issue in one of our production clusters which runs the hbase service. We received a missing block alert in this cluster. This error was logged in the datanode holding the block.
{noformat}
2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - nodeA:51010:DataXceiver error processing READ_BLOCK operation src: /nodeA:31722 dst:
java.io.IOException: Offset 64410559 and length 4096 don't match block BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
 at org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
 at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
 at java.lang.Thread.run(Thread.java:750)
{noformat}
The node +nodeA+ has this block blk_3317546151_2244173222 with file length 59158528, but the length of this block according to the namenode is 64414655 (according to fsck).
This is the sequence of events for this block:
1. Namenode created a file with 3 replicas with block id: blk_3317546151 and genstamp: 2244173147.
2. The first datanode in the pipeline (this physical host was also running the region server process, which was the hdfs client) was restarting at the same time. Unfortunately this node was sick and it didn't log anything in either the datanode process or the regionserver process during the time of block creation.
3. Namenode updated the pipeline with just the first node.
4. Namenode logged updatePipeline success with just the 1st node nodeA, with block size 64414655 and new generation stamp 2244173222.
5. Namenode asked nodeB and nodeC to delete the block since it has an old generation stamp.
6. All the reads (client reads and data transfer reads) from nodeA are failing with the above stack trace.
See logs below from the namenode and from nodeB and nodeC.
{noformat}
Logs from namenode -
2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange - BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010, nodeC:51010 for
2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - updatePipeline(blk_3317546151_2244173147, newGS=2244173222, newLength=64414655, newNodes=[nodeA:51010], client=DFSClient_NONMAPREDUCE_1038417265_1)
2022-10-23 12:36:35,004 INFO [on default port 8020] hdfs.StateChange - DIR* completeFile: is closed by DFSClient_NONMAPREDUCE_1038417265_1
{noformat}
{noformat}
- Logs from nodeB -
2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode - Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 from nodeA:30302
2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode - PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
2022-10-23 12:36:39,738 INFO [/data-2/hdfs/current] impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 blk_3317546151_2244173147 file /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}
{noformat}
- Logs from nodeC -
2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode - Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 from nodeB:56486
2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode - PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, type=LAST_IN_PIPELINE terminating
2022-10-23 12:36:38,891 INFO [/data-1/hdfs/current] impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 blk_3317546151_2244173147 file /data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
{noformat}
This is my observation, but I can't confirm anything since I don't have logs from the first node in the pipeline (nodeA) or from the hdfs client: due to disk write errors or a bug in BlockReceiver, nodeA didn't flush/sync the last ~5MB (64414655-59158528) of data to disk. I assume it was buffered in memory, since the namenode got an acknowledgement from the client that updatePipeline succeeded.
*Improvement:*
One suggested improvement is to delay deleting the blocks with the old generation stamp until the block is replicated sufficiently whi
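As a quick sanity check of the numbers in the description above, the gap between the length the namenode recorded for the block and the on-disk length reported by nodeA is indeed about 5 MB:

```java
// Sanity check of the size gap cited in this issue: the namenode-recorded
// length of blk_3317546151 minus nodeA's on-disk block length, i.e. the
// bytes that apparently never reached disk.
public class BlockGap {
    public static void main(String[] args) {
        long namenodeLength = 64414655L; // newLength from updatePipeline / fsck
        long onDiskLength = 59158528L;   // blockLen in nodeA's IOException
        long missing = namenodeLength - onDiskLength;
        // 5256127 bytes, about 5.01 MiB
        System.out.println(missing + " bytes, about "
                + String.format("%.2f", missing / (1024.0 * 1024.0)) + " MiB");
    }
}
```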
[jira] [Created] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
Rushabh Shah created HDFS-16829: --- Summary: Delay deleting blocks with older generation stamp until the block is fully replicated. Key: HDFS-16829 URL: https://issues.apache.org/jira/browse/HDFS-16829 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 2.10.1 Reporter: Rushabh Shah