[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827848#comment-17827848
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2003013056

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 47s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 27s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  13m  5s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   2m 11s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m 48s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 47s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 56s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   1m 20s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   2m 44s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/9/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 25s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m  7s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   2m  7s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 44s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m 44s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 40s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 283 unchanged - 2 fixed = 283 total (was 285)  |
   | +1 :green_heart: |  mvnsite  |   1m 37s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 46s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   1m 12s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   4m 44s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 28s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  97m  4s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/9/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 36s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 153m 25s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.TestFileLengthOnClusterRestart |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/9/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6614 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 4fef73a925b4 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827844#comment-17827844
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2002971355

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 22s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 24s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  10m 26s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   1m 30s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m 12s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 16s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 17s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 56s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   1m 43s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/8/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 18s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  3s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m  9s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 24s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 283 unchanged - 2 fixed = 283 total (was 285)  |
   | +1 :green_heart: |  mvnsite  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 11s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   2m 49s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 13s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  74m  5s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/8/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 25s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 113m 47s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.web.TestWebHDFS |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.TestDFSInotifyEventInputStream |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/8/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6614 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 20973d4a3aad 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-2.10 / 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-17 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827839#comment-17827839
 ] 

Rushabh Shah commented on HDFS-17299:
-

Merged to branch-2.10 and branch-3.3 also. Thank you [~gargrite] for the patch.

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.10.3, 3.3.9, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> 
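For reference, the ~20.5-minute detection window quoted in the description above follows from the NameNode's dead-node heuristic: 2 × dfs.namenode.heartbeat.recheck-interval (ms) + 10 × dfs.heartbeat.interval (s) × 1000. Below is a minimal, hypothetical Java sketch of that arithmetic using the configuration values from the description; the class and method names are illustrative only and not part of HDFS.

{code:java}
// Hypothetical helper mirroring the NameNode's dead-node expiry calculation:
// expiry(ms) = 2 * recheck-interval(ms) + 10 * 1000 * heartbeat-interval(s)
public class HeartbeatExpirySketch {
  static long heartbeatExpireIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
    return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
  }

  public static void main(String[] args) {
    // Values from the issue description: recheck-interval 600000 ms, heartbeat interval 3 s.
    long expiryMs = heartbeatExpireIntervalMs(600_000L, 3L);
    System.out.println(expiryMs + " ms");                      // 1230000 ms
    System.out.printf("%.1f minutes%n", expiryMs / 60_000.0);  // ~20.5 minutes
  }
}
{code}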

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827838#comment-17827838
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 merged PR #6614:
URL: https://github.com/apache/hadoop/pull/6614




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.9, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827805#comment-17827805
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on code in PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#discussion_r1527571968


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -1707,4 +1708,154 @@ public void testStorageFavouredNodes()
       assertEquals("Number of SSD should be 1 but was : " + numSSD, 1, numSSD);
     }
   }
+
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() throws Exception {
+    Configuration conf = getTestConfiguration();
+    conf.setClass(
+        DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+        BlockPlacementPolicyRackFaultTolerant.class,
+        BlockPlacementPolicy.class);
+    conf.setBoolean(
+        HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+        false);
+    conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+        MIN_REPLICATION, 2);
+    // 3 racks & 6 nodes. 1 per rack for 2 racks and 3 nodes in the 3rd rack
+    try (MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(6)
+        .racks(new String[] {"/rack1", "/rack2", "/rack3", "/rack3", "/rack3", "/rack3"}).build()) {
+      cluster.waitClusterUp();
+      DistributedFileSystem fs = cluster.getFileSystem();
+      // kill all the DNs in the 3rd rack.
+      cluster.stopDataNode(5);
+      cluster.stopDataNode(4);
+      cluster.stopDataNode(3);
+      cluster.stopDataNode(2);
+
+      // create a file with replication 3, for rack fault tolerant BPP,
+      DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 1024L);
+    }
+  }
+
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationImpossible()
+      throws Exception {
+    Configuration conf = getTestConfiguration();
+    conf.setClass(DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+        BlockPlacementPolicyRackFaultTolerant.class, BlockPlacementPolicy.class);
+    conf.setBoolean(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY, false);
+    conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.MIN_REPLICATION, 3);
+    // 3 racks & 6 nodes. 1 per rack for 2 racks and 3 nodes in the 3rd rack
+    try (MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(6)
+        .racks(new String[] {"/rack1", "/rack2", "/rack3", "/rack3", "/rack3", "/rack3"}).build()) {
+      cluster.waitClusterUp();
+      DistributedFileSystem fs = cluster.getFileSystem();
+      // kill one DN, so only 2 racks stays with active DN
+      cluster.stopDataNode(5);
+      cluster.stopDataNode(4);
+      cluster.stopDataNode(3);
+      cluster.stopDataNode(2);
+      boolean threw = false;
+      try {
+        DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 1024L);
+      } catch (IOException e) {
+        threw = true;
+      }
+      assertTrue(threw);
+    }
+  }
+
+  @Test
+  public void testMultipleRackFailureDuringPipelineSetupMinReplicationPossible() throws Exception {
+    Configuration conf = getTestConfiguration();
+    conf.setClass(
+        DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+        BlockPlacementPolicyRackFaultTolerant.class,
+        BlockPlacementPolicy.class);
+    conf.setBoolean(
+        HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+        false);
+    conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+        MIN_REPLICATION, 1);
+    // 3 racks & 3 nodes. 1 per rack

Review Comment:
   The comments are not correct. Can you please update them? @ritegarg 



##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -1707,4 +1708,154 @@ public void testStorageFavouredNodes()
       assertEquals("Number of SSD should be 1 but was : " + numSSD, 1, numSSD);
     }
   }
+
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() throws Exception {
+    Configuration conf = getTestConfiguration();
+    conf.setClass(
+        DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+        BlockPlacementPolicyRackFaultTolerant.class,
+        BlockPlacementPolicy.class);
+    conf.setBoolean(
+        HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+        false);
+    conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+        MIN_REPLICATION, 2);
+    // 3 racks & 6 nodes. 1 per rack for 2 racks and 3 nodes in the 3rd rack
+    try (MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(6)
+        .racks(new String[] {"/rack1", "/rack2", "/rack3", "/rack3", "/rack3", "/rack3"}).build()) {
+  

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827762#comment-17827762
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2002275973

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 22s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 26s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  10m 48s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   1m 27s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 30s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 15s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 17s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 59s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   1m 42s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/7/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 17s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  5s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 23s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  7s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m  7s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 23s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 283 unchanged - 2 fixed = 283 total (was 285)  |
   | +1 :green_heart: |  mvnsite  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m  7s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   2m 47s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 11s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  76m 26s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 26s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 116m  3s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/7/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6614 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 6bba757d3ead 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827754#comment-17827754
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2002216244

   > All the tests (except for 1 test) that failed in [this 
build](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/6/) are 
flaky. They are failing in daily build also. Check 
[here](https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1333/)
 for the latest build. The only test that is not failing in nightly build is: 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion.testApplyEditLogForDeletion
   > 
   > @ritegarg Can you please check if this is flaky too? Thank you !
   
   I ran the tests locally with and without my changes and I am observing the 
same failure in both.




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.9, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827751#comment-17827751
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2002199316

   All the tests (except for 1 test) that failed in [this 
build](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/6/)  are 
flaky. They are failing in daily build also. Check 
[here](https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1333/)
 for the latest build.
   The only test that is not failing in nightly build is:
   
org.apache.hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion.testApplyEditLogForDeletion
   
   @ritegarg  Can you please check if this is flaky too? Thank you !
   
   




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.9, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827635#comment-17827635
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2001010847

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 26s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  10m  4s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   1m 29s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m 13s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 30s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 14s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 17s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   1m 42s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/6/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 45s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  2s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m  9s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 26s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 283 unchanged - 2 fixed = 283 total (was 285)  |
   | +1 :green_heart: |  mvnsite  |   1m  5s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 10s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   2m 43s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 11s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  79m 16s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/6/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 25s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 119m  3s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.TestDFSInotifyEventInputStream |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.fs.viewfs.TestViewFileSystemHdfs |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6614 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux cb6f61f65214 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827626#comment-17827626
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2000788575

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 39s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 33s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  13m  6s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   2m 12s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m 51s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 45s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 53s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 59s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   1m 26s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   2m 43s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/5/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m  6s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   2m  6s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 45s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m 45s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 41s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/5/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 1 new + 283 unchanged - 2 fixed = 
284 total (was 285)  |
   | +1 :green_heart: |  mvnsite  |   1m 40s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 47s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   1m 14s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   4m 40s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 34s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  80m 21s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 38s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 137m 20s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.TestMultipleNNPortQOP |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion |
   |   | hadoop.fs.viewfs.TestViewFileSystemHdfs |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6614 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827624#comment-17827624
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2000757455

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |  11m 25s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 29s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  13m  7s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   2m 12s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m 48s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 47s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 56s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   1m 20s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   2m 44s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/4/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 24s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m  6s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   2m  6s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 45s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m 45s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 39s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/4/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 1 new + 283 unchanged - 2 fixed = 
284 total (was 285)  |
   | +1 :green_heart: |  mvnsite  |   1m 36s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 46s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   1m 12s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   4m 43s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 29s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  98m 43s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 36s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 165m 51s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.TestMultipleNNPortQOP |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.TestFileLengthOnClusterRestart |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion |
   |   | hadoop.hdfs.TestDFSInotifyEventInputStream |
   |   | hadoop.hdfs.server.namenode.TestCheckpoint |
   |   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/4/artifact/out/Dockerfile
 |
   | GITHUB PR | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827617#comment-17827617
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-2000644545

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 22s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 26s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  10m 15s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   1m 30s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m  9s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 16s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 20s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   1m 43s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/3/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 17s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 10s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m 10s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 23s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/3/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 1 new + 283 unchanged - 2 fixed = 
284 total (was 285)  |
   | +1 :green_heart: |  mvnsite  |   1m  3s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 11s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   2m 45s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 12s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  80m  2s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 24s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 119m 48s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap |
   |   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6614 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 49763d196457 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827613#comment-17827613
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#discussion_r1526907541


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestCrcCorruption.java:
##
@@ -86,7 +86,7 @@ public void setUp() throws IOException {
* create/write. To recover from corruption while writing, at
* least two replicas are needed.
*/
-  @Test(timeout=5)
+  @Test(timeout=60)

Review Comment:
   Fixed. We needed to add this commit: 
https://github.com/apache/hadoop/pull/6614/commits/01d950accaaf929542d68d784a21fee12505ab49 
(the same as the one in trunk/3.3).
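
   For context, JUnit 4 interprets the `@Test(timeout=...)` value in milliseconds, i.e. it is a per-test wall-clock budget. A minimal, hypothetical sketch of that usage (class name and timeout value are illustrative only, not taken from the patch):

```java
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class TimeoutBudgetExample {
  // JUnit 4 aborts and fails the test if it runs longer than the
  // given number of milliseconds (here: 60 seconds).
  @Test(timeout = 60_000)
  public void recoversFromCorruptionWithinBudget() throws Exception {
    // ... exercise the write/recovery path under test ...
    assertTrue(true);
  }
}
```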





> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.9, 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
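
A minimal sketch of how the two properties above combine into the quoted ~20.5 minute window, assuming the NameNode's usual "2 × recheck-interval + 10 × heartbeat-interval" dead-node formula (values taken from the configuration above):

{code:java}
public class DeadNodeWindow {
  public static void main(String[] args) {
    long recheckIntervalMs  = 600_000L; // dfs.namenode.heartbeat.recheck-interval (ms)
    long heartbeatIntervalS = 3L;       // dfs.heartbeat.interval (seconds)

    // Assumed formula: 2 * recheck-interval + 10 * heartbeat-interval.
    long expireMs = 2 * recheckIntervalMs + 10 * 1000L * heartbeatIntervalS;

    // Prints: Datanode declared dead after 1230000 ms (~20.5 min)
    System.out.printf("Datanode declared dead after %d ms (~%.1f min)%n",
        expireMs, expireMs / 60000.0);
  }
}
{code}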
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (the HDFS client) tries to create a file (a WAL file) and then calls 
> hflush on the newly created file (see the client-side sketch after this list).
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but are still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them are 
> in the same AZ. See logs below.
>  # HBase is not able to create the WAL file and aborts the region server.
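
The client-side sequence in steps 2–3 boils down to a create followed by an hflush. A hypothetical, self-contained sketch of that sequence (the path and payload are illustrative, not from the original report):

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalCreateHflush {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/hbase/WALs/example-wal"))) {
      out.write("wal entry".getBytes(StandardCharsets.UTF_8));
      // hflush pushes the data onto a datanode pipeline; with one AZ down but its
      // datanodes still "alive" to the namenode, pipeline setup keeps failing as
      // shown in the DataStreamer log below.
      out.hflush();
    }
  }
}
{code}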
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827279#comment-17827279
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1998589730

   All the failed tests in this build are flaky. Merging this PR. Thank you 
@ritegarg for your contribution!





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827280#comment-17827280
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 merged PR #6612:
URL: https://github.com/apache/hadoop/pull/6612





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827260#comment-17827260
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1998524754

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 19s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 21s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 40s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 18s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 37s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 32s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 28s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/11/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m 18s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 22s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 12s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 12s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 254 unchanged - 3 fixed = 254 total (was 257)  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 16s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m  6s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 48s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 174m  2s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/11/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 31s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 277m 34s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.TestBlocksScheduledCounter |
   |   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
   |   | hadoop.hdfs.server.namenode.TestNameNodeMXBean |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/11/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 89ea6b6b9885 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / f780ddf7235805441de364bbc4e9385b2e414527 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/11/testReport/ |
   | Max. process+thread count | 4500 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client 
hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project |
   | Console output | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827180#comment-17827180
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on code in PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#discussion_r1525236336


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestCrcCorruption.java:
##
@@ -86,7 +86,7 @@ public void setUp() throws IOException {
* create/write. To recover from corruption while writing, at
* least two replicas are needed.
*/
-  @Test(timeout=5)
+  @Test(timeout=60)

Review Comment:
   Do we really need to increase the timeout by 10x? 






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827179#comment-17827179
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-1997942382

   @ritegarg Can you please check whether the checkstyle warning is relevant?





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826921#comment-17826921
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1996287519

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   2m 45s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m  6s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m 28s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 21s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 39s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 30s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 42s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 36s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/9/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m 15s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 22s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 30s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 30s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 30s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 254 unchanged - 3 fixed = 254 total (was 257)  |
   | +1 :green_heart: |  mvnsite  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 42s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 38s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 46s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 183m 10s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/9/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 31s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 291m 33s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.TestDFSStripedInputStream |
   |   | hadoop.hdfs.TestFileChecksumCompositeCrc |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.TestReconstructStripedFile |
   |   | hadoop.hdfs.server.balancer.TestBalancer |
   |   | hadoop.hdfs.tools.TestECAdmin |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/9/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 0d63f43cdbae 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / f780ddf7235805441de364bbc4e9385b2e414527 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/9/testReport/ |
   | Max. process+thread count | 4136 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client 
hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project |
   | Console output 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826919#comment-17826919
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1996283905

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   3m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m  7s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  24m 22s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 21s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 42s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 34s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 40s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 35s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/8/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m 25s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 23s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 28s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 17s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 17s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 34s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 254 unchanged - 3 fixed = 254 total (was 257)  |
   | +1 :green_heart: |  mvnsite  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 48s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 25s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 50s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 177m 13s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/8/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 28s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 287m 38s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   |   | hadoop.hdfs.TestReadStripedFileWithDecoding |
   |   | hadoop.hdfs.TestReconstructStripedFileWithValidator |
   |   | hadoop.hdfs.TestDecommissionWithStriped |
   |   | hadoop.hdfs.TestDFSStripedOutputStreamWithRandomECPolicy |
   |   | hadoop.hdfs.TestViewDistributedFileSystem |
   |   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.TestErasureCodingPoliciesWithRandomECPolicy |
   |   | hadoop.hdfs.server.balancer.TestBalancer |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/8/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 87f15e31dfb0 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / f780ddf7235805441de364bbc4e9385b2e414527 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826916#comment-17826916
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1996262022

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 19s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  1s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 26s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 37s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 19s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 40s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 27s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 31s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/7/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m 17s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 22s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 13s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 13s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 32s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 254 unchanged - 3 fixed = 254 total (was 257)  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 21s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 11s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 45s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 172m 13s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 30s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 276m 37s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.TestReconstructStripedFile |
   |   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/7/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 6b8506aa5106 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 44afa1412ebfda61c31d82d9c492e4aa044b6089 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/7/testReport/ |
   | Max. process+thread count | 4496 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client 
hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/7/console |
   | versions | git=2.17.1 maven=3.6.0 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826908#comment-17826908
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-1996218189

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   4m 46s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 27s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  10m  7s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   1m 32s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m 14s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 13s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 17s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 59s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   1m 37s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/2/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 18s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  5s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  8s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m  8s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 25s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/2/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 1 new + 283 unchanged - 2 fixed = 
284 total (was 285)  |
   | +1 :green_heart: |  mvnsite  |   1m  5s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 11s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   2m 49s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 10s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  76m 18s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 23s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 120m  6s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6614 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux e513def84f1a 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826892#comment-17826892
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-1996070790

   Updated some tests. Some of the failures are false positives and can be 
ignored, since they fail in the main Hadoop repo as well. 
   
   Test Name | With Changes (locally) | Without Changes
   


[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826870#comment-17826870
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1995857608

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 44s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 42s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 17s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 38s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 29s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 40s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 28s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/6/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m 35s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 22s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 12s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 12s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 254 unchanged - 3 fixed = 254 total (was 257)  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   1m 28s |  |  
hadoop-hdfs-project/hadoop-hdfs-client generated 0 new + 1 unchanged - 1 fixed 
= 1 total (was 2)  |
   | +1 :green_heart: |  spotbugs  |   1m 53s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  shadedclient  |  22m  7s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 46s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 173m  2s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/6/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 32s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 277m 51s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
   |   | hadoop.hdfs.server.balancer.TestBalancer |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux ddde9c9242be 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 0bf2223760f96a5b370b2caf7e92fa95ba8a72de |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/6/testReport/ |
   | Max. 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826792#comment-17826792
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1994939750

   > @ritegarg Thank you for the PR. Overall looks good. Looks like `blanks` 
and `spotbugs` warnings are relevant. Can you please fix them? Once that is 
done, I will approve and merge.
   
   I see 2 spotbugs failures:
   1. Possible null dereference in DFSOutputStream.java:[line 314] -> there is a Preconditions.checkNotNull just before it, on line 312.
   2. Redundant null check of a possible null value in PeerCache.java:[line 158] -> removed the null check.
   
   Also removed the blank line.
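   For context, a minimal hypothetical illustration of the two patterns being discussed; the class and method names are made up, and this is not the actual DFSOutputStream or PeerCache code:
   
   ```java
   // Hypothetical sketch of the two spotbugs findings above; names and line
   // numbers are illustrative only, not the real Hadoop code.
   import com.google.common.base.Preconditions;
   
   class SpotbugsFindingSketch {
   
     // Finding 1: spotbugs may still report "possible null dereference" on a
     // value even when a Preconditions.checkNotNull a couple of lines earlier
     // already guarantees it is non-null at the point of use.
     static int guardedDereference(String coordinatorName) {
       Preconditions.checkNotNull(coordinatorName, "coordinatorName is null");
       return coordinatorName.length(); // safe: guarded by the check above
     }
   
     // Finding 2: a redundant null check on a value that cannot be null on this
     // path; the fix in the PR was simply to drop the guard.
     static int redundantGuard(String key) {
       String normalized = key.trim(); // String.trim() never returns null
       // if (normalized != null) {    // redundant guard, removed
       return normalized.length();
       // }
     }
   }
   ```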




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 1,230,000 ms (20.5 minutes) to detect that a datanode is dead.
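
A minimal sketch of where that detection window comes from, using the standard HDFS dead-node heuristic of 2 × recheck interval + 10 × heartbeat interval; the millisecond values below are assumptions chosen to match the 20.5-minute figure above, not values read from the cluster:

{code:java}
// Sketch only: the concrete values are assumptions for illustration.
public class DeadNodeIntervalSketch {
  public static void main(String[] args) {
    long recheckIntervalMs = 600_000L;    // dfs.namenode.heartbeat.recheck-interval (assumed)
    long heartbeatIntervalMs = 3L * 1000; // dfs.heartbeat.interval of 3 s, in ms

    // The namenode only declares a datanode dead after roughly
    // 2 * recheck interval + 10 * heartbeat interval.
    long deadIntervalMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
    System.out.println(deadIntervalMs + " ms (~" + deadIntervalMs / 60000.0 + " minutes)");
    // Prints: 1230000 ms (~20.5 minutes)
  }
}
{code}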
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file (see the sketch after this list).
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client keeps getting different datanodes, but all of them 
> are in the same AZ. See the logs below.
>  # HBase is not able to create a WAL file, so it aborts the region server.
>  
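
A minimal, hypothetical client-side sketch of steps 2–3 above; the path and payload are illustrative, and this is not the actual HBase WAL code:

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalCreateSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Step 2: create a new file (analogous to an HBase WAL) and hflush it.
    try (FSDataOutputStream out = fs.create(new Path("/hbase/WALs/example-wal"))) {
      out.write("wal-entry".getBytes(StandardCharsets.UTF_8));
      // hflush() forces the write pipeline to be set up; with one AZ down,
      // DataStreamer can repeatedly be offered datanodes that fail the
      // rack-fault-tolerant placement check, as in the logs below.
      out.hflush();
    }
  }
}
{code}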
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=true  ugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826780#comment-17826780
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1994802519

   @ritegarg  Thank you for the PR.
   Overall looks good.
   Looks like `blanks` and `spotbugs` warnings are relevant. Can you please fix 
them?
   Once that is done, I will approve and merge.
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824999#comment-17824999
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1987056450

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   3m 46s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 18s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m  0s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 19s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 38s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 35s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 36s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 31s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/5/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m 20s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 47s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 10s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 10s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/5/artifact/out/blanks-eol.txt)
 |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | +1 :green_heart: |  checkstyle  |   0m 34s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 249 unchanged - 3 fixed = 249 total (was 252)  |
   | +1 :green_heart: |  mvnsite  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 21s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 14s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 46s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 172m 33s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 32s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 281m 22s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.datanode.TestDataNodeRollingUpgrade |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 8885b87773d0 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / fa769f6cfa7845ef46da6a862c6ccab699956446 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/5/testReport/ |
   | Max. process+thread count | 4406 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client 
hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project |
   | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824993#comment-17824993
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1987003317

   Ran the tests locally and observed that some of the failures are not 
reproducible; those tests pass. One test fails both with and without the 
changes, and one test is flaky both upstream and in this PR.
   
   
   Test Name | With Changes (locally) | Without Changes
   


[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824992#comment-17824992
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1986999110

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   3m 51s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 43s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 41s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 15s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 39s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 29s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 37s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 29s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/4/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m 34s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 23s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 13s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 13s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 249 unchanged - 3 fixed = 249 total (was 252)  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 16s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m  1s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 46s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 174m 18s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 31s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 283m 20s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
   |   | hadoop.hdfs.server.datanode.TestDataNodeRollingUpgrade |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | 
hadoop.hdfs.server.blockmanagement.TestReconstructStripedBlocksWithRackAwareness
 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux b7a1a36bac6b 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 39ce2d70cc78ba50870927ca7d2235fa25e00c66 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/4/testReport/ |
   | Max. process+thread count | 4387 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client 
hadoop-hdfs-project/hadoop-hdfs U: 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824531#comment-17824531
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1984392158

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  1s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 44s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 14s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 16s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 38s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 31s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 37s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 27s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/3/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m  3s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 23s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 13s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 13s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 31s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 249 unchanged - 3 fixed = 249 total (was 252)  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m  3s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 47s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 121m 34s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +0 :ok: |  asflicense  |   0m 27s |  |  ASF License check generated no 
output?  |
   |  |   | 225m 17s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestDecommissionWithStripedBackoffMonitor 
|
   |   | hadoop.hdfs.TestEncryptedTransfer |
   |   | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes |
   |   | hadoop.hdfs.server.datanode.TestBatchIbr |
   |   | hadoop.hdfs.server.datanode.TestBlockScanner |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics |
   |   | hadoop.hdfs.TestErasureCodingPolicyWithSnapshotWithRandomECPolicy |
   |   | hadoop.hdfs.server.datanode.TestDataNodeFaultInjector |
   |   | hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations |
   |   | hadoop.hdfs.server.datanode.TestBlockRecovery2 |
   |   | hadoop.hdfs.TestParallelUnixDomainRead |
   |   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
   |   | hadoop.hdfs.TestDFSStripedOutputStreamWithRandomECPolicy |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux e6f838e8b04d 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824518#comment-17824518
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1984310623

   > There are a few test failures. Can you please take a look? @ritegarg
   
   I was looking into the failures; they look like transient failures. The same 
tests run fine locally.





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824478#comment-17824478
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1984012754

   There are a few test failures. Can you please take a look? @ritegarg 





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824348#comment-17824348
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1983180401

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 22s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 59s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 27s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 14s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 37s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 27s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 37s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 26s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/2/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  23m 12s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 21s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m  9s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m  9s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 33s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 249 unchanged - 3 fixed = 249 total (was 252)  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m  8s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 49s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 172m 34s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 31s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 276m 57s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.mover.TestMover |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 8029685ad3de 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 5d4a6ed957d86f85618f70f27d11f6077336b16f |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/2/testReport/ |
   | Max. process+thread count | 4424 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client 
hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project |
   | Console output | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824294#comment-17824294
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

Hexiaoqiao commented on PR #6613:
URL: https://github.com/apache/hadoop/pull/6613#issuecomment-1982836474

   Hi @ritegarg, thanks for your PR. branch-3.2 is EOL, so we should not submit 
PRs to that branch. I will close this one. Please feel free to reopen it if 
there is something I missed. Thanks again.





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824293#comment-17824293
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

Hexiaoqiao closed pull request #6613: HDFS-17299. Adding rack failure tolerance 
when creating a new file  (…
URL: https://github.com/apache/hadoop/pull/6613




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> 
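
For context when skimming the digest above, here is a minimal, illustrative sketch of what the quoted settings imply and of the create-then-hflush path that fails in the reproduction steps. The detection window uses the standard namenode heuristic (2 x dfs.namenode.heartbeat.recheck-interval + 10 x dfs.heartbeat.interval). The namenode URI, the WAL path, and setting dfs.block.replicator.classname on a client-side Configuration are assumptions made for the example only (the placement policy is really a namenode-side setting); they are not details taken from the report.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RackFaultTolerantWalSketch {
  public static void main(String[] args) throws Exception {
    // Values quoted in the issue description above.
    long recheckIntervalMs = 600_000L; // dfs.namenode.heartbeat.recheck-interval (ms)
    long heartbeatIntervalSec = 3L;    // dfs.heartbeat.interval (seconds)

    // Namenode heuristic for declaring a datanode dead:
    // 2 * recheck-interval + 10 * heartbeat-interval.
    long deadNodeWindowMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalSec * 1000;
    System.out.println("Dead-node detection window: " + deadNodeWindowMs + " ms"); // 1230000 ms, ~20.5 min

    Configuration conf = new Configuration();
    // Placement policy named in the report. This key is honoured by the namenode,
    // not the client; it is set here only to document the value being discussed.
    conf.set("dfs.block.replicator.classname",
        "org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant");

    // WAL-style write from the reproduction steps: create a file, write a record,
    // then hflush. With one AZ down but its datanodes still "alive" to the
    // namenode, this is where pipeline setup keeps failing in the logs above.
    // The URI and path below are placeholders, not values from the report.
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf)) {
      Path wal = new Path("/hbase/WALs/example-wal");
      try (FSDataOutputStream out = fs.create(wal, true)) {
        out.writeBytes("wal entry\n");
        out.hflush(); // forces a datanode pipeline; throws if no valid pipeline can be built
      }
    }
  }
}
```

Nothing in this sketch changes the behaviour being reported; it only restates the configuration and the failing call path in code form.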

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824260#comment-17824260
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6614:
URL: https://github.com/apache/hadoop/pull/6614#issuecomment-1982453381

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   8m  5s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-2.10 Compile Tests _ |
   | +0 :ok: |  mvndep  |   2m 38s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  13m 36s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  compile  |   2m 12s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  compile  |   1m 53s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 53s |  |  branch-2.10 passed  |
   | +1 :green_heart: |  javadoc  |   1m 59s |  |  branch-2.10 passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  branch-2.10 passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | -1 :x: |  spotbugs  |   2m 43s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/1/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs in branch-2.10 has 1 extant spotbugs 
warnings.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m  7s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javac  |   2m  7s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 45s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  javac  |   1m 45s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 41s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/1/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 1 new + 274 unchanged - 2 fixed = 
275 total (was 276)  |
   | +1 :green_heart: |  mvnsite  |   1m 38s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 48s |  |  the patch passed with JDK 
Azul Systems, Inc.-1.7.0_262-b10  |
   | +1 :green_heart: |  javadoc  |   1m 15s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09  |
   | +1 :green_heart: |  spotbugs  |   4m 43s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 32s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  81m 56s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 147m 17s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.TestCrcCorruption |
   |   | hadoop.hdfs.TestDistributedFileSystem |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap |
   |   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion |
   |   | hadoop.hdfs.TestDFSInotifyEventInputStream |
   |   | hadoop.hdfs.tools.TestDFSAdminWithHA |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6614/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6614 |
   | Optional Tests | dupname asflicense compile javac 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824247#comment-17824247
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6613:
URL: https://github.com/apache/hadoop/pull/6613#issuecomment-1982313394

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   7m 59s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.2 Compile Tests _ |
   | +0 :ok: |  mvndep  |   4m  2s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  30m  2s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  compile  |   3m 24s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  checkstyle  |   1m  5s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  mvnsite  |   2m  9s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  javadoc  |   1m 44s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  spotbugs  |   5m 20s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  shadedclient  |  16m 32s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 48s |  |  Maven dependency ordering for patch  |
   | -1 :x: |  mvninstall  |   0m 32s | 
[/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in the patch failed.  |
   | -1 :x: |  compile  |   0m 33s | 
[/patch-compile-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/patch-compile-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project in the patch failed.  |
   | -1 :x: |  javac  |   0m 33s | 
[/patch-compile-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/patch-compile-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project in the patch failed.  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 50s |  |  the patch passed  |
   | -1 :x: |  mvnsite  |   0m 34s | 
[/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in the patch failed.  |
   | -1 :x: |  javadoc  |   0m 29s | 
[/results-javadoc-javadoc-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/results-javadoc-javadoc-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-project_hadoop-hdfs-client generated 1 new + 0 unchanged - 0 
fixed = 1 total (was 0)  |
   | -1 :x: |  spotbugs  |   0m 33s | 
[/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in the patch failed.  |
   | -1 :x: |  shadedclient  |   6m 49s |  |  patch has errors when building 
and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |   0m 34s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in the patch failed.  |
   | -1 :x: |  unit  | 190m 53s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 284m 11s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestBPOfferService |
   |   | hadoop.hdfs.TestDistributedFileSystem |
   |   | hadoop.hdfs.server.balancer.TestBalancer |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6613/1/artifact/out/Dockerfile
 |
   | GITHUB 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824243#comment-17824243
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6612:
URL: https://github.com/apache/hadoop/pull/6612#issuecomment-1982297366

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.3 Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 46s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 20s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   2m 20s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 38s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 34s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 36s |  |  branch-3.3 passed  |
   | -1 :x: |  spotbugs  |   1m 28s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/1/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  22m 11s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 48s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 11s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 11s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 32s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 249 unchanged - 3 fixed = 249 total (was 252)  |
   | +1 :green_heart: |  mvnsite  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 16s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 57s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 46s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 172m  4s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 31s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 276m 34s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6612 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 2bf19f76095a 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / f4fad11cc92defc7568c0658648dfc04899ff180 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/1/testReport/ |
   | Max. process+thread count | 4872 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client 
hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6612/1/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824238#comment-17824238
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg opened a new pull request, #6614:
URL: https://github.com/apache/hadoop/pull/6614

   
   
   ### Description of PR
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824216#comment-17824216
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg opened a new pull request, #6613:
URL: https://github.com/apache/hadoop/pull/6613

   …#6566)
   
   
   
   ### Description of PR
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824209#comment-17824209
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg opened a new pull request, #6612:
URL: https://github.com/apache/hadoop/pull/6612

   
   
   
   
   ### Description of PR
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread Shilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824205#comment-17824205
 ] 

Shilun Fan commented on HDFS-17299:
---

Don't worry, hadoop-3.4.0-RC3 is a release packaged using the tag 
[release-3.4.0-RC3|https://github.com/apache/hadoop/releases/tag/release-3.4.0-RC3], 
and this commit has no impact on hadoop-3.4.0-RC3. We usually backport to 
branch-3.4.


[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824197#comment-17824197
 ] 

Rushabh Shah commented on HDFS-17299:
-

[~slfan1989] By mistake, I cherry-picked this change to branch-3.4.0. I was not 
aware that you were preparing an RC from this branch.
This is the commit: 
https://github.com/apache/hadoop/commit/f62f116b75dc999c2d4a1824cf099c34e64e74aa


[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824194#comment-17824194
 ] 

Rushabh Shah commented on HDFS-17299:
-

Thank you [~gargrite] for the PR.

The patch applied cleanly to trunk, branch-3.4 and branch-3.4.0.
There are some conflicts applying it to branch-3.3, branch-3.2 and branch-2.10. 
Can you please create PRs for these branches?
Thank you.


[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824176#comment-17824176
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 merged PR #6566:
URL: https://github.com/apache/hadoop/pull/6566





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824086#comment-17824086
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1981347404

   Will merge the PR later today. FYI.





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823916#comment-17823916
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1980368491

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 47s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 18s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  37m  2s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m  6s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   5m 54s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 34s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 17s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   2m 21s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 40s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/21/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  41m  8s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  41m 30s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 55s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   5m 55s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 42s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 42s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m 22s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 243 unchanged - 3 fixed = 243 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m  3s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   2m  5s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 58s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  41m 11s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 22s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 257m  8s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/21/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 45s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 449m  3s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/21/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6566 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823823#comment-17823823
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1979910458

   > There are some new checkstyle issues in the CI results. Could you fix 
them?
   
   They should be fixed now.




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=true ugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> 
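
For context, the 20.5-minute detection window quoted in the description follows from the NameNode's heartbeat-expiry formula, roughly 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval. A minimal sketch of that arithmetic, assuming a 600000 ms recheck interval and a 3 second heartbeat interval (the class name below is illustrative only):
{noformat}
// Minimal sketch of the NameNode dead-node detection window, assuming
// dfs.namenode.heartbeat.recheck-interval = 600000 ms and
// dfs.heartbeat.interval = 3 s, as described above.
public class HeartbeatExpirySketch {
  public static void main(String[] args) {
    long recheckIntervalMs = 600_000L;   // dfs.namenode.heartbeat.recheck-interval
    long heartbeatIntervalMs = 3_000L;   // dfs.heartbeat.interval (3 seconds)
    // The NameNode declares a DataNode dead after roughly
    // 2 * recheck-interval + 10 * heartbeat-interval.
    long expiryMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
    System.out.println(expiryMs + " ms ~= " + (expiryMs / 60000.0) + " minutes");
    // Prints: 1230000 ms ~= 20.5 minutes
  }
}
{noformat}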

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823820#comment-17823820
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

tasanuma commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1979904394

   There are some new checkstyle issues in the CI results. Could you fix 
them?





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823736#comment-17823736
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1979352026

   1. 
[spotbugs](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 warning is not related to the patch, so I am ignoring it for now.
   2. There are 3 test failures; 2 of them 
([TestBlockListAsLongs.testFuzz](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/testReport/junit/org.apache.hadoop.hdfs.protocol/TestBlockListAsLongs/testFuzz/)
 and 
[TestLargeBlockReport.testBlockReportSucceedsWithLargerLengthLimit](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestLargeBlockReport/testBlockReportSucceedsWithLargerLengthLimit/))
 are consistently failing in nightly builds.
   3. The third test failure 
([TestDFSAdmin.testDecommissionDataNodesReconfig](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/testReport/junit/org.apache.hadoop.hdfs.tools/TestDFSAdmin/testDecommissionDataNodesReconfig/))
 is flaky. Created 
[HDFS-17409](https://issues.apache.org/jira/browse/HDFS-17409) for further 
investigation.
   
   This PR is ready for review again. All the comments have been addressed by 
@ritegarg. If there are no more comments by EOD tomorrow, I will merge this PR. 
Cc @ayushtkn @tasanuma 
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823715#comment-17823715
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1513209226


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientExcludedNodes.java:
##
@@ -89,6 +89,10 @@ public void testExcludedNodesForgiveness() throws IOException {
 conf.setLong(
 HdfsClientConfigKeys.Write.EXCLUDE_NODES_CACHE_EXPIRY_INTERVAL_KEY,
 2500);
+// Set min replication for blocks to be written as 1.
+conf.setInt(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.MIN_REPLICATION,
+1);

Review Comment:
   Fixed this behavior by adding try/catch to 
DataStreamer.setupPipelineForCreate
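
The test change above lowers the client's minimum pipeline replication so a write can succeed with a single surviving datanode. A minimal sketch of the same configuration outside the test harness, assuming only the HdfsClientConfigKeys constant shown in the diff (the wrapper class and method names here are illustrative, not part of the patch):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys;

public class MinReplicationConfigSketch {
  // Sketch: build a client Configuration that allows the write pipeline to
  // proceed with one surviving replica instead of failing the create outright.
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.setInt(
        HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.MIN_REPLICATION,
        1);
    return conf;
  }
}
{noformat}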






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823631#comment-17823631
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

tasanuma commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1512884161


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1817,10 +1839,10 @@ protected LocatedBlock nextBlockOutputStream() throws IOException {
   nodes = lb.getLocations();
   nextStorageTypes = lb.getStorageTypes();
   nextStorageIDs = lb.getStorageIDs();
-
+  setPipeline(lb);
   // Connect to first DataNode in the list.
   success = createBlockOutputStream(nodes, nextStorageTypes, nextStorageIDs,
-  0L, false);
+  0L, false) || setupPipelineForAppendOrRecovery();

Review Comment:
   I haven't looked into the PR in detail, but it makes sense to me that 
PIPELINE_SETUP_CREATE should also consider the 
`dtpReplaceDatanodeOnFailureReplication`. If I understand correctly, this 
change won't affect users who have set 
`dfs.client.block.write.replace-datanode-on-failure.min-replication(=dtpReplaceDatanodeOnFailureReplication)=0`,
 which is the default setting, so I think it's fairly safe.
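
For readers skimming the diff, the effect of `createBlockOutputStream(...) || setupPipelineForAppendOrRecovery()` is that a failed first connection during block creation now falls through to the same recovery routine used for appends. A simplified, self-contained sketch of that control flow (the class and its simulated methods are hypothetical stand-ins, not the real DataStreamer API):
{noformat}
// Illustrative only: mirrors the shape of the DataStreamer change discussed above.
// connectToFirstDatanode() and recoverPipeline() are hypothetical stand-ins for
// createBlockOutputStream() and setupPipelineForAppendOrRecovery().
public class PipelineSetupSketch {

  private final boolean firstAttemptFails = true; // pretend the first datanode is unreachable

  boolean connectToFirstDatanode() {
    return !firstAttemptFails;                    // simulated connection failure
  }

  boolean recoverPipeline() {
    // In the real client this path can replace failed datanodes, subject to the
    // replace-datanode-on-failure policy and its min-replication setting.
    return true;
  }

  boolean setUpNewBlock() {
    // Old behavior: give up if the first connection fails.
    // New behavior sketched here: fall back to the recovery path instead.
    return connectToFirstDatanode() || recoverPipeline();
  }

  public static void main(String[] args) {
    System.out.println("block setup succeeded: " + new PipelineSetupSketch().setUpNewBlock());
  }
}
{noformat}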






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823591#comment-17823591
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1978665926

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m  0s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  20m 20s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   2m 56s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   2m 48s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  6s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 37s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   1m 26s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  21m 59s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  22m 13s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 22s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  7s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 49s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   2m 49s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 42s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   2m 42s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 36s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 5 new + 243 unchanged - 2 fixed = 
248 total (was 245)  |
   | +1 :green_heart: |  mvnsite  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 27s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m  5s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 26s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 49s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 199m  4s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 29s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 307m 29s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/20/artifact/out/Dockerfile
 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823399#comment-17823399
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1977869072

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m  2s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m  5s |  |  Maven dependency ordering for branch  |
   | -1 :x: |  mvninstall  |   0m 25s | 
[/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-mvninstall-root.txt)
 |  root in trunk failed.  |
   | -1 :x: |  compile  |   0m 25s | 
[/branch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs-project in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  compile  |   0m 23s | 
[/branch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs-project in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -0 :warning: |  checkstyle  |   0m 21s | 
[/buildtool-branch-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/buildtool-branch-checkstyle-hadoop-hdfs-project.txt)
 |  The patch fails to run checkstyle in hadoop-hdfs-project  |
   | -1 :x: |  mvnsite  |   0m 23s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in trunk failed.  |
   | -1 :x: |  mvnsite  |   0m 23s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | -1 :x: |  javadoc  |   0m 21s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs-client in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  javadoc  |   0m 22s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  javadoc  |   0m 23s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs-client in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -1 :x: |  javadoc  |   0m 22s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -1 :x: |  spotbugs  |   0m 23s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/18/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in trunk failed.  |
   | -1 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823377#comment-17823377
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1977657875

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 55s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |   4m  2s |  |  Maven dependency ordering for branch  |
   | -1 :x: |  mvninstall  |   0m 34s | 
[/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-mvninstall-root.txt)
 |  root in trunk failed.  |
   | -1 :x: |  compile  |   2m 25s | 
[/branch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs-project in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  compile  |   0m 23s | 
[/branch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs-project in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -0 :warning: |  checkstyle  |   0m 21s | 
[/buildtool-branch-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/buildtool-branch-checkstyle-hadoop-hdfs-project.txt)
 |  The patch fails to run checkstyle in hadoop-hdfs-project  |
   | -1 :x: |  mvnsite  |   0m 59s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | -1 :x: |  mvnsite  |   0m 24s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in trunk failed.  |
   | -1 :x: |  javadoc  |   0m 40s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  javadoc  |   0m 23s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs-client in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  javadoc  |   0m 24s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -1 :x: |  spotbugs  |   0m 14s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | -1 :x: |  spotbugs  |   3m 32s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/16/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | -1 :x: |  shadedclient  |   4m  8s |  |  branch has errors when building 
and testing our client artifacts. 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823373#comment-17823373
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1977608476

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m  0s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m  6s |  |  Maven dependency ordering for branch  |
   | -1 :x: |  mvninstall  |  42m 15s | 
[/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-mvninstall-root.txt)
 |  root in trunk failed.  |
   | -1 :x: |  compile  |   2m 39s | 
[/branch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs-project in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  compile  |   8m 32s | 
[/branch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs-project in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | +1 :green_heart: |  checkstyle  |   2m 13s |  |  trunk passed  |
   | -1 :x: |  mvnsite  |   0m 30s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | -1 :x: |  javadoc  |   0m 31s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs-client in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  javadoc  |   0m 30s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-hdfs in trunk failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  javadoc  |   0m 33s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -1 :x: |  spotbugs  |   0m 27s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in trunk failed.  |
   | -1 :x: |  spotbugs  |   0m 24s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | +1 :green_heart: |  shadedclient  |   8m 39s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |   9m  7s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 17s |  |  Maven dependency ordering for patch  |
   | -1 :x: |  mvninstall  |   0m 16s | 
[/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/15/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823336#comment-17823336
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ayushtkn commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1511777291


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1817,10 +1839,10 @@ protected LocatedBlock nextBlockOutputStream() throws IOException {
   nodes = lb.getLocations();
   nextStorageTypes = lb.getStorageTypes();
   nextStorageIDs = lb.getStorageIDs();
-
+  setPipeline(lb);
   // Connect to first DataNode in the list.
   success = createBlockOutputStream(nodes, nextStorageTypes, nextStorageIDs,
-  0L, false);
+  0L, false) || setupPipelineForAppendOrRecovery();

Review Comment:
   Hmm, to me it looks fine; I am not sure about any legacy reasons or any 
use case that could break because of this. If other folks are OK with it, it should be fine.
   
   @Hexiaoqiao / @tasanuma, mind giving it a check?






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823254#comment-17823254
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1511445056


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1817,10 +1839,10 @@ protected LocatedBlock nextBlockOutputStream() throws IOException {
   nodes = lb.getLocations();
   nextStorageTypes = lb.getStorageTypes();
   nextStorageIDs = lb.getStorageIDs();
-
+  setPipeline(lb);
   // Connect to first DataNode in the list.
   success = createBlockOutputStream(nodes, nextStorageTypes, nextStorageIDs,
-  0L, false);
+  0L, false) || setupPipelineForAppendOrRecovery();

Review Comment:
   > If I catch it right, it is a behaviour change, right?
   
   I agree this is a behavior change, but the current behavior is buggy. The 
`dtpReplaceDatanodeOnFailureReplication` configuration is NOT honored during 
the `PIPELINE_SETUP_CREATE` phase; that is why this jira was created. 
   After this patch, we will handle datanode failure during 
`PIPELINE_SETUP_CREATE` in the same way as in the `DATA_STREAMING` phase, and the 
application (in this case HBase) will experience consistent behavior from the hdfs 
client.
   Cc @ritegarg  @ayushtkn 
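
As background for the phases discussed here, the client's tolerance for losing datanodes mid-write is steered by the replace-datanode-on-failure settings. A hedged example of setting the property quoted earlier in this thread by its string key (the key name is taken from the thread; the value 1 is only an illustration, not a recommendation):
{noformat}
import org.apache.hadoop.conf.Configuration;

public class ReplaceDatanodeOnFailureExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Property named earlier in the thread; 0 is described there as the default.
    // Setting it to 1 asks the client to keep writing as long as at least one
    // replica in the pipeline is still alive.
    conf.setInt("dfs.client.block.write.replace-datanode-on-failure.min-replication", 1);
    System.out.println(
        conf.get("dfs.client.block.write.replace-datanode-on-failure.min-replication"));
  }
}
{noformat}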






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823013#comment-17823013
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1975518985

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  17m 19s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 56s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  36m 26s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m 52s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   6m 51s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 48s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 33s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 15s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   2m 20s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 41s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/14/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  40m 36s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  40m 59s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 59s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   5m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 38s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 38s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 18s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/14/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 5 new + 244 unchanged - 2 fixed = 
249 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m  5s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 33s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   2m  5s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 59s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  40m 29s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 22s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 249m 35s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/14/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 459m 20s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822973#comment-17822973
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1510344410


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripedDataStreamer.java:
##
@@ -111,6 +111,7 @@ protected LocatedBlock nextBlockOutputStream() throws 
IOException {
   final DatanodeInfo badNode = nodes[getErrorState().getBadNodeIndex()];
   LOG.warn("Excluding datanode " + badNode);
   excludedNodes.put(badNode, badNode);
+  setPipeline(null, null, null);

Review Comment:
   Updated





> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822974#comment-17822974
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1510344496


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -414,6 +414,10 @@ synchronized void markFirstNodeIfNotMarked() {
 }
 
 synchronized void adjustState4RestartingNode() {
+  if (restartingNodeIndex == -1) {
+return;
+  }
+

Review Comment:
   Thanks for your commit, resolving this






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822881#comment-17822881
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1510154387


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -414,6 +414,10 @@ synchronized void markFirstNodeIfNotMarked() {
 }
 
 synchronized void adjustState4RestartingNode() {
+  if (restartingNodeIndex == -1) {
+return;
+  }
+

Review Comment:
   
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6605/2/testReport/ 
Still got some failures with Iterables approach






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822786#comment-17822786
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1974733389

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 47s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 38s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  36m  8s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m  4s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   5m 50s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 20s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 16s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 40s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/13/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  39m 59s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  40m 20s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 57s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 55s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   5m 55s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 39s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 39s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 19s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/13/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 5 new + 244 unchanged - 2 fixed = 
249 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m  2s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m  0s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  42m 22s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 20s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 253m 54s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/13/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 444m 43s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822762#comment-17822762
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ayushtkn commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1509893467


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -414,6 +414,10 @@ synchronized void markFirstNodeIfNotMarked() {
 }
 
 synchronized void adjustState4RestartingNode() {
+  if (restartingNodeIndex == -1) {
+return;
+  }
+

Review Comment:
   Thanx, I checked the PR, the failure is due to an ArrayIndexOutOfBoundsException in our new code
   ```
   Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1848)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:751)
   ```
   
   Because we need the badNode and we are extracting it like:
   ```
   final DatanodeInfo badNode = nodes[errorState.getBadNodeIndex()];
   ```
   The badNodeIndex is reset to -1 after this check, but the node is stored in 
the ``failed`` ArrayList. Can you extract the badNode from the ``failed`` 
ArrayList rather than tweaking this logic?
   
   Maybe something like this
   ```
   final DatanodeInfo badNode = Iterables.getLast(failed);
   ```
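   
   For illustration only, a hedged sketch of how that extraction might look at 
the failure-handling site; the ``failed`` list, ``excludedNodes`` map and 
Guava's ``Iterables`` helper are taken from the discussion above, not from the 
committed patch:
   ```
   // Hypothetical sketch, not the actual patch: once badNodeIndex has been
   // reset to -1, nodes[errorState.getBadNodeIndex()] throws
   // ArrayIndexOutOfBoundsException, so take the node that just failed from
   // the `failed` list instead.
   if (!failed.isEmpty()) {
     final DatanodeInfo badNode = Iterables.getLast(failed);
     LOG.warn("Excluding datanode " + badNode);
     excludedNodes.put(badNode, badNode);
   }
   ```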






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822759#comment-17822759
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ayushtkn commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1509891173


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripedDataStreamer.java:
##
@@ -111,6 +111,7 @@ protected LocatedBlock nextBlockOutputStream() throws 
IOException {
   final DatanodeInfo badNode = nodes[getErrorState().getBadNodeIndex()];
   LOG.warn("Excluding datanode " + badNode);
   excludedNodes.put(badNode, badNode);
+  setPipeline(null, null, null);

Review Comment:
   That is in {{DataStreamer.java}} right? Is it being called for 
StripedDataStreamer.java as well?






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822735#comment-17822735
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1509773824


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -2259,4 +2282,4 @@ public String toString() {
 return extendedBlock == null ?
 "block==null" : "" + extendedBlock.getLocalBlock();
   }
-}
+}

Review Comment:
   Fixed





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822734#comment-17822734
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1509772592


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripedDataStreamer.java:
##
@@ -111,6 +111,7 @@ protected LocatedBlock nextBlockOutputStream() throws 
IOException {
   final DatanodeInfo badNode = nodes[getErrorState().getBadNodeIndex()];
   LOG.warn("Excluding datanode " + badNode);
   excludedNodes.put(badNode, badNode);
+  setPipeline(null, null, null);

Review Comment:
   The pipeline that is set here is used in setupPipelineForAppendOrRecovery, 
which will now be used in this flow. If that attempt also fails, we set the 
pipeline to null to clear the existing state. 
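   
   A minimal sketch of that reset, assuming the three-argument 
setPipeline(DatanodeInfo[] nodes, StorageType[] storageTypes, String[] 
storageIDs) overload this thread refers to (illustration only, not the 
verbatim patch):
   ```
   // After recording the bad node, drop the stale pipeline so a later
   // setupPipelineForAppendOrRecovery() attempt starts from a clean slate
   // instead of reusing nodes that are already known to be bad.
   excludedNodes.put(badNode, badNode);
   setPipeline(null, null, null); // nodes, storageTypes, storageIDs all cleared
   ```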






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822716#comment-17822716
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1509591909


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -414,6 +414,10 @@ synchronized void markFirstNodeIfNotMarked() {
 }
 
 synchronized void adjustState4RestartingNode() {
+  if (restartingNodeIndex == -1) {
+return;
+  }
+

Review Comment:
   In the past, some tests were failing in the absence of this check (I lost 
track of that PR). I started a new PR to check the exact failures: 
https://github.com/apache/hadoop/pull/6605/files






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822710#comment-17822710
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ayushtkn commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1509482598


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -414,6 +414,10 @@ synchronized void markFirstNodeIfNotMarked() {
 }
 
 synchronized void adjustState4RestartingNode() {
+  if (restartingNodeIndex == -1) {
+return;
+  }
+

Review Comment:
   Why is this needed?
   There is logic below; doesn't that take care of things?
   ```
 if (!isRestartingNode()) {
   error = ErrorType.NONE;
 }
 badNodeIndex = -1;
   }
   ```
   
None of your tests fails without this for me



##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -2259,4 +2282,4 @@ public String toString() {
 return extendedBlock == null ?
 "block==null" : "" + extendedBlock.getLocalBlock();
   }
-}
+}

Review Comment:
   nit
   unrelated change, pls avoid



##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -2651,5 +2653,147 @@ public void 
testNameNodeCreateSnapshotTrashRootOnStartup()
 }
   }
 
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() 
throws Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 2);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+}
+  }
+
+  @Test
+  public void 
testSingleRackFailureDuringPipelineSetupMinReplicationImpossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 3);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  boolean threw = false;
+  try {
+DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  } catch (IOException e) {
+// success
+threw = true;
+  }
+  assertTrue("Failed to throw IOE when creating a file with less "
+  + "DNs than required for min replication", threw);

Review Comment:
   Use ``LambdaTestUtils.intercept``
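   
   For reference, a hedged sketch of that rewrite, assuming the 
org.apache.hadoop.test.LambdaTestUtils.intercept overload that accepts a void 
lambda is available on the test classpath:
   ```
   // Hypothetical replacement for the try/catch/assertTrue pattern above:
   // intercept() fails the test unless an IOException is thrown.
   LambdaTestUtils.intercept(IOException.class,
       () -> DFSTestUtil.createFile(fs, new Path("/testFile"),
           1024L, (short) 3, 1024L));
   ```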



##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -2651,5 +2653,147 @@ public void 
testNameNodeCreateSnapshotTrashRootOnStartup()
 }
   }
 
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() 
throws Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 2);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822394#comment-17822394
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1972572072

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 46s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m  0s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  39m 53s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m 15s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   6m  3s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 29s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 23s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 12s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 54s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/12/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  44m 31s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  44m 57s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   1m  5s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m 59s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   6m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m 19s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   6m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 28s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/12/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 3 new + 244 unchanged - 2 fixed = 
247 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m 23s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 43s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m 50s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  44m  5s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 28s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 266m 49s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/12/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 44s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 472m  0s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822364#comment-17822364
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1972468090

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 47s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 43s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  37m  8s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m  4s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   5m 44s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 18s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 20s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 38s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/11/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  40m 58s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  41m 19s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m 12s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   6m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m  0s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   6m  0s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 22s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/11/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 3 new + 244 unchanged - 2 fixed = 
247 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m  7s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m  3s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  42m 37s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 27s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 262m 34s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/11/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 56s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 456m 47s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822316#comment-17822316
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1508255658


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -2651,5 +2653,154 @@ public void 
testNameNodeCreateSnapshotTrashRootOnStartup()
 }
   }
 
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() 
throws Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 2);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  cluster.shutdown(true);
+}
+  }
+
+  @Test
+  public void 
testSingleRackFailureDuringPipelineSetupMinReplicationImpossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 3);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  boolean threw = false;
+  try {
+DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  } catch (IOException e) {
+// success
+threw = true;
+  } finally {
+cluster.shutdown(true);
+  }
+  assertTrue("Failed to throw IOE when creating a file with less "
+  + "DNs than required for min replication", threw);
+}
+  }
+
+  @Test
+  public void 
testMultipleRackFailureDuringPipelineSetupMinReplicationPossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 1);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill 2 DN, so only 1 racks stays with active DN
+  cluster.stopDataNode(0);
+  cluster.stopDataNode(1);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  cluster.shutdown(true);

Review Comment:
   Try-with-resources handles that; we don't need cluster.shutdown() here. See 
[AutoCloseable](https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html) 
and 
[close](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java#L3564). 
I removed it from the tests that I added.
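
For readers skimming the thread, here is a minimal, self-contained sketch of the pattern the reviewer is pointing at. It assumes the Hadoop test classpath and that MiniDFSCluster implements AutoCloseable with close() delegating to shutdown() (per the links above), so the explicit cluster.shutdown(true) at the end of the try block is redundant. The class name is invented for this note, and new Configuration() stands in for the test's getTestConfiguration(); the placement-policy and min-replication settings are copied from the quoted test.

{code:java}
// Illustrative sketch only (not from the patch): MiniDFSCluster is AutoCloseable,
// and close() calls shutdown(), so try-with-resources makes an explicit
// cluster.shutdown(true) at the end of the block unnecessary.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys;
import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy;
import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant;
import org.junit.Test;

public class TryWithResourcesClusterSketch {
  @Test
  public void createFileWithOneRackDown() throws Exception {
    Configuration conf = new Configuration();
    // Same settings as the quoted test: rack-fault-tolerant placement,
    // no datanode replacement, and a minimum pipeline replication of 2.
    conf.setClass(DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
        BlockPlacementPolicyRackFaultTolerant.class, BlockPlacementPolicy.class);
    conf.setBoolean(
        HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY, false);
    conf.setInt(
        HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.MIN_REPLICATION, 2);
    // 3 racks, 1 datanode per rack.
    try (MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(3)
        .racks(new String[] {"/rack1", "/rack2", "/rack3"})
        .build()) {
      cluster.waitClusterUp();
      DistributedFileSystem fs = cluster.getFileSystem();
      cluster.stopDataNode(0);  // one rack now has no live datanode
      DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 1024L);
    }  // cluster shuts down here automatically; no cluster.shutdown(true) needed
  }
}
{code}

The try-with-resources form also tears the cluster down when createFile throws, which the explicit shutdown-at-the-end variant does not.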





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822317#comment-17822317
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1508255776


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1618,33 +1625,47 @@ private void setupPipelineForAppendOrRecovery() throws 
IOException {
   LOG.warn(msg);
   lastException.set(new IOException(msg));
   streamerClosed = true;
-  return;
+  return false;
 }
-setupPipelineInternal(nodes, storageTypes, storageIDs);
+return setupPipelineInternal(nodes, storageTypes, storageIDs);
   }
 
-  protected void setupPipelineInternal(DatanodeInfo[] datanodes,
+  protected boolean setupPipelineInternal(DatanodeInfo[] datanodes,
   StorageType[] nodeStorageTypes, String[] nodeStorageIDs)
   throws IOException {
 boolean success = false;
 long newGS = 0L;
+boolean isCreateStage = BlockConstructionStage.PIPELINE_SETUP_CREATE == 
stage;
 while (!success && !streamerClosed && dfsClient.clientRunning) {
   if (!handleRestartingDatanode()) {
-return;
+return false;
   }
 
-  final boolean isRecovery = errorState.hasInternalError();
+  final boolean isRecovery = errorState.hasInternalError() && 
!isCreateStage;
+
+
   if (!handleBadDatanode()) {
-return;
+return false;
   }
 
   handleDatanodeReplacement();
 
+  // During create stage, if we remove a node (nodes.length - 1)

Review Comment:
   Updated
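
For context on the hunk being discussed: with the patch, an internal error hit while the streamer is still in the initial PIPELINE_SETUP_CREATE stage is no longer treated as pipeline recovery, presumably because no user data has reached the datanodes yet and the client can simply retry the setup. A toy, self-contained illustration of that predicate follows; Stage and isRecovery() are invented stand-ins for this note, not DataStreamer's real BlockConstructionStage/ErrorState API.

{code:java}
// Toy model of `errorState.hasInternalError() && !isCreateStage` from the hunk above.
// Everything here is illustrative; it is not the actual DataStreamer code.
public final class PipelineRecoveryDemo {

  enum Stage { PIPELINE_SETUP_CREATE, PIPELINE_SETUP_APPEND, DATA_STREAMING }

  // Recovery (which assumes replicas may already hold data) only applies
  // outside the initial create stage.
  static boolean isRecovery(boolean hasInternalError, Stage stage) {
    return hasInternalError && stage != Stage.PIPELINE_SETUP_CREATE;
  }

  public static void main(String[] args) {
    // During create: not a recovery, so the streamer can retry pipeline setup afresh.
    System.out.println(isRecovery(true, Stage.PIPELINE_SETUP_CREATE)); // false
    // Mid-stream: treat it as recovery of the existing pipeline.
    System.out.println(isRecovery(true, Stage.DATA_STREAMING));        // true
  }
}
{code}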






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822314#comment-17822314
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1508253750


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -2651,5 +2653,154 @@ public void 
testNameNodeCreateSnapshotTrashRootOnStartup()
 }
   }
 
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() 
throws Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 2);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  cluster.shutdown(true);

Review Comment:
   Try-with-resources handles that; we don't need cluster.shutdown() here. See 
[AutoCloseable](https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html) 
and 
[close](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java#L3564).






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822312#comment-17822312
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1508229688


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1618,33 +1625,47 @@ private void setupPipelineForAppendOrRecovery() throws 
IOException {
   LOG.warn(msg);
   lastException.set(new IOException(msg));
   streamerClosed = true;
-  return;
+  return false;
 }
-setupPipelineInternal(nodes, storageTypes, storageIDs);
+return setupPipelineInternal(nodes, storageTypes, storageIDs);
   }
 
-  protected void setupPipelineInternal(DatanodeInfo[] datanodes,
+  protected boolean setupPipelineInternal(DatanodeInfo[] datanodes,
   StorageType[] nodeStorageTypes, String[] nodeStorageIDs)
   throws IOException {
 boolean success = false;
 long newGS = 0L;
+boolean isCreateStage = BlockConstructionStage.PIPELINE_SETUP_CREATE == 
stage;
 while (!success && !streamerClosed && dfsClient.clientRunning) {
   if (!handleRestartingDatanode()) {
-return;
+return false;
   }
 
-  final boolean isRecovery = errorState.hasInternalError();
+  final boolean isRecovery = errorState.hasInternalError() && 
!isCreateStage;
+
+
   if (!handleBadDatanode()) {
-return;
+return false;
   }
 
   handleDatanodeReplacement();
 
+  // During create stage, if we remove a node (nodes.length - 1)

Review Comment:
   I think this comment needs to be updated.



##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -2651,5 +2653,154 @@ public void 
testNameNodeCreateSnapshotTrashRootOnStartup()
 }
   }
 
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() 
throws Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 2);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  cluster.shutdown(true);
+}
+  }
+
+  @Test
+  public void 
testSingleRackFailureDuringPipelineSetupMinReplicationImpossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 3);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  boolean threw = false;
+  try {
+DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  } catch (IOException e) {
+// success
+threw = true;
+  } finally {
+cluster.shutdown(true);
+  }
+  assertTrue("Failed to throw IOE when creating a file with less "
+  + "DNs than required for min replication", threw);
+}
+  }
+
+  @Test
+  public void 
testMultipleRackFailureDuringPipelineSetupMinReplicationPossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821396#comment-17821396
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg closed pull request #6556: HDFS-17299. Adding rack failure tolerance 
when creating a new file
URL: https://github.com/apache/hadoop/pull/6556




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead (see 
> the sketch after this quoted description).
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still alive to the namenode, 
> the client gets different datanodes, but all of them are still in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=true  ugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> 
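
As a worked check of the quoted detection time, assuming the NameNode's usual heartbeat-expiry rule of 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval (the former in milliseconds, the latter in seconds), here is a small back-of-the-envelope sketch; treat the formula as an assumption of this note rather than something stated in the patch.

{code:java}
// Back-of-the-envelope check of the ~20.5 minute dead-datanode detection time
// quoted in the issue description above. The 2x + 10x expiry rule is the
// commonly cited NameNode heuristic; treat it as an assumption of this sketch.
public final class DeadNodeDetectionTime {
  public static void main(String[] args) {
    long recheckIntervalMs = 600_000L;       // dfs.namenode.heartbeat.recheck-interval
    long heartbeatIntervalMs = 3 * 1_000L;   // dfs.heartbeat.interval = 3 seconds
    long expiryMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
    // Prints: 1230000 ms ~= 20.5 minutes
    System.out.println(expiryMs + " ms ~= " + (expiryMs / 60_000.0) + " minutes");
  }
}
{code}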

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821141#comment-17821141
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1966180516

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 46s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 40s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  35m 52s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m  1s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   5m 47s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 29s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 19s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 21s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 40s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/10/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  40m  4s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  40m 25s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 30s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  0s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 54s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   5m 54s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 38s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 38s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/10/artifact/out/blanks-eol.txt)
 |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | -0 :warning: |  checkstyle  |   1m 21s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/10/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 17 new + 244 unchanged - 2 fixed = 
261 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m  0s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  5s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 56s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  40m 12s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 22s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 249m  3s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/10/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 437m 32s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820918#comment-17820918
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1965614194

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 48s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 41s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  35m 55s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m  3s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   5m 50s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 29s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 17s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 49s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 15s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 39s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/9/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  41m 59s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  42m 29s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 32s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 31s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m 12s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   6m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 55s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 55s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 19s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/9/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 17 new + 244 unchanged - 2 fixed = 
261 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m  2s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  3s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 57s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  40m 30s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 23s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 256m 54s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/9/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 448m 50s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820792#comment-17820792
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1964720571

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m  0s |  |  Docker mode activated.  |
   | -1 :x: |  patch  |   0m 22s |  |  
https://github.com/apache/hadoop/pull/6566 does not apply to trunk. Rebase 
required? Wrong Branch? See 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.  
|
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | GITHUB PR | https://github.com/apache/hadoop/pull/6566 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/8/console |
   | versions | git=2.34.1 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819990#comment-17819990
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1960990799

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  20m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 45s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  37m 15s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  21m  0s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |  18m 51s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   5m 14s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 26s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 56s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 56s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 41s |  |  branch/hadoop-project no spotbugs 
output file (spotbugsXml.xml)  |
   | -1 :x: |  spotbugs  |   2m 52s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/7/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | -1 :x: |  shadedclient  |   5m 15s |  |  branch has errors when building 
and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 33s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  18m 31s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |  18m 31s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 22s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |  17m 22s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   4m 41s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/7/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 17 new + 244 unchanged - 2 fixed = 261 total (was 
246)  |
   | +1 :green_heart: |  mvnsite  |   3m 31s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m  1s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   2m 53s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 27s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 37s |  |  hadoop-project has no data from 
spotbugs  |
   | +1 :green_heart: |  shadedclient  |  39m  1s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   0m 37s |  |  hadoop-project in the patch 
passed.  |
   | +1 :green_heart: |  unit  |   2m 47s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 275m 42s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 14s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 525m 10s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819858#comment-17819858
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1960584739

   (!) A patch to the testing environment has been detected. 
   Re-executing against the patched versions to perform further tests. 
   The console is at 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/7/console in 
case of problems.
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819825#comment-17819825
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1960345982

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  11m 55s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m  6s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  31m 39s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 21s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |  16m  6s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   4m 22s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 23s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   3m  1s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 28s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 45s |  |  branch/hadoop-project no spotbugs 
output file (spotbugsXml.xml)  |
   | -1 :x: |  spotbugs  |   2m 45s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/6/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | -1 :x: |  shadedclient  |   5m  4s |  |  branch has errors when building 
and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 34s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m  1s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |  17m  1s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  15m 58s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |  15m 58s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/6/artifact/out/blanks-eol.txt)
 |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | -0 :warning: |  checkstyle  |   4m 19s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/6/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 17 new + 243 unchanged - 2 fixed = 260 total (was 
245)  |
   | +1 :green_heart: |  mvnsite  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m  0s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   2m 53s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 39s |  |  hadoop-project has no data from 
spotbugs  |
   | +1 :green_heart: |  shadedclient  |  42m 29s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   0m 33s |  |  hadoop-project in the patch 
passed.  |
   | +1 :green_heart: |  unit  |   3m  0s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  |  17m 33s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/6/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | -1 :x: |  asflicense  |   0m 54s | 
[/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/6/artifact/out/results-asflicense.txt)
 |  The patch generated 44 ASF 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819741#comment-17819741
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1959924123

   (!) A patch to the testing environment has been detected. 
   Re-executing against the patched versions to perform further tests. 
   The console is at 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/6/console in 
case of problems.
   




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters:
> dfs.namenode.heartbeat.recheck-interval: 60
> dfs.heartbeat.interval: 3
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client gets different datanodes on each retry, but all of 
> them are still in the same AZ. See the logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> 
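For context on the "20.5 mins" figure in the quoted description: the NameNode only declares a datanode dead after an expiry window of roughly 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval (recheck interval in milliseconds, heartbeat interval in seconds). The configuration values above are truncated in this archive, so the sketch below assumes a recheck interval of 600,000 ms and a heartbeat interval of 3 s, which is what reproduces the quoted 20.5 minutes; it is a back-of-the-envelope illustration, not code from the patch.

{code:java}
// Hedged sketch: estimate how long the NameNode takes to declare a datanode dead,
// using the usual expiry rule (2 * recheck interval + 10 * heartbeat interval).
// Both configuration values below are assumptions chosen to match the 20.5 minutes
// mentioned in the issue description.
public class HeartbeatExpiryEstimate {
    public static void main(String[] args) {
        long recheckIntervalMs = 600_000L;  // assumed dfs.namenode.heartbeat.recheck-interval (ms)
        long heartbeatIntervalSec = 3L;     // assumed dfs.heartbeat.interval (s)
        long expiryMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalSec * 1000;
        System.out.printf("datanode declared dead after ~%d ms (%.1f minutes)%n",
                expiryMs, expiryMs / 60_000.0);  // ~1230000 ms (20.5 minutes)
    }
}
{code}

During that window the dead datanodes still look healthy to the NameNode, which is exactly the condition the steps to reproduce rely on.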

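To make the failure scenario concrete, here is a minimal, hypothetical sketch of how it could be reproduced against a MiniDFSCluster, with three simulated racks standing in for the three AZs. It is an illustration only, not the attached repro.patch: MiniDFSCluster, DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY and BlockPlacementPolicyRackFaultTolerant are real Hadoop identifiers, while the rack names, file path and data size are made up.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy;
import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant;

public class RackOutageSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new HdfsConfiguration();
        // Rack-fault-tolerant placement, as in the cluster described in the issue.
        conf.setClass(DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
                BlockPlacementPolicyRackFaultTolerant.class, BlockPlacementPolicy.class);

        // Two datanodes in each of three simulated racks standing in for the AZs.
        MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
                .numDataNodes(6)
                .racks(new String[] {"/az1", "/az1", "/az2", "/az2", "/az3", "/az3"})
                .build();
        try {
            cluster.waitActive();
            DistributedFileSystem fs = cluster.getFileSystem();

            // Stop both datanodes in /az3 without waiting for the NameNode to notice,
            // so they are unreachable on the wire but still "alive" in its view.
            cluster.stopDataNode(5);
            cluster.stopDataNode(4);

            // Create a file and hflush it, mimicking an HBase WAL roll. Under the
            // behaviour described above, pipeline setup keeps being handed a dead
            // /az3 target and createBlockOutputStream keeps failing.
            try (FSDataOutputStream out = fs.create(new Path("/test/wal"), (short) 3)) {
                out.write(new byte[1024]);
                out.hflush();
            }
        } finally {
            cluster.shutdown();
        }
    }
}
{code}
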
[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819629#comment-17819629
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1959333777

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 22s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |   5m 14s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  29m 11s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   9m 27s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   8m 42s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   2m 19s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m  5s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 46s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 15s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 33s |  |  branch/hadoop-project no spotbugs 
output file (spotbugsXml.xml)  |
   | -1 :x: |  spotbugs  |   1m 33s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/5/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | -1 :x: |  shadedclient  |   2m 35s |  |  branch has errors when building 
and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 22s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   9m  1s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   9m  1s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m 41s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   8m 41s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   2m  9s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/5/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 17 new + 243 unchanged - 2 fixed = 260 total (was 
245)  |
   | +1 :green_heart: |  mvnsite  |   2m  2s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m  0s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   1m 47s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 24s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 28s |  |  hadoop-project has no data from 
spotbugs  |
   | +1 :green_heart: |  shadedclient  |  22m 45s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   0m 17s |  |  hadoop-project in the patch 
passed.  |
   | +1 :green_heart: |  unit  |   1m 49s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 219m  2s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 48s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 347m 35s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.TestDecommissionWithStripedBackoffMonitor |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819488#comment-17819488
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1958788319

   (!) A patch to the testing environment has been detected. 
   Re-executing against the patched versions to perform further tests. 
   The console is at 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/5/console in 
case of problems.
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819480#comment-17819480
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1958776146

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 34s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m 20s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  31m 46s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 29s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |  16m  5s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   4m 55s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 37s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   3m  0s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 29s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 45s |  |  branch/hadoop-project no spotbugs 
output file (spotbugsXml.xml)  |
   | -1 :x: |  spotbugs  |   2m 45s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/4/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | -1 :x: |  shadedclient  |   4m 59s |  |  branch has errors when building 
and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 35s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 42s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |  16m 42s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |  16m  2s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/4/artifact/out/blanks-eol.txt)
 |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | -0 :warning: |  checkstyle  |   4m 12s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/4/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 17 new + 243 unchanged - 2 fixed = 260 total (was 
245)  |
   | +1 :green_heart: |  mvnsite  |   3m 30s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m  1s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   2m 54s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 30s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 40s |  |  hadoop-project has no data from 
spotbugs  |
   | +1 :green_heart: |  shadedclient  |  36m  5s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   0m 38s |  |  hadoop-project in the patch 
passed.  |
   | +1 :green_heart: |  unit  |   2m 44s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 224m 56s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 13s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 434m 44s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819409#comment-17819409
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1958262110

   (!) A patch to the testing environment has been detected. 
   Re-executing against the patched versions to perform further tests. 
   The console is at 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/4/console in 
case of problems.
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819078#comment-17819078
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1955918130

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m 24s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 56s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  31m 32s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 33s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |  16m  8s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   4m 23s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 29s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   3m  0s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 28s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 44s |  |  branch/hadoop-project no spotbugs 
output file (spotbugsXml.xml)  |
   | -1 :x: |  spotbugs  |   2m 44s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/3/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | -1 :x: |  shadedclient  |   4m 56s |  |  branch has errors when building 
and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 35s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 16s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 40s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |  16m 40s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  15m 51s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |  15m 51s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   4m 17s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/3/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 17 new + 244 unchanged - 2 fixed = 261 total (was 
246)  |
   | +1 :green_heart: |  mvnsite  |   3m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m  1s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   2m 54s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 39s |  |  hadoop-project has no data from 
spotbugs  |
   | +1 :green_heart: |  shadedclient  |  35m 53s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   0m 37s |  |  hadoop-project in the patch 
passed.  |
   | +1 :green_heart: |  unit  |   2m 48s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 269m 58s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 10s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 478m 49s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819040#comment-17819040
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1955745685

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  11m 51s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 41s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  31m 22s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 30s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |  16m  8s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   4m 24s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 28s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 55s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 31s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 44s |  |  branch/hadoop-project no spotbugs 
output file (spotbugsXml.xml)  |
   | -1 :x: |  spotbugs  |   2m 44s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/2/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | -1 :x: |  shadedclient  |   5m  2s |  |  branch has errors when building 
and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 34s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 50s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |  16m 50s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |  16m  2s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   4m 20s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/2/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 16 new + 243 unchanged - 2 fixed = 259 total (was 
245)  |
   | +1 :green_heart: |  mvnsite  |   3m 29s |  |  the patch passed  |
   | +1 :green_heart: |  shellcheck  |   0m  0s |  |  No new issues.  |
   | +1 :green_heart: |  javadoc  |   2m 50s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   3m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +0 :ok: |  spotbugs  |   0m 39s |  |  hadoop-project has no data from 
spotbugs  |
   | +1 :green_heart: |  shadedclient  |  35m 42s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   0m 39s |  |  hadoop-project in the patch 
passed.  |
   | +1 :green_heart: |  unit  |   2m 43s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 225m 33s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 12s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 444m 45s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   
   
   | Subsystem | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818996#comment-17818996
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1955129523

   (!) A patch to the testing environment has been detected. 
   Re-executing against the patched versions to perform further tests. 
   The console is at 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/3/console in 
case of problems.
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818947#comment-17818947
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1954866984

   (!) A patch to the testing environment has been detected. 
   Re-executing against the patched versions to perform further tests. 
   The console is at 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/2/console in 
case of problems.
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818939#comment-17818939
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg opened a new pull request, #6566:
URL: https://github.com/apache/hadoop/pull/6566

   
   
   ### Description of PR
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818084#comment-17818084
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6556:
URL: https://github.com/apache/hadoop/pull/6556#issuecomment-1949219422

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 23s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 18s |  |  Maven dependency ordering for branch  |
   | -1 :x: |  mvninstall  |   0m 21s | 
[/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-mvninstall-root.txt)
 |  root in trunk failed.  |
   | -1 :x: |  compile  |   0m 21s | 
[/branch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-compile-hadoop-hdfs-project-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt)
 |  hadoop-hdfs-project in trunk failed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  compile  |   3m 22s | 
[/branch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-compile-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs-project in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -0 :warning: |  checkstyle  |   0m 19s | 
[/buildtool-branch-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/buildtool-branch-checkstyle-hadoop-hdfs-project.txt)
 |  The patch fails to run checkstyle in hadoop-hdfs-project  |
   | -1 :x: |  mvnsite  |   0m 21s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | -1 :x: |  mvnsite  |   0m 22s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-client.txt)
 |  hadoop-hdfs-client in trunk failed.  |
   | -1 :x: |  javadoc  |   1m 22s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt)
 |  hadoop-hdfs in trunk failed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  javadoc  |   0m 21s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt)
 |  hadoop-hdfs-client in trunk failed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  javadoc  |   0m 21s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -1 :x: |  javadoc  |   0m 22s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs-client in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -1 :x: |  spotbugs  |   0m 21s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6556/2/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | -1 :x: |  spotbugs  |   0m 20s | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817773#comment-17817773
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on PR #6513:
URL: https://github.com/apache/hadoop/pull/6513#issuecomment-1947097875

   Closing this PR in favor of https://github.com/apache/hadoop/pull/6556/files




> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> 
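
For reference, a minimal sketch of where the 20.5-minute detection window quoted above comes from, assuming the usual NameNode heartbeat-expiry formula (2 * recheck-interval + 10 * heartbeat-interval) and a recheck interval of 600000 ms; both numbers are inferred from the figures in the report rather than read from the cluster's configuration.

{code:java}
// Sketch only: derives the dead-node detection window from the two heartbeat
// properties quoted above. The 600000 ms recheck interval is an assumption that
// matches the reported 20.5 minute figure.
public class HeartbeatExpirySketch {
  public static void main(String[] args) {
    long recheckIntervalMs = 600_000L;   // dfs.namenode.heartbeat.recheck-interval (assumed)
    long heartbeatIntervalSec = 3L;      // dfs.heartbeat.interval

    // The NameNode only declares a datanode dead after roughly
    // 2 * recheck-interval + 10 * heartbeat-interval without a heartbeat.
    long expireIntervalMs = 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;

    System.out.println(expireIntervalMs + " ms = " + expireIntervalMs / 60000.0 + " minutes");
    // prints: 1230000 ms = 20.5 minutes
  }
}
{code}

Until that window elapses, the NameNode keeps handing out datanodes from the downed AZ as pipeline targets, which is what the quoted log excerpt shows.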

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817772#comment-17817772
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg closed pull request #6513: HDFS-17299. Adding rack failure tolerance 
when creating a new file
URL: https://github.com/apache/hadoop/pull/6513





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817769#comment-17817769
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg opened a new pull request, #6556:
URL: https://github.com/apache/hadoop/pull/6556

   
   
   ### Description of PR
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   





[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815526#comment-17815526
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6513:
URL: https://github.com/apache/hadoop/pull/6513#issuecomment-1933421772

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 46s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  35m  1s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m 12s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   5m 48s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 28s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 19s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 53s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 19s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m  1s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  40m 13s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   1m  2s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m  0s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   6m  0s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 37s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 37s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 21s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6513/9/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 16 new + 243 unchanged - 2 fixed = 
259 total (was 245)  |
   | +1 :green_heart: |  mvnsite  |   2m  3s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 33s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  7s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m  5s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m 42s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 25s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 257m 45s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6513/9/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 446m  4s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.TestDiskspaceQuotaUpdate |
   |   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
   |   | hadoop.hdfs.TestDFSClientExcludedNodes |
   |   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6513/9/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6513 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 41aa593ac88e 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815033#comment-17815033
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on code in PR #6513:
URL: https://github.com/apache/hadoop/pull/6513#discussion_r1479041865


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1607,8 +1607,11 @@ private void transfer(final DatanodeInfo src, final DatanodeInfo[] targets,
* it can be written to.
* This happens when a file is appended or data streaming fails
* It keeps on trying until a pipeline is setup
+   *
+   * Returns boolean whether pipeline was setup successfully or not.
+   * This boolean is used upstream on whether to continue creating pipeline or throw exception
*/
-  private void setupPipelineForAppendOrRecovery() throws IOException {
+  private boolean setupPipelineForAppendOrRecovery() throws IOException {

Review Comment:
   We are changing the return type of the `setupPipelineForAppendOrRecovery` and 
`setupPipelineInternal` methods.
   IIUC this is the reason: `handleBadDatanode` can silently fail to handle a bad 
datanode 
[here](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1700-L1706)
 and `setupPipelineInternal` will silently return 
[here](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1637-L1638)
 without bubbling up the exception. 
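
A minimal, self-contained sketch (not the actual DataStreamer patch) of the pattern discussed in this comment: the recovery routine reports success through a boolean so the caller can surface a silent failure instead of continuing with a pipeline that was never re-established. Class and method names below are illustrative only.

{code:java}
import java.io.IOException;

// Illustrative sketch of the void -> boolean change; not the real patch.
public class PipelineRecoverySketch {

  // Stand-in for setupPipelineForAppendOrRecovery(): with a void return, a failure
  // handled "silently" inside this method would be invisible to the caller.
  private boolean setupPipelineForRecovery() {
    // pretend the bad datanode could not be handled and no exception was thrown
    return false;
  }

  public void writeBlock() throws IOException {
    if (!setupPipelineForRecovery()) {
      // With a boolean return, the caller can stop and report the failure
      // instead of streaming into a broken pipeline.
      throw new IOException("Failed to set up the block pipeline during create");
    }
    // ... continue streaming packets on the healthy pipeline ...
  }

  public static void main(String[] args) {
    try {
      new PipelineRecoverySketch().writeBlock();
    } catch (IOException e) {
      System.out.println("Surfaced failure: " + e.getMessage());
    }
  }
}
{code}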
   






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814924#comment-17814924
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on code in PR #6513:
URL: https://github.com/apache/hadoop/pull/6513#discussion_r1480341310


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1618,24 +1621,33 @@ private void setupPipelineForAppendOrRecovery() throws IOException {
   LOG.warn(msg);
   lastException.set(new IOException(msg));
   streamerClosed = true;
-  return;
+  return false;
 }
-setupPipelineInternal(nodes, storageTypes, storageIDs);
+return setupPipelineInternal(nodes, storageTypes, storageIDs);
   }
 
-  protected void setupPipelineInternal(DatanodeInfo[] datanodes,
+  protected boolean setupPipelineInternal(DatanodeInfo[] datanodes,
   StorageType[] nodeStorageTypes, String[] nodeStorageIDs)
   throws IOException {
 boolean success = false;
 long newGS = 0L;
+boolean isCreateStage = BlockConstructionStage.PIPELINE_SETUP_CREATE == stage;
 while (!success && !streamerClosed && dfsClient.clientRunning) {
   if (!handleRestartingDatanode()) {
-return;
+return false;
+  }
+
+  final boolean isRecovery = errorState.hasInternalError() && !isCreateStage;
+
+  // During create stage, if we remove a node (nodes.length - 1)
+  //  min replication should still be satisfied.
+  if (isCreateStage && !(dfsClient.dtpReplaceDatanodeOnFailureReplication > 0 &&

Review Comment:
   Consider the case where we are in the PIPELINE_SETUP_CREATE stage but 
isAppend is set to true: then we will not exit early from the 
addDatanode2ExistingPipeline method 
[here](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1489-L1492).
 
   
   Assuming the replication factor is 3, 
`dfs.client.block.write.replace-datanode-on-failure.min-replication` is set to 
3, there is 1 bad node in the pipeline, and there are valid replacement nodes in 
the cluster, this patch will return false early (see the sketch below). 
   I think we should move this check after the `handleDatanodeReplacement` method. 
@ritegarg 
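
A small sketch of the scenario above (numbers and names are illustrative, not taken from the patch): with replication factor 3, min-replication 3, and one bad node, evaluating the constraint before attempting datanode replacement rejects a pipeline that a replacement node could still repair.

{code:java}
// Illustrative arithmetic only; not DataStreamer code.
public class MinReplicationOrderingSketch {
  public static void main(String[] args) {
    int pipelineNodes = 3;      // replication factor 3 (assumed)
    int badNodes = 1;           // one datanode in the downed AZ
    int minReplication = 3;     // ...replace-datanode-on-failure.min-replication (assumed)

    // Check placed BEFORE replacement: only the surviving nodes count.
    int survivors = pipelineNodes - badNodes;                 // 2
    boolean failsBefore = survivors < minReplication;         // true -> give up early

    // Check placed AFTER a successful replacement: a healthy node is back in the pipeline.
    int afterReplacement = survivors + 1;                     // 3
    boolean failsAfter = afterReplacement < minReplication;   // false -> pipeline survives

    System.out.println("fails before replacement: " + failsBefore
        + ", fails after replacement: " + failsAfter);
  }
}
{code}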






[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814919#comment-17814919
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on code in PR #6513:
URL: https://github.com/apache/hadoop/pull/6513#discussion_r1479057807


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1618,24 +1621,33 @@ private void setupPipelineForAppendOrRecovery() throws IOException {
   LOG.warn(msg);
   lastException.set(new IOException(msg));
   streamerClosed = true;
-  return;
+  return false;
 }
-setupPipelineInternal(nodes, storageTypes, storageIDs);
+return setupPipelineInternal(nodes, storageTypes, storageIDs);
   }
 
-  protected void setupPipelineInternal(DatanodeInfo[] datanodes,
+  protected boolean setupPipelineInternal(DatanodeInfo[] datanodes,
   StorageType[] nodeStorageTypes, String[] nodeStorageIDs)
   throws IOException {
 boolean success = false;
 long newGS = 0L;
+boolean isCreateStage = BlockConstructionStage.PIPELINE_SETUP_CREATE == stage;
 while (!success && !streamerClosed && dfsClient.clientRunning) {
   if (!handleRestartingDatanode()) {
-return;
+return false;
+  }
+
+  final boolean isRecovery = errorState.hasInternalError() && !isCreateStage;
+
+  // During create stage, if we remove a node (nodes.length - 1)
+  //  min replication should still be satisfied.
+  if (isCreateStage && !(dfsClient.dtpReplaceDatanodeOnFailureReplication > 0 &&

Review Comment:
   Reason behind adding this check here:
   We are already doing this check in the catch block of the 
`addDatanode2ExistingPipeline` method 
[here](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1528-L1539).
   But when the `isAppend` flag is set to `false` and we are in the 
`PIPELINE_SETUP_CREATE` phase, we exit early from the 
`addDatanode2ExistingPipeline` method 
[here](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1489-L1492).
   Irrespective of the ReplaceDatanodeOnFailure policy, we will NEVER add a new 
datanode to the pipeline during the PIPELINE_SETUP_CREATE stage, and if removing one 
bad datanode is going to violate the 
`dfs.client.block.write.replace-datanode-on-failure.min-replication` property, 
then we should exit early (a minimal sketch of this guard follows below).
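
A minimal sketch of the guard described above, under the assumption that no replacement datanode is ever added during PIPELINE_SETUP_CREATE; identifiers are illustrative and do not mirror the actual patch.

{code:java}
import java.io.IOException;

// Illustrative guard only; names approximate the discussion above.
public class CreateStageGuardSketch {

  static void checkCreateStagePipeline(int pipelineLength, int minReplication,
      boolean isCreateStage) throws IOException {
    // During block create no replacement node will be added, so dropping the one
    // bad node must still leave at least minReplication nodes in the pipeline.
    if (isCreateStage && minReplication > 0 && pipelineLength - 1 < minReplication) {
      throw new IOException("Removing the failed datanode would leave fewer than "
          + minReplication + " replicas during block create");
    }
  }

  public static void main(String[] args) {
    try {
      checkCreateStagePipeline(3, 3, true); // 2 survivors < 3 required -> fail fast
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}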





