[
https://issues.apache.org/jira/browse/HDFS-16064?focusedWorklogId=782169&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-782169
]
ASF GitHub Bot logged work on HDFS-16064:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 16/Jun/22 21:37
Start Date: 16/Jun/22 21:37
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1158158938
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 56s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 39m 25s | | trunk passed |
| +1 :green_heart: | compile | 1m 39s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 31s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 21s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 40s | | trunk passed |
| -1 :x: | javadoc | 1m 20s | [/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt) | hadoop-hdfs in trunk failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1. |
| +1 :green_heart: | javadoc | 1m 44s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 43s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 58s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 30s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 30s | | the patch passed |
| +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 19s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 2s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 100 unchanged - 0 fixed = 101 total (was 100) |
| +1 :green_heart: | mvnsite | 1m 28s | | the patch passed |
| -1 :x: | javadoc | 1m 0s | [/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt) | hadoop-hdfs in the patch failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1. |
| +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 35s | | the patch passed |
| +1 :green_heart: | shadedclient | 26m 0s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 381m 44s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 1s | | The patch does not generate ASF License warnings. |
| | | 498m 26s | | |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4410 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux efcbee072994 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 9dd26601ec0cb25a1de4f772e6bff084141bbfb5 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/testReport/ |
| Max. process+thread count | 1965 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
Issue Time Tracking
-------------------
Worklog Id: (was: 782169)
Time Spent: 1h 20m (was: 1h 10m)
> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
> Key: HDFS-16064
> URL: https://issues.apache.org/jira/browse/HDFS-16064
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 3.2.1
> Reporter: Kevin Wikant
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> It seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as
> a non-issue under the assumption that, if the namenode and a datanode get
> into an inconsistent state for a given block pipeline, another datanode will
> be available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have
> encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
> * there are initially 4 datanodes DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in
> order to satisfy their minimum replication factor of 2
> * during this replication process
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes
> the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
> dst: /DN3:9866;
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum
> replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it
> because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 and repeatedly
> fails; this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum
> replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never be
> completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> {code}
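The replica accounting in the log above can be sketched as follows. This is a hypothetical, simplified model (class, enum, and method names are illustrative, not the actual Hadoop API): a block held on a decommissioning node can only be released once enough replicas exist on non-decommissioning ("live") datanodes, which in this scenario can never happen.

```java
import java.util.List;

// Simplified sketch of the "live replicas vs. expected replicas" check that
// gates decommissioning progress. Hypothetical names; not the Hadoop API.
public class DecommissionCheck {
    enum State { LIVE, DECOMMISSIONING, DECOMMISSIONED }

    // Replicas on decommissioning nodes do not count toward the
    // minimum replication factor.
    static boolean isSufficientlyReplicated(List<State> replicas, int expected) {
        long live = replicas.stream().filter(s -> s == State.LIVE).count();
        return live >= expected;
    }

    public static void main(String[] args) {
        // Scenario from the log: DN1 and DN2 are decommissioning, DN4 is live,
        // and DN3 is missing the block -> 1 live replica vs. 2 expected.
        List<State> stuck = List.of(
                State.DECOMMISSIONING,  // DN1
                State.DECOMMISSIONING,  // DN2
                State.LIVE);            // DN4
        System.out.println(isSufficientlyReplicated(stuck, 2)); // false

        // Had the replication to DN3 succeeded, there would be 2 live replicas
        // and the decommission could complete.
        System.out.println(isSufficientlyReplicated(
                List.of(State.LIVE, State.LIVE), 2)); // true
    }
}
```

Because every replication attempt to DN3 fails with ReplicaAlreadyExistsException, the live count stays at 1 forever, which matches the repeating "live replicas: 1" lines in the BlockStateChange log.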
> Being stuck in the decommissioning state forever is not an intended behavior
> of DataNode decommissioning.
> A few potential solutions:
> * Address the root cause of the problem which is an inconsistent state
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available
> datanodes for satisfying the minimum replication factor, then recover by
> re-enabling the datanodes being decommissioned
>
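The second proposed mitigation (detect a stuck decommission and recover) could look roughly like the watchdog predicate below. This is purely a hypothetical sketch under assumed inputs; none of these names exist in Hadoop, and the real implementation would have to derive "eligible targets" and "stall time" from namenode state.

```java
// Hypothetical watchdog predicate for aborting a stuck decommission:
// the block is under-replicated, no eligible target datanode remains to
// replicate to, and no progress has been made for longer than a timeout.
public class DecommissionWatchdog {
    static boolean shouldAbortDecommission(int liveReplicas,
                                           int expectedReplicas,
                                           int eligibleTargets,
                                           long millisStalled,
                                           long stallTimeoutMillis) {
        boolean underReplicated = liveReplicas < expectedReplicas;
        boolean noWayToRecover = eligibleTargets == 0;
        return underReplicated && noWayToRecover
                && millisStalled >= stallTimeoutMillis;
    }

    public static void main(String[] args) {
        // Scenario from the report: 1 live replica of 2 expected, DN3 is the
        // only candidate target but every transfer to it fails (HDFS-721),
        // and the block has made no progress for an hour.
        System.out.println(shouldAbortDecommission(
                1, 2, 0, 3_600_000L, 1_800_000L)); // true
    }
}
```

On such a detection the admin monitor could transition the affected datanodes back to service (the "re-enable" option above) rather than looping on a replication that can never succeed.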
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]