[
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=769465&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-769465
]
ASF GitHub Bot logged work on HDFS-16456:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 12/May/22 08:14
Start Date: 12/May/22 08:14
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on PR #4304:
URL: https://github.com/apache/hadoop/pull/4304#issuecomment-1124671782
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 11m 3s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ branch-3.3 Compile Tests _ |
| +0 :ok: | mvndep | 15m 42s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 23m 44s | | branch-3.3 passed |
| +1 :green_heart: | compile | 21m 44s | | branch-3.3 passed |
| +1 :green_heart: | checkstyle | 3m 47s | | branch-3.3 passed |
| +1 :green_heart: | mvnsite | 4m 25s | | branch-3.3 passed |
| +1 :green_heart: | javadoc | 4m 33s | | branch-3.3 passed |
| +1 :green_heart: | spotbugs | 6m 51s | | branch-3.3 passed |
| +1 :green_heart: | shadedclient | 28m 26s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 32s | | Maven dependency ordering for patch |
| -1 :x: | mvninstall | 0m 49s | [/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| -1 :x: | compile | 2m 50s | [/patch-compile-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-compile-root.txt) | root in the patch failed. |
| -1 :x: | javac | 2m 50s | [/patch-compile-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-compile-root.txt) | root in the patch failed. |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 2m 29s | | the patch passed |
| -1 :x: | mvnsite | 0m 54s | [/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | javadoc | 3m 13s | | the patch passed |
| -1 :x: | spotbugs | 0m 50s | [/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| -1 :x: | shadedclient | 12m 40s | | patch has errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 17m 52s | | hadoop-common in the patch passed. |
| -1 :x: | unit | 0m 50s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | asflicense | 0m 43s | | The patch does not generate ASF License warnings. |
| | | 167m 55s | | |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4304 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
| uname | Linux 8d3f7941228c 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.3 / c387e506e8d0fb76d000ec507f9b44bcaee4fd69 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/testReport/ |
| Max. process+thread count | 1267 (vs. ulimit of 5500) |
| modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/console |
| versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
This message was automatically generated.
Issue Time Tracking
-------------------
Worklog Id: (was: 769465)
Time Spent: 2.5h (was: 2h 20m)
> EC: Decommission a rack with only one dn will fail when the rack number is
> equal with replication
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch,
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> In the following scenario, decommission fails with the
> TOO_MANY_NODES_ON_RACK reason:
> # Enable an EC policy, such as RS-6-3-1024k.
> # The number of racks in the cluster is equal to or less than the
> replication number (9).
> # A rack has only one DN, and that DN is decommissioned.
> The root cause is in the
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, which
> returns the limit maxNodesPerRack used when choosing targets. In this
> scenario, maxNodesPerRack is 1, which means only one datanode can be chosen
> from each rack.
> {code:java}
> protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>   ...
>   // If more replicas than racks, evenly spread the replicas.
>   // This calculation rounds up.
>   int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>   return new int[] {numOfReplicas, maxNodesPerRack};
> }
> {code}
> The line
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> is reached here with totalNumOfReplicas=9 and numOfRacks=9, which gives
> maxNodesPerRack = 1.
> When we decommission a DN that is the only node in its rack, chooseOnce() in
> BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws
> NotEnoughReplicasException, but the exception is not caught, so the code
> never falls back to the chooseEvenlyFromRemainingRacks() function.
> During decommission, after targets are chosen, the verifyBlockPlacement()
> function reports a total rack count that still includes the invalid rack, so
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false,
> which also causes the decommission to fail.
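The rounding-up division can be checked in isolation. The sketch below is not Hadoop code: the class name MaxNodesPerRackSketch is made up for illustration, and only the formula is taken from the quoted snippet. It compares the limit when the decommissioning rack is still counted (9 racks) with the limit if that rack were excluded (8 racks, an assumed value for this scenario).

```java
final class MaxNodesPerRackSketch {
    // Same integer arithmetic as the quoted line: ceil(totalNumOfReplicas / numOfRacks).
    static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
        return (totalNumOfReplicas - 1) / numOfRacks + 1;
    }

    public static void main(String[] args) {
        // RS-6-3 needs 9 block targets; the rack being decommissioned is still counted.
        System.out.println(maxNodesPerRack(9, 9)); // 1
        // Same 9 targets if only the 8 usable racks were counted (assumed value).
        System.out.println(maxNodesPerRack(9, 8)); // 2
    }
}
```

With a limit of 1 and only 8 racks that can accept a block, 9 targets cannot be placed, which matches the TOO_MANY_NODES_ON_RACK symptom in the description.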
> {code:java}
> public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>     int numberOfReplicas) {
>   if (locs == null)
>     locs = DatanodeDescriptor.EMPTY_ARRAY;
>   if (!clusterMap.hasClusterEverBeenMultiRack()) {
>     // only one rack
>     return new BlockPlacementStatusDefault(1, 1, 1);
>   }
>   // Count locations on different racks.
>   Set<String> racks = new HashSet<>();
>   for (DatanodeInfo dn : locs) {
>     racks.add(dn.getNetworkLocation());
>   }
>   return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>       clusterMap.getNumOfRacks());
> }
> {code}
> {code:java}
> public boolean isPlacementPolicySatisfied() {
>   return requiredRacks <= currentRacks || currentRacks >= totalRacks;
> }
> {code}
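To make the failing check concrete, here is a minimal standalone sketch of the predicate quoted above. The class name PlacementCheckSketch and the sample values are illustrative assumptions, not Hadoop code. The argument order mirrors the constructor call in verifyBlockPlacement() and is assumed to be (currentRacks, requiredRacks, totalRacks).

```java
final class PlacementCheckSketch {
    // Same boolean expression as isPlacementPolicySatisfied() in the quoted snippet.
    static boolean isPlacementPolicySatisfied(int currentRacks, int requiredRacks,
                                              int totalRacks) {
        return requiredRacks <= currentRacks || currentRacks >= totalRacks;
    }

    public static void main(String[] args) {
        // 9 blocks spread over the 8 usable racks, but the total still counts
        // the rack whose only DN is being decommissioned.
        System.out.println(isPlacementPolicySatisfied(8, 9, 9)); // false
        // Same placement once the invalid rack is excluded from the total.
        System.out.println(isPlacementPolicySatisfied(8, 9, 8)); // true
    }
}
```

The second call shows why the third proposed change (excluding invalid racks from the total) would let the check pass in this scenario.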
> Based on the above, the following modifications are needed to fix it:
> # In startDecommission() or stopDecommission(), we should also update
> numOfRacks in class NetworkTopology. Otherwise choosing targets may fail
> because maxNodesPerRack is too small, and even if choosing targets succeeds,
> isPlacementPolicySatisfied() still returns false and the decommission fails.
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first
> chooseOnce() call should also be wrapped in try..catch; otherwise it will not
> fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
> # In verifyBlockPlacement(), we need to exclude invalid racks from the total
> numOfRacks; otherwise isPlacementPolicySatisfied() returns false and data
> reconstruction fails.
>
>
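The try..catch fallback proposed for chooseTargetInOrder() can be illustrated with a small self-contained sketch. Everything here is hypothetical: FallbackSketch, NotEnoughTargets, strictPass() and relaxedPass() are stand-ins invented for illustration and do not reproduce the Hadoop method signatures or the actual patch. The sketch only shows the control flow: a strict first pass that may throw, and a relaxed pass that runs when it does.

```java
final class FallbackSketch {
    /** Hypothetical stand-in for NotEnoughReplicasException. */
    static final class NotEnoughTargets extends Exception {
        NotEnoughTargets(String message) {
            super(message);
        }
    }

    /** Hypothetical strict pass: one target per rack, fails if racks run out. */
    static String strictPass(int targetsNeeded, int usableRacks) throws NotEnoughTargets {
        if (usableRacks < targetsNeeded) {
            throw new NotEnoughTargets("only " + usableRacks + " usable racks");
        }
        return "strict";
    }

    /** Hypothetical relaxed pass that spreads targets over the remaining racks. */
    static String relaxedPass() {
        return "relaxed";
    }

    /** Wraps the first pass in try..catch so a failure falls back instead of escaping. */
    static String choose(int targetsNeeded, int usableRacks) {
        try {
            return strictPass(targetsNeeded, usableRacks);
        } catch (NotEnoughTargets e) {
            return relaxedPass();
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(9, 9)); // strict
        System.out.println(choose(9, 8)); // relaxed
    }
}
```

Without the try..catch around the first pass, the exception in the second call would propagate to the caller and the relaxed pass would never run, which is the behavior the description reports.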