[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2021-03-18 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304646#comment-17304646
 ] 

Wei-Chiu Chuang commented on HDFS-15901:


We have some users running 1000+ node scale clusters, but I don't watch those 
clusters every day, so I am honestly not the best person to ask for opinions 
when it comes to extreme-scale clusters. 

[~hexiaoqiao] or [~ferhui] may have better ideas.

> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we restart the NameNode 
> service, all DataNodes send a full block report to the NameNode. During 
> SafeMode, some DataNodes may send their blocks to the NameNode multiple 
> times, which takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will stay in Safe Mode indefinitely.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(:port, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15901?focusedWorklogId=568792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568792
 ]

ASF GitHub Bot logged work on HDFS-15901:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 05:39
Start Date: 19/Mar/21 05:39
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on a change in pull request #2782:
URL: https://github.com/apache/hadoop/pull/2782#discussion_r597418628



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
##
@@ -2603,6 +2603,24 @@ public long requestBlockReportLeaseId(DatanodeRegistration nodeReg) {
       LOG.warn("Failed to find datanode {}", nodeReg);
       return 0;
     }
+
+    // During safemode, DataNodes are only allowed to report all data once.
+    if (namesystem.isInStartupSafeMode()) {
+      boolean allReported = true;
+      for (DatanodeStorageInfo storageInfo : node.getStorageInfos()) {
+        if (storageInfo.getBlockReportCount() < 1) {
+          allReported = false;
+          break;
+        }
+      }
+
+      if (allReported) {
+        LOG.info("The DataNode has reported all blocks and does not need " +

Review comment:
   nit: printing the datanode id/ip/address would be helpful here.
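   For illustration, a minimal sketch of what such a message could look like; 
the exact wording is an assumption, not the actual patch, and nodeReg is the 
DatanodeRegistration parameter visible in the diff above:

       // Hypothetical log line that includes the datanode identity.
       LOG.info("DataNode {} has already reported all blocks during startup "
           + "safe mode; skipping another full block report.", nodeReg);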




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568792)
Time Spent: 0.5h  (was: 20m)

> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we restart the NameNode 
> service, all DataNodes send a full block report to the NameNode. During 
> SafeMode, some DataNodes may send their blocks to the NameNode multiple 
> times, which takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will stay in Safe Mode indefinitely.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(:port, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15879) Exclude slow nodes when choose targets for blocks

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15879?focusedWorklogId=568790&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568790
 ]

ASF GitHub Bot logged work on HDFS-15879:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 05:35
Start Date: 19/Mar/21 05:35
Worklog Time Spent: 10m 
  Work Description: tasanuma commented on a change in pull request #2748:
URL: https://github.com/apache/hadoop/pull/2748#discussion_r597417442



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestReplicationPolicyExcludeSlowNodes.java
##
@@ -0,0 +1,125 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hdfs.server.blockmanagement;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hdfs.DFSConfigKeys;
+import org.apache.hadoop.hdfs.DFSTestUtil;
+import org.apache.hadoop.hdfs.TestBlockStoragePolicy;
+import org.apache.hadoop.hdfs.server.namenode.NameNode;
+import org.apache.hadoop.net.Node;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Set;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+@RunWith(Parameterized.class)
+public class TestReplicationPolicyExcludeSlowNodes
+    extends BaseReplicationPolicyTest {
+
+  public TestReplicationPolicyExcludeSlowNodes(String blockPlacementPolicy) {
+    this.blockPlacementPolicy = blockPlacementPolicy;
+  }
+
+  @Parameterized.Parameters
+  public static Iterable<Object[]> data() {
+    return Arrays.asList(new Object[][] {
+        { BlockPlacementPolicyDefault.class.getName() },
+        { BlockPlacementPolicyWithUpgradeDomain.class.getName() } });
+  }
+
+  @Override
+  DatanodeDescriptor[] getDatanodeDescriptors(Configuration conf) {
+    conf.setBoolean(DFSConfigKeys
+        .DFS_DATANODE_PEER_STATS_ENABLED_KEY,
+        true);
+    conf.setStrings(DFSConfigKeys
+        .DFS_NAMENODE_SLOWPEER_COLLECT_INTERVAL_KEY,
+        "1s");
+    conf.setBoolean(DFSConfigKeys
+        .DFS_NAMENODE_BLOCKPLACEMENTPOLICY_EXCLUDE_SLOW_NODES_ENABLED_KEY,
+        true);
+    final String[] racks = {
+        "/rack1",
+        "/rack1",
+        "/rack2",
+        "/rack2",
+        "/rack3",
+        "/rack3"};
+    storages = DFSTestUtil.createDatanodeStorageInfos(racks);
+    return DFSTestUtil.toDatanodeDescriptor(storages);
+  }
+
+  /**
+   * Tests chooseTarget() when excludeSlowNodesEnabled is set to true.
+   */
+  @Test
+  public void testChooseTargetExcludeSlowNodes() throws IOException {
+    namenode.getNamesystem().writeLock();
+    try {
+      // add nodes
+      for (int i = 0; i < dataNodes.length; i++) {
+        dnManager.addDatanode(dataNodes[i]);
+      }
+
+      // mock slow nodes
+      SlowPeerTracker tracker = dnManager.getSlowPeerTracker();
+      tracker.addReport(dataNodes[0].getInfoAddr(), dataNodes[3].getInfoAddr());
+      tracker.addReport(dataNodes[0].getInfoAddr(), dataNodes[4].getInfoAddr());
+      tracker.addReport(dataNodes[1].getInfoAddr(), dataNodes[4].getInfoAddr());
+      tracker.addReport(dataNodes[1].getInfoAddr(), dataNodes[5].getInfoAddr());
+      tracker.addReport(dataNodes[2].getInfoAddr(), dataNodes[3].getInfoAddr());
+      tracker.addReport(dataNodes[2].getInfoAddr(), dataNodes[5].getInfoAddr());
+
+      // fetch slow nodes
+      Set<Node> slowPeers = dnManager.getSlowPeers();
+
+      // assert slow nodes
+      assertEquals(3, slowPeers.size());
+      for (int i = 0; i < slowPeers.size(); i++) {
+        assertTrue(slowPeers.contains(dataNodes[i]));
+      }
+
+      // mock writer
+      DatanodeDescriptor writerDn = dataNodes[0];
+
+      // Call chooseTarget()
+      DatanodeStorageInfo[] targets = namenode.getNamesystem().getBlockManager()
+          .getBlockPlacementPolicy().chooseTarget("testFile.txt", 3,
+              writerDn, new ArrayList<DatanodeStorageInfo>(), false, null,
+              1024, TestBlockStoragePolicy.DEFAULT_STORAGE_POLICY, null);
+
+

[jira] [Commented] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread Mingliang Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304642#comment-17304642
 ] 

Mingliang Liu commented on HDFS-15904:
--

Approved the PR and left some minor comments.

Not sure about HBase, but in Hadoop, before merging we only need to set target 
versions for a JIRA. When committing, the committer will set the "Fixed 
Versions" to indicate which branch this patch eventually goes into. Thanks,

> Flaky test TestBalancer#testBalancerWithSortTopNodes()
> --
>
> Key: HDFS-15904
> URL: https://issues.apache.org/jira/browse/HDFS-15904
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestBalancer#testBalancerWithSortTopNodes shows some flakiness in roughly 1 
> in ~10 runs or so. It's reproducible locally as well. Basically, balancing 
> either moves 2 blocks of size 100+100 bytes or it moves 3 blocks of size 
> 100+100+50 bytes (the 2nd case causes the flakiness).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread Mingliang Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304642#comment-17304642
 ] 

Mingliang Liu edited comment on HDFS-15904 at 3/19/21, 5:11 AM:


Approved the PR and left some minor comments.

Not sure about HBase, but in Hadoop, before merging we only need to set target 
versions for a JIRA. When committing, the committer will set the "Fixed 
Versions" to indicate which branch this patch eventually goes into. Thanks,


was (Author: liuml07):
Approved PR and left some minor message.

Not sure about HBase, but in Hadoop, before merging we only need to set target 
versions for a JIRA. When committing, the commuter will set the "Fixed 
Versions" to indicate which branch this patch eventually goes into. Thanks,

> Flaky test TestBalancer#testBalancerWithSortTopNodes()
> --
>
> Key: HDFS-15904
> URL: https://issues.apache.org/jira/browse/HDFS-15904
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestBalancer#testBalancerWithSortTopNodes shows some flakiness in roughly 1 
> in ~10 runs or so. It's reproducible locally as well. Basically, balancing 
> either moves 2 blocks of size 100+100 bytes or it moves 3 blocks of size 
> 100+100+50 bytes (the 2nd case causes the flakiness).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread Mingliang Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu updated HDFS-15904:
-
Fix Version/s: (was: 3.4.0)
   Status: Patch Available  (was: Open)

> Flaky test TestBalancer#testBalancerWithSortTopNodes()
> --
>
> Key: HDFS-15904
> URL: https://issues.apache.org/jira/browse/HDFS-15904
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestBalancer#testBalancerWithSortTopNodes shows some flakiness in roughly 1 
> in ~10 runs or so. It's reproducible locally as well. Basically, balancing 
> either moves 2 blocks of size 100+100 bytes or it moves 3 blocks of size 
> 100+100+50 bytes (the 2nd case causes the flakiness).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568785&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568785
 ]

ASF GitHub Bot logged work on HDFS-15904:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 05:09
Start Date: 19/Mar/21 05:09
Worklog Time Spent: 10m 
  Work Description: liuml07 commented on a change in pull request #2785:
URL: https://github.com/apache/hadoop/pull/2785#discussion_r597410126



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer.java
##
@@ -2297,7 +2297,9 @@ public void testBalancerWithSortTopNodes() throws Exception {
       maxUsage = Math.max(maxUsage, datanodeReport[i].getDfsUsed());
     }
 
-    assertEquals(200, balancerResult.bytesAlreadyMoved);
+    // Either 2 blocks of 100+100 bytes or 3 blocks of 100+100+50 bytes

Review comment:
   Could you add some explanation of why this happens?
   
   The 95% usage DN will have 9 blocks of 100 bytes and 1 block of 50 bytes, 
all for the same file. The HDFS balancer will choose a block to move from this 
node randomly. More likely it will be a 100B block. Since that is greater than 
`DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY`, which is 99L (see above settings), it will 
stop here. Total bytes moved from this 95% DN will be 1 block, hence 100B.
   
   However, chances are the first block moved from this 95% DN is the 50B 
block. After this block is moved, the total moved size of 50B is smaller than 
`DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY`, so it will try to move another block. The 
second block will always be 100 bytes, so the total bytes moved from this 95% 
DN will be 2 blocks, hence 150B (100B + 50B).
   
   Please reword or rephrase this as a comment before the assertion so readers 
can get the context without having to think it through again.
   
   Thanks,
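   For illustration, a rough sketch of how that comment and assertion might 
read, assuming the 200B/250B totals described in this issue (a suggestion, not 
the actual patch):

       // The 95%-usage DN holds 9 blocks of 100B and 1 block of 50B. The
       // balancer picks a block at random: a 100B pick already exceeds the
       // 99-byte max-size-to-move and stops after one block, while a 50B pick
       // leaves room for a second 100B block, so 200B or 250B moves in total.
       assertTrue("Total bytes moved should be 200 or 250 but was "
               + balancerResult.bytesAlreadyMoved,
           balancerResult.bytesAlreadyMoved == 200
               || balancerResult.bytesAlreadyMoved == 250);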




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568785)
Time Spent: 50m  (was: 40m)

> Flaky test TestBalancer#testBalancerWithSortTopNodes()
> --
>
> Key: HDFS-15904
> URL: https://issues.apache.org/jira/browse/HDFS-15904
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestBalancer#testBalancerWithSortTopNodes shows some flakiness in roughly 1 
> in ~10 runs or so. It's reproducible locally as well. Basically, balancing 
> either moves 2 blocks of size 100+100 bytes or it moves 3 blocks of size 
> 100+100+50 bytes (the 2nd case causes the flakiness).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15894) Trace Time-consuming RPC response of certain threshold.

2021-03-18 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304633#comment-17304633
 ] 

Renukaprasad C commented on HDFS-15894:
---

Uploaded patch HDFS-15894.003.patch with all the static check fixes. The 
failed tests are not related to the code changes.
[~surendralilhore] Can you please help review the changes?

> Trace Time-consuming RPC response of certain threshold.
> ---
>
> Key: HDFS-15894
> URL: https://issues.apache.org/jira/browse/HDFS-15894
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Attachments: HDFS-15894.001.patch, HDFS-15894.002.patch, 
> HDFS-15894.003.patch
>
>
> Monitor & trace time-consuming RPC requests.
> Sometimes RPC requests get delayed, which impacts system performance. 
> Currently, there is no tracking for delayed RPC requests. 
> We can log such delayed RPC calls when they exceed a certain threshold.
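As a minimal sketch of the idea (callStartMs, slowRpcThresholdMs, call, and 
the log wording are illustrative assumptions, not the attached patch):

    // Log RPC calls whose processing time exceeds a configured threshold.
    long elapsedMs = Time.monotonicNow() - callStartMs;
    if (elapsedMs > slowRpcThresholdMs) {
      LOG.warn("Slow RPC: {} took {} ms (threshold {} ms)",
          call, elapsedMs, slowRpcThresholdMs);
    }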



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568765&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568765
 ]

ASF GitHub Bot logged work on HDFS-15904:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 03:55
Start Date: 19/Mar/21 03:55
Worklog Time Spent: 10m 
  Work Description: virajjasani commented on pull request #2785:
URL: https://github.com/apache/hadoop/pull/2785#issuecomment-802532273


   Could you please take a look @liuml07 @tasanuma ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568765)
Time Spent: 40m  (was: 0.5h)

> Flaky test TestBalancer#testBalancerWithSortTopNodes()
> --
>
> Key: HDFS-15904
> URL: https://issues.apache.org/jira/browse/HDFS-15904
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> TestBalancer#testBalancerWithSortTopNodes shows some flakiness in roughly 1 
> in ~10 runs or so. It's reproducible locally as well. Basically, balancing 
> either moves 2 blocks of size 100+100 bytes or it moves 3 blocks of size 
> 100+100+50 bytes (the 2nd case causes the flakiness).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15900?focusedWorklogId=568755&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568755
 ]

ASF GitHub Bot logged work on HDFS-15900:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 03:40
Start Date: 19/Mar/21 03:40
Worklog Time Spent: 10m 
  Work Description: goiri commented on a change in pull request #2787:
URL: https://github.com/apache/hadoop/pull/2787#discussion_r597386195



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/resolver/FederationNamespaceInfo.java
##
@@ -75,4 +76,27 @@ public String getBlockPoolId() {
   public String toString() {
     return this.nameserviceId + "->" + this.blockPoolId + ":" + this.clusterId;
   }
-}
\ No newline at end of file
+
+  @Override
+  public boolean equals(Object obj) {
+    if (this == obj) {
+      return true;
+    }
+    if (obj instanceof FederationNamespaceInfo) {

Review comment:
   There is also EqualsBuilder in commons.
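   For reference, a sketch of what the EqualsBuilder variant could look like, 
assuming org.apache.commons.lang3.builder.EqualsBuilder and the three fields 
shown in toString() (illustrative, not the patch):

       @Override
       public boolean equals(Object obj) {
         if (this == obj) {
           return true;
         }
         if (obj instanceof FederationNamespaceInfo) {
           FederationNamespaceInfo other = (FederationNamespaceInfo) obj;
           // Compare the same fields that toString() exposes.
           return new EqualsBuilder()
               .append(nameserviceId, other.nameserviceId)
               .append(blockPoolId, other.blockPoolId)
               .append(clusterId, other.clusterId)
               .isEquals();
         }
         return false;
       }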

##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java
##
@@ -213,12 +213,14 @@ public boolean loadCache(boolean force) throws IOException {
         nnRegistrations.put(nnId, nnRegistration);
       }
       nnRegistration.add(membership);
-      String bpId = membership.getBlockPoolId();
-      String cId = membership.getClusterId();
-      String nsId = membership.getNameserviceId();
-      FederationNamespaceInfo nsInfo =
-          new FederationNamespaceInfo(bpId, cId, nsId);
-      this.activeNamespaces.add(nsInfo);
+      if (membership.getState() != FederationNamenodeServiceState.UNAVAILABLE) {
+        String bpId = membership.getBlockPoolId();

Review comment:
   Is there any test we can do for this?
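   For what it's worth, a sketch of one possible test shape; the values here 
are made up, and the actual patch adds TestFederationNamespaceInfo, quoted 
later in this thread:

       // Two distinct FederationNamespaceInfo values should both survive in a
       // set, which is what loadCache() relies on when collecting the active
       // namespaces.
       Set<FederationNamespaceInfo> set = new TreeSet<>();
       set.add(new FederationNamespaceInfo("bp-1", "cluster-1", "ns-1"));
       set.add(new FederationNamespaceInfo("bp-2", "cluster-2", "ns-2"));
       assertEquals(2, set.size());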




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568755)
Time Spent: 40m  (was: 0.5h)

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> ---
>
> Key: HDFS-15900
> URL: https://issues.apache.org/jira/browse/HDFS-15900
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Major
>  Labels: pull-request-available
> Attachments: image.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding 
> blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter is 
> unintentionally reset to empty, its initial value.
>  !image.png|height=250!
> As a result of this, concat operations through dfsrouter fail with the 
> following error as it cannot resolve the block id in the recognized active 
> namespaces.
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations, and set 
> proper namespace information obtained from available NameNode registrations 
> when constructing the cache of active namespaces.
>  
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-15900:
---
Status: Patch Available  (was: Open)

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> ---
>
> Key: HDFS-15900
> URL: https://issues.apache.org/jira/browse/HDFS-15900
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Major
>  Labels: pull-request-available
> Attachments: image.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding 
> blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter is 
> unintentionally reset to empty, its initial value.
>  !image.png|height=250!
> As a result of this, concat operations through dfsrouter fail with the 
> following error as it cannot resolve the block id in the recognized active 
> namespaces.
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations, and set 
> proper namespace information obtained from available NameNode registrations 
> when constructing the cache of active namespaces.
>  
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15900?focusedWorklogId=568744&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568744
 ]

ASF GitHub Bot logged work on HDFS-15900:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 03:15
Start Date: 19/Mar/21 03:15
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2787:
URL: https://github.com/apache/hadoop/pull/2787#issuecomment-802513718


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 37s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 35s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 42s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 17s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  14m 13s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 18s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2787/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 1 new + 0 
unchanged - 0 fixed = 1 total (was 0)  |
   | +1 :green_heart: |  mvnsite  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   0m 47s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  14m  5s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |  17m 37s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2787/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 33s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   |  91m 38s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.federation.router.TestRouterRpcMultiDestination |
   |   | hadoop.hdfs.server.federation.router.TestRouterRpc |
   |   | hadoop.hdfs.server.federation.resolver.TestFederationNamespaceInfo |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2787/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2787 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 52e780355c42 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 
23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 8983b9f309e260d69c4fa932e38aaf418da515c2 |
   | Default 

[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2021-03-18 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304607#comment-17304607
 ] 

JiangHua Zhu commented on HDFS-15901:
-

[~kihwal], thank you very much for your message.
I think the FBR lease mechanism is still needed, because it reduces the 
pressure on the NN. It is just that while the NN is restarting, once the FBR 
for a DN has completed, the NN should not allow that DN to perform a new FBR 
again.
[~weichiu], do you have any other good opinions?


> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we restart the NameNode 
> service, all DataNodes send a full block report to the NameNode. During 
> SafeMode, some DataNodes may send their blocks to the NameNode multiple 
> times, which takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will stay in Safe Mode indefinitely.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(:port, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread Harunobu Daikoku (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304580#comment-17304580
 ] 

Harunobu Daikoku commented on HDFS-15900:
-

Thanks for the explanation.
I have made the two fixes above and submitted the patch.

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> ---
>
> Key: HDFS-15900
> URL: https://issues.apache.org/jira/browse/HDFS-15900
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Major
>  Labels: pull-request-available
> Attachments: image.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding 
> blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter is 
> unintentionally reset to empty, its initial value.
>  !image.png|height=250!
> As a result of this, concat operations through dfsrouter fail with the 
> following error as it cannot resolve the block id in the recognized active 
> namespaces.
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations, and set 
> proper namespace information obtained from available NameNode registrations 
> when constructing the cache of active namespaces.
>  
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15900?focusedWorklogId=568728&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568728
 ]

ASF GitHub Bot logged work on HDFS-15900:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 01:46
Start Date: 19/Mar/21 01:46
Worklog Time Spent: 10m 
  Work Description: hdaikoku commented on a change in pull request #2787:
URL: https://github.com/apache/hadoop/pull/2787#discussion_r597352892



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/resolver/TestFederationNamespaceInfo.java
##
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hdfs.server.federation.resolver;
+
+import org.junit.Test;
+
+import java.util.Set;
+import java.util.TreeSet;
+
+import static org.assertj.core.api.Assertions.assertThat;
+
+public class TestFederationNamespaceInfo {
+  /**
+   * Regression test for HDFS-15900.

Review comment:
   Provided courtesy of @aajisaka 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568728)
Time Spent: 20m  (was: 10m)

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> ---
>
> Key: HDFS-15900
> URL: https://issues.apache.org/jira/browse/HDFS-15900
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Major
>  Labels: pull-request-available
> Attachments: image.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding 
> blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter is 
> unintentionally reset to empty, its initial value.
>  !image.png|height=250!
> As a result of this, concat operations through dfsrouter fail with the 
> following error as it cannot resolve the block id in the recognized active 
> namespaces.
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations, and set 
> proper namespace information obtained from available NameNode registrations 
> when constructing the cache of active namespaces.
>  
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15900?focusedWorklogId=568726&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568726
 ]

ASF GitHub Bot logged work on HDFS-15900:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 01:42
Start Date: 19/Mar/21 01:42
Worklog Time Spent: 10m 
  Work Description: hdaikoku opened a new pull request #2787:
URL: https://github.com/apache/hadoop/pull/2787


   https://issues.apache.org/jira/browse/HDFS-15900
   
   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.)
   For more details, please see 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568726)
Remaining Estimate: 0h
Time Spent: 10m

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> ---
>
> Key: HDFS-15900
> URL: https://issues.apache.org/jira/browse/HDFS-15900
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Major
> Attachments: image.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding 
> blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter is 
> unintentionally reset to empty, its initial value.
>  !image.png|height=250!
> As a result of this, concat operations through dfsrouter fail with the 
> following error as it cannot resolve the block id in the recognized active 
> namespaces.
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations, and set 
> proper namespace information obtained from available NameNode registrations 
> when constructing the cache of active namespaces.
>  
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15900:
--
Labels: pull-request-available  (was: )

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> ---
>
> Key: HDFS-15900
> URL: https://issues.apache.org/jira/browse/HDFS-15900
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Major
>  Labels: pull-request-available
> Attachments: image.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding 
> blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter is 
> unintentionally reset to empty, its initial value.
>  !image.png|height=250!
> As a result of this, concat operations through dfsrouter fail with the 
> following error as it cannot resolve the block id in the recognized active 
> namespaces.
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations, and set 
> proper namespace information obtained from available NameNode registrations 
> when constructing the cache of active namespaces.
>  
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568723&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568723
 ]

ASF GitHub Bot logged work on HDFS-15759:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 01:39
Start Date: 19/Mar/21 01:39
Worklog Time Spent: 10m 
  Work Description: runitao commented on a change in pull request #2585:
URL: https://github.com/apache/hadoop/pull/2585#discussion_r597350941



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/erasurecode/StripedBlockReconstructor.java
##
@@ -126,12 +128,26 @@ private void reconstructTargets(int toReconstructLen) throws IOException {
     int[] erasedIndices = stripedWriter.getRealTargetIndices();
     ByteBuffer[] outputs = stripedWriter.getRealTargetBuffers(toReconstructLen);
 
+    if (isValidationEnabled()) {
+      markBuffers(inputs);
+      decode(inputs, erasedIndices, outputs);
+      resetBuffers(inputs);
+
+      DataNodeFaultInjector.get().badDecoding(outputs);
+      getValidator().validate(inputs, erasedIndices, outputs);

Review comment:
   +1,I have no other suggestion.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568723)
Time Spent: 5h 40m  (was: 5.5h)

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover it.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies correctness of outputs decoded from inputs as follows:
> 1. Decoding an input with the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.
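A minimal sketch of the verification step for RS-6-3, assuming a 
RawErasureDecoder named decoder and per-unit ByteBuffers are in scope 
(illustrative only, not the patch itself):

    // Outputs [d1, p1] were reconstructed from [d0, d2, d3, d4, d5, p0].
    // Re-decode d0 (unit index 0) from the surviving data units plus the
    // freshly decoded p1, then compare it with the original d0.
    ByteBuffer decodedD0 = ByteBuffer.allocate(d0.remaining());
    ByteBuffer[] verifyInputs =                       // RS-6-3 has 9 units in
        {null, d1, d2, d3, d4, d5, null, p1, null};   // order d0..d5, p0..p2
    decoder.decode(verifyInputs, new int[] {0}, new ByteBuffer[] {decodedD0});
    if (!decodedD0.equals(d0)) {
      throw new IOException("EC reconstruction validation failed for d0");
    }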



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568724&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568724
 ]

ASF GitHub Bot logged work on HDFS-15759:
-

Author: ASF GitHub Bot
Created on: 19/Mar/21 01:39
Start Date: 19/Mar/21 01:39
Worklog Time Spent: 10m 
  Work Description: runitao commented on a change in pull request #2585:
URL: https://github.com/apache/hadoop/pull/2585#discussion_r597351035



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestReconstructStripedFileWithValidator.java
##
@@ -0,0 +1,98 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hdfs;
+
+import org.apache.hadoop.hdfs.server.datanode.DataNodeFaultInjector;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.nio.ByteBuffer;
+import java.util.concurrent.atomic.AtomicBoolean;
+
+/**
+ * This test extends {@link TestReconstructStripedFile} to test
+ * ec reconstruction validation.
+ */
+public class TestReconstructStripedFileWithValidator
+    extends TestReconstructStripedFile {
+  private static final Logger LOG =
+      LoggerFactory.getLogger(TestReconstructStripedFileWithValidator.class);
+
+  public TestReconstructStripedFileWithValidator() {
+    LOG.info("run {} with validator.",
+        TestReconstructStripedFileWithValidator.class.getSuperclass()
+            .getSimpleName());
+  }
+
+  /**
+   * This test injects data pollution into the decoded outputs once.
+   * When validation is enabled, the first reconstruction task should fail
+   * the validation, but the data will be recovered correctly
+   * by the next task.
+   * On the other hand, when validation is disabled, the first reconstruction
+   * task will succeed and then lead to data corruption.
+   */
+  @Test(timeout = 12)
+  public void testValidatorWithBadDecoding()
+      throws Exception {
+    DataNodeFaultInjector oldInjector = DataNodeFaultInjector.get();
+    DataNodeFaultInjector badDecodingInjector = new DataNodeFaultInjector() {
+      private final AtomicBoolean flag = new AtomicBoolean(false);
+
+      @Override
+      public void badDecoding(ByteBuffer[] outputs) {
+        if (!flag.get()) {
+          for (ByteBuffer output : outputs) {
+            output.mark();
+            output.put((byte) (output.get(output.position()) + 1));
+            output.reset();
+          }
+        }
+        flag.set(true);
+      }
+    };
+    DataNodeFaultInjector.set(badDecodingInjector);
+    int fileLen =
+        (getEcPolicy().getNumDataUnits() + getEcPolicy().getNumParityUnits())
+            * getBlockSize() + getBlockSize() / 10;
+    try {
+      assertFileBlocksReconstruction(
+          "/testValidatorWithBadDecoding",
+          fileLen,
+          ReconstructionType.DataOnly,
+          getEcPolicy().getNumParityUnits());

Review comment:
   +1 too




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568724)
Time Spent: 5h 50m  (was: 5h 40m)

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> 

[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2021-03-18 Thread Kihwal Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304503#comment-17304503
 ] 

Kihwal Lee edited comment on HDFS-15901 at 3/18/21, 10:32 PM:
--

The block report lease feature is supposed to improve this, but it ended up 
causing more problems in our experience.  One of the main reasons for duplicate 
reporting is the lack of ability to retransmit a single report on rpc timeout.  
On startup, the NN's call queue can be easily overwhelmed since FBR processing 
is relatively slow. It is common to see the processing of a single storage 
taking 100s of milliseconds. A half dozen storage reports can take up a whole 
second. You can easily imagine more than 60 seconds' worth of reports waiting 
in the call queue, which will cause a timeout for some of the reports. 
Unfortunately, the datanode's full block reporting does not retransmit only the 
affected report.  It regenerates the whole thing and starts all over again.  
Even if only the last storage FBR had trouble, it will retransmit everything 
again.

The reason why it sometimes gets stuck in safe mode is likely the curse of the 
block report lease. When an FBR is retransmitted, the feature will make the NN 
drop the reports.  We have seen this happening in big clusters.  If the block 
report lease wasn't there, it wouldn't have gotten stuck in safe mode.

We have recently gutted the FBR lease feature internally and implemented a new 
block report flow control system.  It was designed by [~daryn].  It hasn't been 
fully tested yet, so we haven't shared it with the community. 


was (Author: kihwal):
The block report lease feature is supposed to improve this, but it ended up 
causing more problems in our experience. One of the main reasons for duplicate 
reporting is the inability to retransmit a single report on RPC timeout. On 
startup, the NN's call queue can easily be overwhelmed since FBR processing is 
relatively slow. It is common to see the processing of a single storage take 
hundreds of milliseconds, so a half dozen storage reports can take up a whole 
second. If you have enough reports in the call queue, the queue time can 
easily exceed the 60-second timeout for some of the nodes. Unfortunately, the 
datanode's full block reporting does not retransmit only the affected report. 
It regenerates the whole thing and starts all over again. Even if only the 
last storage FBR had trouble, it will retransmit everything.

The reason it sometimes gets stuck in safe mode is likely the curse of the 
block report lease. When an FBR is retransmitted, the feature makes the NN 
drop the reports. We have seen this happen in big clusters. If the block 
report lease weren't there, it wouldn't have gotten stuck in safe mode.

We have recently gutted the FBR lease feature internally and implemented a new 
block report flow control system. It was designed by [~daryn]. It hasn't been 
fully tested yet, so we haven't shared it with the community. 

> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a cluster exceeds thousands of nodes and we restart the NameNode 
> service, all DataNodes send a full block report to the NameNode. During 
> SafeMode, some DataNodes may send their blocks to the NameNode multiple 
> times, which takes up too many RPC resources. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will stay in Safe Mode forever.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(:port, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.

[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2021-03-18 Thread Kihwal Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304503#comment-17304503
 ] 

Kihwal Lee commented on HDFS-15901:
---

The block report lease feature is supposed to improve this, but it ended up 
causing more problems in our experience. One of the main reasons for duplicate 
reporting is the inability to retransmit a single report on RPC timeout. On 
startup, the NN's call queue can easily be overwhelmed since FBR processing is 
relatively slow. It is common to see the processing of a single storage take 
hundreds of milliseconds, so a half dozen storage reports can take up a whole 
second. If you have enough reports in the call queue, the queue time can 
easily exceed the 60-second timeout for some of the nodes. Unfortunately, the 
datanode's full block reporting does not retransmit only the affected report. 
It regenerates the whole thing and starts all over again. Even if only the 
last storage FBR had trouble, it will retransmit everything.

The reason it sometimes gets stuck in safe mode is likely the curse of the 
block report lease. When an FBR is retransmitted, the feature makes the NN 
drop the reports. We have seen this happen in big clusters. If the block 
report lease weren't there, it wouldn't have gotten stuck in safe mode.

We have recently gutted the FBR lease feature internally and implemented a new 
block report flow control system. It was designed by [~daryn]. It hasn't been 
fully tested yet, so we haven't shared it with the community. 
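
To make the retransmission point concrete, here is a minimal sketch of 
per-storage retry under a simplified sender interface. All names here are 
illustrative assumptions, not the DataNode's actual FBR code path:

{code:java}
import java.util.List;

/**
 * Minimal sketch of per-storage retry: on a timeout, retransmit only the
 * storage report that failed instead of regenerating and resending the
 * whole FBR. Illustrative names only, not the DataNode's actual code path.
 */
class PerStorageRetrySketch {
  static class ReportTimeoutException extends Exception { }

  interface ReportSender {
    void send(String storageReport) throws ReportTimeoutException;
  }

  static void sendAll(List<String> storageReports, ReportSender sender,
      int maxRetries) throws ReportTimeoutException {
    for (String report : storageReports) {
      for (int attempt = 0; ; attempt++) {
        try {
          sender.send(report);
          break;               // this storage is done; move to the next one
        } catch (ReportTimeoutException e) {
          if (attempt >= maxRetries) {
            throw e;           // give up on this storage's report only
          }
          // back off here, then retransmit just this storage's report;
          // already-acknowledged storages are never resent
        }
      }
    }
  }
}
{code}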

> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a cluster exceeds thousands of nodes and we restart the NameNode 
> service, all DataNodes send a full block report to the NameNode. During 
> SafeMode, some DataNodes may send their blocks to the NameNode multiple 
> times, which takes up too many RPC resources. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will stay in Safe Mode forever.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(:port, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15905) RBF: Improve Router performance with router redirection

2021-03-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1730#comment-1730
 ] 

Íñigo Goiri commented on HDFS-15905:


If I understand correctly, the proposal is to extend the client to query the 
Router and then contact the subcluster directly?
To be honest, this is very similar to ViewFs; you could potentially extend 
ViewFs to request the mount table from the Router.
I'll let others chime in on the token aspect.

Regarding the performance issues, how many Routers are you using for how many 
namenodes?

> RBF: Improve Router performance with router redirection
> ---
>
> Key: HDFS-15905
> URL: https://issues.apache.org/jira/browse/HDFS-15905
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Affects Versions: 3.1.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>
> The Router implementation currently takes a proxy approach to handling client 
> requests: the routers receive requests from the clients and send them to the 
> target clusters on behalf of the clients.
> This approach works well, but after moving more clusters on top of the 
> routers, we are seeing the routers become a bottleneck: without RBF, each 
> client manages its own connections, while with RBF, a limited number of 
> routers manage many more connections on behalf of all clients; we also keep 
> idle connections to boost connection performance. We have done some work to 
> tune connection management but it doesn't help much.
> We are proposing to reduce the functionality on the router side and use the 
> routers as actual routers instead of proxies: the clients talk to a router to 
> resolve the target cluster for a given path and to get a router delegation 
> token; the clients then send their requests directly to the target cluster.
> A big challenge here is token authentication against the target cluster with 
> only a router token. One approach: the router returns a target cluster token 
> along with the router token so the clients can authenticate against the 
> target cluster. A second approach: similar to the block token mechanism, the 
> router exchanges secret keys with the target clusters through heartbeats so 
> the clients can authenticate with the target cluster using the router token.
> I would like to know your feedback.
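
For illustration, a minimal sketch of the redirection flow described above, 
with purely hypothetical names (this is not the RBF API): one round trip to 
the Router resolves the path and returns the tokens, after which the client 
contacts the target NameNode directly:

{code:java}
/**
 * Hypothetical sketch of the proposed redirection flow (illustrative
 * names, not the RBF API): the Router resolves a path to a subcluster and
 * returns both a router token and a target-cluster token, and the client
 * then talks to the target NameNode directly.
 */
class RouterRedirectSketch {
  record Resolution(String targetNameNode, String routerToken,
                    String targetClusterToken) { }

  interface Router {
    Resolution resolve(String path);   // one metadata-only round trip
  }

  static String read(Router router, String path) {
    Resolution r = router.resolve(path);
    // From here on the Router is bypassed: the target-cluster token lets
    // the client authenticate against the subcluster's NameNode directly.
    return "read " + path + " from " + r.targetNameNode()
        + " with token " + r.targetClusterToken();
  }

  public static void main(String[] args) {
    Router router = p -> new Resolution("nn1.subcluster2.example:8020",
        "router-token", "subcluster2-token");
    System.out.println(read(router, "/data/file"));
  }
}
{code}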



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15905) RBF: Improve Router performance with router redirection

2021-03-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-15905:
---
Summary: RBF: Improve Router performance with router redirection  (was: 
Improve Router performance with router redirection)

> RBF: Improve Router performance with router redirection
> ---
>
> Key: HDFS-15905
> URL: https://issues.apache.org/jira/browse/HDFS-15905
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Affects Versions: 3.1.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>
> Router implementation currently takes the proxy approach to handle the client 
> requests: the routers receive the requests from the clients and send the 
> requests to the target clusters on behalf of the clients. 
> This approach works well,  while after moving more clusters on top of 
> routers, we are seeing that routers are becoming the bottleneck since e.g., 
> without RBF, the clients themselves manage the connections for themselves, 
> while with RBF, the limited routers manage much more connections for the 
> clients; we also keep idle connections to boost the connection performance. 
> We have done some work to tune connection management but it doesn't help much.
> We are proposing to reduce the functionality on the router side and use them 
> as actual router instead of proxy: the clients talk to routers to resolve 
> target cluster info given a path and get router delegation token; the clients 
> directly send the requests to target cluster.
> A big challenge here is the token authentication against target cluster with 
> router token only. One approach: we can ask router to return target cluster 
> token along with router token so the clients can authenticate against target 
> cluster. Second approach:  similar to block token mechanism, the router 
> exchanges secret keys with target clusters through heart-beats so the clients 
> can authenticate with target cluster with that router token.
> I would like to know your feedback.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568603=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568603
 ]

ASF GitHub Bot logged work on HDFS-15904:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 19:13
Start Date: 18/Mar/21 19:13
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2785:
URL: https://github.com/apache/hadoop/pull/2785#issuecomment-802218023


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 59s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  2s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  34m 46s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 22s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  3s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 21s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 17s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  18m 40s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m  9s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 55s |  |  
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 272 unchanged - 4 
fixed = 272 total (was 276)  |
   | +1 :green_heart: |  mvnsite  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 46s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 16s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 20s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  19m  7s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 363m 53s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2785/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 36s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 457m  5s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
   |   | hadoop.hdfs.TestDFSShell |
   |   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   |   | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList |
   |   | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots |
   |   | hadoop.hdfs.server.datanode.TestBlockScanner |
   |   | 
hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.TestPersistBlocks |
   |   | hadoop.hdfs.TestViewDistributedFileSystem |
   |   | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
   |   | hadoop.hdfs.TestViewDistributedFileSystemContract |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2785/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2785 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle 

[jira] [Work logged] (HDFS-15868) Possible Resource Leak in EditLogFileOutputStream

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15868?focusedWorklogId=568591=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568591
 ]

ASF GitHub Bot logged work on HDFS-15868:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 18:54
Start Date: 18/Mar/21 18:54
Worklog Time Spent: 10m 
  Work Description: Nargeshdb commented on pull request #2736:
URL: https://github.com/apache/hadoop/pull/2736#issuecomment-802206225


   Should we expect [these 
tests](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2736/5/testReport/)
 to pass or not?
   I checked the test failures in 
[this](https://github.com/apache/hadoop/pull/2784) PR that was made yesterday 
and found 8 failures common to the two PRs. I was wondering if you could 
confirm that these failures are expected. @Hexiaoqiao
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568591)
Time Spent: 2h 20m  (was: 2h 10m)

> Possible Resource Leak in EditLogFileOutputStream
> -
>
> Key: HDFS-15868
> URL: https://issues.apache.org/jira/browse/HDFS-15868
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Narges Shadab
>Assignee: Narges Shadab
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> We noticed a possible resource leak 
> [here|https://github.com/apache/hadoop/blob/1f1a1ef52df896a2b66b16f5bbc17aa39b1a1dd7/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java#L91].
>  If an I/O error occurs at line 91, rp remains open since the exception isn't 
> caught locally, and there is no way for any caller to close the 
> RandomAccessFile.
>  I'll submit a pull request to fix it.
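
For context, here is a minimal sketch of the leak-free pattern the report 
calls for, assuming (as the description says) that a RandomAccessFile is 
opened and a later I/O call may throw. The class and method names are 
simplified stand-ins, not the exact EditLogFileOutputStream code:

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import org.apache.hadoop.io.IOUtils;

class CloseOnFailureSketch {
  // Close rp locally if the follow-up I/O throws, since no caller ever
  // receives a reference to it and therefore could not close it.
  static FileOutputStream openForAppend(File name) throws IOException {
    RandomAccessFile rp = new RandomAccessFile(name, "rw");
    try {
      rp.seek(rp.length());                  // may throw IOException
      return new FileOutputStream(rp.getFD());
    } catch (IOException e) {
      IOUtils.closeStream(rp);               // avoid leaking the descriptor
      throw e;
    }
  }
}
{code}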



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568589=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568589
 ]

ASF GitHub Bot logged work on HDFS-15904:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 18:50
Start Date: 18/Mar/21 18:50
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2785:
URL: https://github.com/apache/hadoop/pull/2785#issuecomment-802203762


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  2s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  34m 57s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   1m 12s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 59s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 21s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 22s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 16s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  18m 31s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  6s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 55s |  |  
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 272 unchanged - 4 
fixed = 272 total (was 276)  |
   | +1 :green_heart: |  mvnsite  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 49s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 19s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 37s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 342m 45s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2785/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 39s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 435m 13s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS |
   |   | hadoop.hdfs.server.datanode.TestIncrementalBrVariations |
   |   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   |   | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots |
   |   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
   |   | hadoop.hdfs.TestDFSShell |
   |   | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList |
   |   | hadoop.hdfs.server.datanode.TestBlockScanner |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.TestPersistBlocks |
   |   | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
   |   | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2785/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2785 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs 

[jira] [Work logged] (HDFS-15868) Possible Resource Leak in EditLogFileOutputStream

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15868?focusedWorklogId=568528=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568528
 ]

ASF GitHub Bot logged work on HDFS-15868:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 17:43
Start Date: 18/Mar/21 17:43
Worklog Time Spent: 10m 
  Work Description: Nargeshdb commented on pull request #2736:
URL: https://github.com/apache/hadoop/pull/2736#issuecomment-802156536


   @Hexiaoqiao
   We are investigating the test failures.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568528)
Time Spent: 2h 10m  (was: 2h)

> Possible Resource Leak in EditLogFileOutputStream
> -
>
> Key: HDFS-15868
> URL: https://issues.apache.org/jira/browse/HDFS-15868
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Narges Shadab
>Assignee: Narges Shadab
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We noticed a possible resource leak 
> [here|https://github.com/apache/hadoop/blob/1f1a1ef52df896a2b66b16f5bbc17aa39b1a1dd7/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java#L91].
>  If an I/O error occurs at line 91, rp remains open since the exception isn't 
> caught locally, and there is no way for any caller to close the 
> RandomAccessFile.
>  I'll submit a pull request to fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15874?focusedWorklogId=568521=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568521
 ]

ASF GitHub Bot logged work on HDFS-15874:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 17:36
Start Date: 18/Mar/21 17:36
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2744:
URL: https://github.com/apache/hadoop/pull/2744#issuecomment-802151603


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 36s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 20s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  20m  7s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   4m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   4m 27s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 19s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m  2s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   2m 24s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   4m 20s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  13m 59s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 41s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   4m 45s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   4m 45s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   4m 22s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   4m 22s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 11s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2744/5/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 14 new + 697 unchanged - 1 fixed = 
711 total (was 698)  |
   | +1 :green_heart: |  mvnsite  |   1m 45s |  |  the patch passed  |
   | +1 :green_heart: |  xml  |   0m  1s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  javadoc  |   1m 17s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   2m 13s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   4m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  14m  3s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 227m 18s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2744/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  unit  |  17m 41s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 44s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 353m 55s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2744/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2744 |
   | JIRA Issue | HDFS-15874 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell xml |
   | uname | Linux 52dbd6584158 4.15.0-58-generic 

[jira] [Commented] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-18 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304307#comment-17304307
 ] 

Hadoop QA commented on HDFS-15874:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
36s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} codespell {color} | {color:blue}  0m  
0s{color} |  | {color:blue} codespell was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch appears to include 3 new or modified 
test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 14m 
20s{color} |  | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
 7s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
51s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
27s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
19s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m  
2s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
35s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
24s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  4m 
20s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 59s{color} |  | {color:green} branch has no errors when building and 
testing our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
27s{color} |  | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
41s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
45s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  4m 
45s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
22s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  4m 
22s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} blanks {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch has no blanks issues. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 11s{color} | 
[/results-checkstyle-hadoop-hdfs-project.txt|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2744/5/artifact/out/results-checkstyle-hadoop-hdfs-project.txt]
 | {color:orange} hadoop-hdfs-project: The patch generated 14 new + 697 
unchanged - 1 fixed = 711 total (was 698) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
45s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} |  | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
17s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
13s{color} |  | {color:green} the patch passed with JDK Private 

[jira] [Commented] (HDFS-15905) Improve Router performance with router redirection

2021-03-18 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304294#comment-17304294
 ] 

Aihua Xu commented on HDFS-15905:
-

[~elgoiri], [~jingzhao], [~fengnanli] Can you provide any feedback/suggestion? 
Thanks a lot.

> Improve Router performance with router redirection
> --
>
> Key: HDFS-15905
> URL: https://issues.apache.org/jira/browse/HDFS-15905
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Affects Versions: 3.1.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>
> The Router implementation currently takes a proxy approach to handling client 
> requests: the routers receive requests from the clients and send them to the 
> target clusters on behalf of the clients.
> This approach works well, but after moving more clusters on top of the 
> routers, we are seeing the routers become a bottleneck: without RBF, each 
> client manages its own connections, while with RBF, a limited number of 
> routers manage many more connections on behalf of all clients; we also keep 
> idle connections to boost connection performance. We have done some work to 
> tune connection management but it doesn't help much.
> We are proposing to reduce the functionality on the router side and use the 
> routers as actual routers instead of proxies: the clients talk to a router to 
> resolve the target cluster for a given path and to get a router delegation 
> token; the clients then send their requests directly to the target cluster.
> A big challenge here is token authentication against the target cluster with 
> only a router token. One approach: the router returns a target cluster token 
> along with the router token so the clients can authenticate against the 
> target cluster. A second approach: similar to the block token mechanism, the 
> router exchanges secret keys with the target clusters through heartbeats so 
> the clients can authenticate with the target cluster using the router token.
> I would like to know your feedback.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15905) Improve Router performance with router redirection

2021-03-18 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-15905:
---

 Summary: Improve Router performance with router redirection
 Key: HDFS-15905
 URL: https://issues.apache.org/jira/browse/HDFS-15905
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: rbf
Affects Versions: 3.1.0
Reporter: Aihua Xu
Assignee: Aihua Xu


The Router implementation currently takes a proxy approach to handling client 
requests: the routers receive requests from the clients and send them to the 
target clusters on behalf of the clients.

This approach works well, but after moving more clusters on top of the 
routers, we are seeing the routers become a bottleneck: without RBF, each 
client manages its own connections, while with RBF, a limited number of 
routers manage many more connections on behalf of all clients; we also keep 
idle connections to boost connection performance. We have done some work to 
tune connection management but it doesn't help much.

We are proposing to reduce the functionality on the router side and use the 
routers as actual routers instead of proxies: the clients talk to a router to 
resolve the target cluster for a given path and to get a router delegation 
token; the clients then send their requests directly to the target cluster.

A big challenge here is token authentication against the target cluster with 
only a router token. One approach: the router returns a target cluster token 
along with the router token so the clients can authenticate against the 
target cluster. A second approach: similar to the block token mechanism, the 
router exchanges secret keys with the target clusters through heartbeats so 
the clients can authenticate with the target cluster using the router token.

I would like to know your feedback.
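
As a rough illustration of the second approach, here is a sketch of 
shared-secret token verification in the style of the block token mechanism. 
Everything here (names, token format, key exchange) is an assumption for 
illustration, not the actual RBF or block token API:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/**
 * Hypothetical sketch of the second approach: the router and the target
 * cluster share a secret key (exchanged through heartbeats), so the target
 * NameNode can verify a router-issued token locally, without a callback
 * to the router. Modeled loosely on the block token mechanism.
 */
class SharedSecretTokenSketch {
  static byte[] sign(byte[] key, String tokenId) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA256");
    mac.init(new SecretKeySpec(key, "HmacSHA256"));
    return mac.doFinal(tokenId.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws Exception {
    byte[] sharedKey = "demo-secret".getBytes(StandardCharsets.UTF_8);

    // Router side: issue a token as (identifier, HMAC over identifier).
    String tokenId = "user=alice,path=/data,expiry=1700000000";
    byte[] password = sign(sharedKey, tokenId);

    // Target cluster side: recompute the HMAC with the shared key.
    boolean valid = Arrays.equals(password, sign(sharedKey, tokenId));
    System.out.println("token valid: " + valid);
  }
}
{code}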



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15894) Trace Time-consuming RPC response of certain threshold.

2021-03-18 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304280#comment-17304280
 ] 

Hadoop QA commented on HDFS-15894:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  2m 
12s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
59s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
33s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
23s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
15s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
32s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
19m 34s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
5s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
34s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 25m 
44s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  3m 
32s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
25s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
29s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
29s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
18s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
18s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 8s{color} | {color:green}{color} | {color:green} 
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 512 unchanged - 1 
fixed = 512 total (was 513) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
26s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green}{color} | {color:green} The patch has no ill-formed 
XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 32s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} 

[jira] [Work logged] (HDFS-15903) Refactor X-Platform library

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15903?focusedWorklogId=568479=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568479
 ]

ASF GitHub Bot logged work on HDFS-15903:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 16:40
Start Date: 18/Mar/21 16:40
Worklog Time Spent: 10m 
  Work Description: GauthamBanasandra commented on pull request #2783:
URL: https://github.com/apache/hadoop/pull/2783#issuecomment-802097264


   @aajisaka could you also please review my PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568479)
Time Spent: 50m  (was: 40m)

> Refactor X-Platform library
> ---
>
> Key: HDFS-15903
> URL: https://issues.apache.org/jira/browse/HDFS-15903
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs++
>Affects Versions: 3.2.2
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> X-Platform started out as a utility to help in writing cross-platform code in 
> Hadoop. As its scope expands to cover various scenarios, it is necessary to 
> refactor it at an early stage to allow proper organization and growth of the 
> X-Platform library.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-18 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304175#comment-17304175
 ] 

Hadoop QA commented on HDFS-15899:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
23s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
40s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for 
branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
56s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m  
4s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
43s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
12s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
11s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 32s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
28s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
56s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 27m 
32s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  5m 
39s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
22s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for 
patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
 4s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m 
42s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  5m 
42s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
45s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  4m 
45s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 8s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m  
0s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green}{color} | {color:green} The patch has no ill-formed 
XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient 

[jira] [Commented] (HDFS-13975) TestBalancer#testMaxIterationTime fails sporadically

2021-03-18 Thread Toshihiko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304146#comment-17304146
 ] 

Toshihiko Uchida commented on HDFS-13975:
-

[~aajisaka] Thanks for your review and commit, too!

> TestBalancer#testMaxIterationTime fails sporadically
> 
>
> Key: HDFS-13975
> URL: https://issues.apache.org/jira/browse/HDFS-13975
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Jason Darrell Lowe
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: flaky-test, pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> A number of precommit builds have seen this test fail like this:
> {noformat}
> java.lang.AssertionError: Unexpected iteration runtime: 4021ms > 3.5s
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> org.apache.hadoop.hdfs.server.balancer.TestBalancer.testMaxIterationTime(TestBalancer.java:1649)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
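
The assertion above compares wall-clock runtime against a tight bound, which 
is fragile on loaded CI hosts. As one illustrative way to reduce that kind of 
flakiness (a hypothetical sketch, not the fix that was committed), the bound 
can carry an explicit slack factor:

{code:java}
import static org.junit.Assert.assertTrue;

class TimingAssertionSketch {
  // Hypothetical helper: allow a slack factor over the expected runtime so
  // scheduler jitter on shared CI hosts does not trip the assertion.
  static void assertRuntimeWithin(long elapsedMs, long expectedMs,
      double slackFactor) {
    long boundMs = (long) (expectedMs * slackFactor);
    assertTrue("Unexpected iteration runtime: " + elapsedMs + "ms > "
        + boundMs + "ms", elapsedMs <= boundMs);
  }

  public static void main(String[] args) {
    // e.g. ~3.5s of expected work, with 2x headroom for loaded hosts
    assertRuntimeWithin(4021, 3500, 2.0);   // passes: 4021 <= 7000
  }
}
{code}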



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568299=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568299
 ]

ASF GitHub Bot logged work on HDFS-15759:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 12:24
Start Date: 18/Mar/21 12:24
Worklog Time Spent: 10m 
  Work Description: touchida commented on a change in pull request #2585:
URL: https://github.com/apache/hadoop/pull/2585#discussion_r596818639



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/erasurecode/StripedBlockReconstructor.java
##
@@ -126,12 +128,26 @@ private void reconstructTargets(int toReconstructLen) 
throws IOException {
 int[] erasedIndices = stripedWriter.getRealTargetIndices();
 ByteBuffer[] outputs = 
stripedWriter.getRealTargetBuffers(toReconstructLen);
 
+if (isValidationEnabled()) {
+  markBuffers(inputs);
+  decode(inputs, erasedIndices, outputs);
+  resetBuffers(inputs);
+
+  DataNodeFaultInjector.get().badDecoding(outputs);
+  getValidator().validate(inputs, erasedIndices, outputs);

Review comment:
   @runitao
   Thanks for your comment!
   How about adding a metric for the exception, like 
`EcInvalidReconstructionTasks`? (I saw your deleted comment.)
   As for logging, I think it's better to output more messages throughout the 
entire EC reconstruction process, so I'd like to handle that in another issue.
   Are you suggesting anything else?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568299)
Time Spent: 5h  (was: 4h 50m)

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions, and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover it.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on the DataNode at each 
> reconstruction process.
> It verifies the correctness of outputs decoded from inputs as follows:
> 1. Decode an input back from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability, so the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.
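
For readers skimming the digest, the decode-back check can be made concrete 
with a small, self-contained sketch against the raw coder byte-array API 
(coder construction, unit sizes and layout here are illustrative assumptions, 
not code from the patch):

{code:java}
import java.util.Arrays;
import java.util.Random;

import org.apache.hadoop.io.erasurecode.ErasureCoderOptions;
import org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder;
import org.apache.hadoop.io.erasurecode.rawcoder.RSRawEncoder;

public class DecodeBackSketch {
  public static void main(String[] args) throws Exception {
    ErasureCoderOptions opts = new ErasureCoderOptions(6, 3);   // RS-6-3
    RSRawEncoder encoder = new RSRawEncoder(opts);
    RSRawDecoder decoder = new RSRawDecoder(opts);

    // Encode six data units d0..d5 into three parity units p0..p2.
    int len = 16;
    byte[][] data = new byte[6][len];
    for (byte[] d : data) {
      new Random().nextBytes(d);
    }
    byte[][] parity = new byte[3][len];
    encoder.encode(data, parity);

    // Reconstruction: d1 (index 1) and p0 (index 6) are lost. Decode them
    // from [d0, d2, d3, d4, d5, p1]; null marks an unavailable unit.
    byte[][] inputs = {data[0], null, data[2], data[3], data[4], data[5],
        null, parity[1], null};
    byte[][] outputs = new byte[2][len];               // holds [d1', p0']
    decoder.decode(inputs, new int[] {1, 6}, outputs);

    // Validation: decode d0 back from [d1', d2, d3, d4, d5, p1] and compare
    // with the original d0. A bad reconstruction is caught here with high
    // probability.
    byte[][] checkInputs = {null, outputs[0], data[2], data[3], data[4],
        data[5], null, parity[1], null};
    byte[][] decodedD0 = new byte[1][len];
    decoder.decode(checkInputs, new int[] {0}, decodedD0);
    System.out.println("validation passed: "
        + Arrays.equals(data[0], decodedD0[0]));
  }
}
{code}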



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568307&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568307
 ]

ASF GitHub Bot logged work on HDFS-15759:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 12:31
Start Date: 18/Mar/21 12:31
Worklog Time Spent: 10m 
  Work Description: touchida commented on a change in pull request #2585:
URL: https://github.com/apache/hadoop/pull/2585#discussion_r596823600



##
File path: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/erasurecode/rawcoder/DecodingValidator.java
##
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.io.erasurecode.rawcoder;
+
+import org.apache.hadoop.classification.InterfaceAudience;
+import 
org.apache.hadoop.thirdparty.com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.io.erasurecode.ECChunk;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/**
+ * A utility class to validate decoding.
+ */
+@InterfaceAudience.Private
+public class DecodingValidator {
+
+  private final RawErasureDecoder decoder;
+  private ByteBuffer buffer;
+  private int[] newValidIndexes;
+  private int newErasedIndex;
+
+  public DecodingValidator(RawErasureDecoder decoder) {
+this.decoder = decoder;
+  }
+
+  /**
+   * Validate outputs decoded from inputs, by decoding an input back from
+   * the outputs and comparing it with the original one.
+   *
+   * For instance, in RS (6, 3), let (d0, d1, d2, d3, d4, d5) be sources
+   * and (p0, p1, p2) be parities, and assume
+   *  inputs = [d0, null (d1), d2, d3, d4, d5, null (p0), p1, null (p2)];
+   *  erasedIndexes = [1, 6];
+   *  outputs = [d1, p1].
+   * Then
+   *  1. Create new inputs, erasedIndexes and outputs for validation so that
+   * the inputs could contain the decoded outputs, and decode them:
+   *  newInputs = [d1, d2, d3, d4, d5, p1]
+   *  newErasedIndexes = [0]
+   *  newOutputs = [d0']
+   *  2. Compare d0 and d0'. The comparison will fail with high probability
+   * when the initial outputs are wrong.
+   *
+   * Note that the input buffers' positions must be the ones where data is
+   * read: if the input buffers have been processed by a decoder, the buffers'
+   * positions must be reset before being passed into this method.
+   *
+   * This method does not change outputs or erasedIndexes.
+   *
+   * @param inputs input buffers used for decoding. The buffers' positions
+   *   are moved to the end after this method.
+   * @param erasedIndexes indexes of erased units used for decoding
+   * @param outputs decoded output buffers, which are ready to be read after
+   *the call
+   * @throws IOException
+   */
+  public void validate(ByteBuffer[] inputs, int[] erasedIndexes,
+  ByteBuffer[] outputs) throws IOException {
+markBuffers(outputs);
+
+try {
+  ByteBuffer validInput = CoderUtil.findFirstValidInput(inputs);
+  boolean isDirect = validInput.isDirect();
+  int capacity = validInput.capacity();
+  int remaining = validInput.remaining();
+
+  // Init buffer
+  if (buffer == null) {
+buffer = allocateBuffer(isDirect, capacity);
+  } else if (buffer.isDirect() != isDirect
+  || buffer.capacity() < remaining) {
+buffer = allocateBuffer(isDirect, capacity);
+  }

Review comment:
   @aajisaka Thanks for your review and suggestion! 
   It simplifies the code.
   I'll fix it up.
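
The suggested change is not quoted in this digest, but collapsing the two 
branches above into a single condition would presumably look like this sketch:

{code:java}
// Combined buffer-initialization condition (same behavior, one branch).
if (buffer == null || buffer.isDirect() != isDirect
    || buffer.capacity() < remaining) {
  buffer = allocateBuffer(isDirect, capacity);
}
{code}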





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568307)
Time Spent: 5.5h  (was: 5h 20m)

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
>  

[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568301&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568301
 ]

ASF GitHub Bot logged work on HDFS-15759:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 12:26
Start Date: 18/Mar/21 12:26
Worklog Time Spent: 10m 
  Work Description: touchida commented on a change in pull request #2585:
URL: https://github.com/apache/hadoop/pull/2585#discussion_r596818639



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/erasurecode/StripedBlockReconstructor.java
##
@@ -126,12 +128,26 @@ private void reconstructTargets(int toReconstructLen) 
throws IOException {
 int[] erasedIndices = stripedWriter.getRealTargetIndices();
 ByteBuffer[] outputs = 
stripedWriter.getRealTargetBuffers(toReconstructLen);
 
+if (isValidationEnabled()) {
+  markBuffers(inputs);
+  decode(inputs, erasedIndices, outputs);
+  resetBuffers(inputs);
+
+  DataNodeFaultInjector.get().badDecoding(outputs);
+  getValidator().validate(inputs, erasedIndices, outputs);

Review comment:
   @runitao Thanks for your comment!
   How about adding a metric for the exception, like 
`EcInvalidReconstructionTasks`? (I saw your deleted comment.)
   As for logging, I think it's better to output more messages throughout the 
entire EC reconstruction process, so I'd like to handle that in a separate 
issue.
   Are you suggesting anything else?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568301)
Time Spent: 5h 10m  (was: 5h)

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions, and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover it.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on the DataNode at each 
> reconstruction process.
> It verifies the correctness of outputs decoded from inputs as follows:
> 1. Decode an input back from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability, so the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568303
 ]

ASF GitHub Bot logged work on HDFS-15759:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 12:26
Start Date: 18/Mar/21 12:26
Worklog Time Spent: 10m 
  Work Description: touchida commented on a change in pull request #2585:
URL: https://github.com/apache/hadoop/pull/2585#discussion_r596820273



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestReconstructStripedFileWithValidator.java
##
@@ -0,0 +1,98 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hdfs;
+
+import org.apache.hadoop.hdfs.server.datanode.DataNodeFaultInjector;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.nio.ByteBuffer;
+import java.util.concurrent.atomic.AtomicBoolean;
+
+/**
+ * This test extends {@link TestReconstructStripedFile} to test
+ * ec reconstruction validation.
+ */
+public class TestReconstructStripedFileWithValidator
+extends TestReconstructStripedFile {
+  private static final Logger LOG =
+  LoggerFactory.getLogger(TestReconstructStripedFileWithValidator.class);
+
+  public TestReconstructStripedFileWithValidator() {
+LOG.info("run {} with validator.",
+TestReconstructStripedFileWithValidator.class.getSuperclass()
+.getSimpleName());
+  }
+
+  /**
+   * This test injects data pollution into the decoded outputs once.
+   * When validation is enabled, the first reconstruction task should fail
+   * validation, but the data will be recovered correctly
+   * by the next task.
+   * On the other hand, when validation is disabled, the first reconstruction
+   * task will succeed and then lead to data corruption.
+   */
+  @Test(timeout = 12)
+  public void testValidatorWithBadDecoding()
+  throws Exception {
+DataNodeFaultInjector oldInjector = DataNodeFaultInjector.get();
+DataNodeFaultInjector badDecodingInjector = new DataNodeFaultInjector() {
+  private final AtomicBoolean flag = new AtomicBoolean(false);
+
+  @Override
+  public void badDecoding(ByteBuffer[] outputs) {
+if (!flag.get()) {
+  for (ByteBuffer output : outputs) {
+output.mark();
+output.put((byte) (output.get(output.position()) + 1));
+output.reset();
+  }
+}
+flag.set(true);
+  }
+};
+DataNodeFaultInjector.set(badDecodingInjector);
+int fileLen =
+(getEcPolicy().getNumDataUnits() + getEcPolicy().getNumParityUnits())
+* getBlockSize() + getBlockSize() / 10;
+try {
+  assertFileBlocksReconstruction(
+  "/testValidatorWithBadDecoding",
+  fileLen,
+  ReconstructionType.DataOnly,
+  getEcPolicy().getNumParityUnits());

Review comment:
   @runitao Agree!
   I'm now considering checking the metric that I mentioned at 
https://github.com/apache/hadoop/pull/2585#discussion_r596818639.
   If you have another idea, please let me know.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568303)
Time Spent: 5h 20m  (was: 5h 10m)

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 20m
>  

[jira] [Updated] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15904:
--
Labels: pull-request-available  (was: )

> Flaky test TestBalancer#testBalancerWithSortTopNodes()
> --
>
> Key: HDFS-15904
> URL: https://issues.apache.org/jira/browse/HDFS-15904
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TestBalancer#testBalancerWithSortTopNodes fails in roughly one out of ~10 
> runs, and the failure is reproducible locally. Balancing either moves 2 
> blocks of size 100+100 bytes or 3 blocks of size 100+100+50 bytes (the 
> second case causes the flaky failures).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568267&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568267
 ]

ASF GitHub Bot logged work on HDFS-15904:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 11:34
Start Date: 18/Mar/21 11:34
Worklog Time Spent: 10m 
  Work Description: virajjasani opened a new pull request #2785:
URL: https://github.com/apache/hadoop/pull/2785


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 568267)
Remaining Estimate: 0h
Time Spent: 10m

> Flaky test TestBalancer#testBalancerWithSortTopNodes()
> --
>
> Key: HDFS-15904
> URL: https://issues.apache.org/jira/browse/HDFS-15904
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TestBalancer#testBalancerWithSortTopNodes fails in roughly one out of ~10 
> runs, and the failure is reproducible locally. Balancing either moves 2 
> blocks of size 100+100 bytes or 3 blocks of size 100+100+50 bytes (the 
> second case causes the flaky failures).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()

2021-03-18 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-15904:
---

 Summary: Flaky test TestBalancer#testBalancerWithSortTopNodes()
 Key: HDFS-15904
 URL: https://issues.apache.org/jira/browse/HDFS-15904
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 3.4.0


TestBalancer#testBalancerWithSortTopNodes fails in roughly one out of ~10 runs, 
and the failure is reproducible locally. Balancing either moves 2 blocks of 
size 100+100 bytes or 3 blocks of size 100+100+50 bytes (the second case causes 
the flaky failures).
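
One way to de-flake the assertion, sketched under the assumption that the test 
can observe the number of bytes moved (names here are illustrative, not the 
actual patch):

{code:java}
import static org.junit.Assert.assertTrue;

class BalancerAssertSketch {
  // The balancer legitimately moves either 200 bytes (100 + 100) or
  // 250 bytes (100 + 100 + 50), so the assertion should accept both outcomes.
  static void assertBytesMoved(long bytesMoved) {
    assertTrue("Unexpected bytes moved: " + bytesMoved,
        bytesMoved == 200 || bytesMoved == 250);
  }
}
{code}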



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15895) DFSAdmin#printOpenFiles has redundant String#format usage

2021-03-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304057#comment-17304057
 ] 

Viraj Jasani commented on HDFS-15895:
-

Thanks [~tasanuma]

> DFSAdmin#printOpenFiles has redundant String#format usage
> -
>
> Key: HDFS-15895
> URL: https://issues.apache.org/jira/browse/HDFS-15895
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15850) Superuser actions should be reported to external enforcers

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15850?focusedWorklogId=568217&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568217
 ]

ASF GitHub Bot logged work on HDFS-15850:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 10:04
Start Date: 18/Mar/21 10:04
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2784:
URL: https://github.com/apache/hadoop/pull/2784#issuecomment-801791798


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m  7s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 38s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  25m 51s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   5m 44s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   5m 22s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 25s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 14s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 49s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   2m 34s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m  3s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  19m 49s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 23s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  5s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 58s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   5m 58s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 28s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 19s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/2/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 3 new + 495 unchanged - 1 fixed = 
498 total (was 496)  |
   | +1 :green_heart: |  mvnsite  |   2m  5s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   2m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 11s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 41s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 373m 12s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  unit  |  23m 22s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 38s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 529m 50s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestStateAlignmentContextWithHA |
   |   | hadoop.hdfs.server.datanode.TestBlockScanner |
   |   | hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks |
   |   | hadoop.hdfs.server.namenode.TestFileTruncate |
   |   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
   |   | hadoop.hdfs.TestPersistBlocks |
   |   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   |   | hadoop.hdfs.TestDFSShell |
   |   | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots |
   |   | hadoop.hdfs.TestSnapshotCommands |
   |   | 

[jira] [Updated] (HDFS-15894) Trace Time-consuming RPC response of certain threshold.

2021-03-18 Thread Renukaprasad C (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renukaprasad C updated HDFS-15894:
--
Attachment: HDFS-15894.003.patch

> Trace Time-consuming RPC response of certain threshold.
> ---
>
> Key: HDFS-15894
> URL: https://issues.apache.org/jira/browse/HDFS-15894
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Attachments: HDFS-15894.001.patch, HDFS-15894.002.patch, 
> HDFS-15894.003.patch
>
>
> Monitor and trace time-consuming RPC requests.
> Sometimes RPC requests get delayed, which impacts system performance. 
> Currently, there is no tracking for delayed RPC requests. 
> We can log delayed RPC calls that exceed a certain threshold.
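
As a rough illustration of the idea (the threshold, names and wiring are 
assumptions, not the attached patches):

{code:java}
import java.util.function.Supplier;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: time each RPC response and warn when it exceeds a threshold.
public class SlowRpcLogger {
  private static final Logger LOG =
      LoggerFactory.getLogger(SlowRpcLogger.class);
  private final long thresholdMs;   // e.g. read from a new config key

  public SlowRpcLogger(long thresholdMs) {
    this.thresholdMs = thresholdMs;
  }

  public <T> T timed(String method, Supplier<T> call) {
    long startNs = System.nanoTime();
    try {
      return call.get();
    } finally {
      long elapsedMs = (System.nanoTime() - startNs) / 1_000_000;
      if (elapsedMs > thresholdMs) {
        LOG.warn("Slow RPC response: {} took {} ms (threshold {} ms)",
            method, elapsedMs, thresholdMs);
      }
    }
  }
}
{code}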



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14731) [FGL] Remove redundant locking on NameNode.

2021-03-18 Thread Jeffrey(Xilang) Yan (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303916#comment-17303916
 ] 

Jeffrey(Xilang) Yan commented on HDFS-14731:


Is it possible to backport this PR to Hadoop 2?

> [FGL] Remove redundant locking on NameNode.
> ---
>
> Key: HDFS-14731
> URL: https://issues.apache.org/jira/browse/HDFS-14731
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14731.001.patch
>
>
> Currently the NameNode has two global locks: FSNamesystemLock and 
> FSDirectoryLock. An analysis shows that a single FSNamesystemLock is 
> sufficient to guarantee consistency of the NameNode state, so FSDirectoryLock 
> can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303885#comment-17303885
 ] 

Akira Ajisaka commented on HDFS-15900:
--

bq. if it's ok to have several records sharing the same nameserviceId in 
activeNamespaces,

IMO, there may be multiple active NameNodes if RBF supports Observer NameNodes 
in the future, so it's okay to have several records sharing the same 
nameserviceId in activeNamespaces. However, it's not okay to have UNAVAILABLE 
NameNode registrations in activeNamespaces. (I used "we expect" because the 
source code is written that way; sorry for the confusion.)

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> ---
>
> Key: HDFS-15900
> URL: https://issues.apache.org/jira/browse/HDFS-15900
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Major
> Attachments: image.png
>
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding 
> blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter is 
> unintentionally reset to empty, its initial value.
>  !image.png|height=250!
> As a result, concat operations through dfsrouter fail with the following 
> error, as the router cannot resolve the block pool id among the recognized 
> active namespaces.
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations and to set 
> proper namespace information, obtained from available NameNode registrations, 
> when constructing the cache of active namespaces.
>  
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]
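
A sketch of that fix, simplified from MembershipStoreImpl (this is not the 
eventual patch; the types are the existing RBF store records):

{code:java}
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.hdfs.server.federation.resolver.FederationNamenodeServiceState;
import org.apache.hadoop.hdfs.server.federation.resolver.FederationNamespaceInfo;
import org.apache.hadoop.hdfs.server.federation.store.records.MembershipState;

class ActiveNamespaceCacheSketch {
  // Skip UNAVAILABLE registrations when rebuilding the active-namespace
  // cache, so their empty blockpool ids never shadow a valid registration.
  static Set<FederationNamespaceInfo> build(Iterable<MembershipState> members) {
    Set<FederationNamespaceInfo> active = new TreeSet<>();
    for (MembershipState member : members) {
      if (member.getState() == FederationNamenodeServiceState.UNAVAILABLE) {
        continue;  // stale entry: blockpool/cluster ids may be empty
      }
      active.add(new FederationNamespaceInfo(member.getBlockPoolId(),
          member.getClusterId(), member.getNameserviceId()));
    }
    return active;
  }
}
{code}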



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode

2021-03-18 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned HDFS-15900:


Assignee: Harunobu Daikoku

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> ---
>
> Key: HDFS-15900
> URL: https://issues.apache.org/jira/browse/HDFS-15900
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Harunobu Daikoku
>Assignee: Harunobu Daikoku
>Priority: Major
> Attachments: image.png
>
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding 
> blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter is 
> unintentionally reset to empty, its initial value.
>  !image.png|height=250!
> As a result, concat operations through dfsrouter fail with the following 
> error, as the router cannot resolve the block pool id among the recognized 
> active namespaces.
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations and to set 
> proper namespace information, obtained from available NameNode registrations, 
> when constructing the cache of active namespaces.
>  
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-18 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303875#comment-17303875
 ] 

Jinglun commented on HDFS-15899:


Submitted v02 to fix checkstyle. The failed unit tests are not related.

> Remove rpcThreadPool from DeadNodeDetector.
> ---
>
> Key: HDFS-15899
> URL: https://issues.apache.org/jira/browse/HDFS-15899
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15899.001.patch, HDFS-15899.002.patch
>
>
> The DeadNodeDetector uses a thread pool to issue all the probe rpc calls. The 
> purpose is to use the thread pool timeout to bound the probe time, but the 
> rpc client already has its own timeout. We can rely on the rpc client timeout 
> instead of the thread pool timeout and remove the rpcThreadPool.
> The rpcThreadPool also introduces additional complexity for probing the 
> DataNode: a probe task waiting in a busy rpcThreadPool might exceed the 
> configured timeout and be marked as failed even though it was never scheduled.
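
The simplification amounts to something like the following sketch 
(DeadNodeDetector's real probe bookkeeping and retry logic are omitted):

{code:java}
import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol;

class ProbeSketch {
  // Probe the DataNode synchronously and let the rpc client's own timeout
  // bound the call, instead of Future.get(timeout) on a shared thread pool.
  static boolean probe(ClientDatanodeProtocol proxy) {
    try {
      proxy.getDatanodeInfo();  // throws after the rpc client timeout expires
      return true;              // the DataNode answered: alive
    } catch (IOException e) {
      return false;             // timed out or unreachable: suspect dead
    }
  }
}
{code}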



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-18 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15899:
---
Attachment: HDFS-15899.002.patch

> Remove rpcThreadPool from DeadNodeDetector.
> ---
>
> Key: HDFS-15899
> URL: https://issues.apache.org/jira/browse/HDFS-15899
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15899.001.patch, HDFS-15899.002.patch
>
>
> The DeadNodeDetector uses a thread pool to issue all the probe rpc calls. The 
> purpose is to use the thread pool timeout to bound the probe time, but the 
> rpc client already has its own timeout. We can rely on the rpc client timeout 
> instead of the thread pool timeout and remove the rpcThreadPool.
> The rpcThreadPool also introduces additional complexity for probing the 
> DataNode: a probe task waiting in a busy rpcThreadPool might exceed the 
> configured timeout and be marked as failed even though it was never scheduled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15850) Superuser actions should be reported to external enforcers

2021-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15850?focusedWorklogId=568100&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568100
 ]

ASF GitHub Bot logged work on HDFS-15850:
-

Author: ASF GitHub Bot
Created on: 18/Mar/21 06:12
Start Date: 18/Mar/21 06:12
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2784:
URL: https://github.com/apache/hadoop/pull/2784#issuecomment-801659165


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 52s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 57s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m  3s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   5m 22s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   4m 37s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 16s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 58s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   2m 16s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   4m 25s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m 48s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 21s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 44s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m  0s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   5m  0s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   4m 34s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   4m 34s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m  9s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/1/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 2 new + 317 unchanged - 1 fixed = 
319 total (was 318)  |
   | +1 :green_heart: |  mvnsite  |   1m 47s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 16s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   2m  4s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   4m 38s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  16m 50s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 371m 21s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | -1 :x: |  unit  |  25m 48s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 47s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 515m 17s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestBlockScanner |
   |   | hadoop.fs.viewfs.TestViewFileSystemOverloadSchemeWithHdfsScheme |
   |   | hadoop.hdfs.TestViewDistributedFileSystemWithMountLinks |
   |   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
   |   | hadoop.hdfs.TestPersistBlocks |
   |   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   |   | hadoop.hdfs.TestDFSShell |
   |   |