[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
[ https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304646#comment-17304646 ] Wei-Chiu Chuang commented on HDFS-15901: We have some users running 1000+ node scale clusters but I don't watch the clusters every day. I am honestly not the best person for opinions when it comes to extreme scale clusters. [~hexiaoqiao] or [~ferhui] may have better ideas. > Solve the problem of DN repeated block reports occupying too many RPCs during > Safemode > -- > > Key: HDFS-15901 > URL: https://issues.apache.org/jira/browse/HDFS-15901 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When the cluster exceeds thousands of nodes, we want to restart the NameNode > service, and all DataNodes send a full Block action to the NameNode. During > SafeMode, some DataNodes may send blocks to NameNode multiple times, which > will take up too much RPC. In fact, this is unnecessary. > In this case, some block report leases will fail or time out, and in extreme > cases, the NameNode will always stay in Safe Mode. 
> 2021-03-14 08:16:25,873 [78438700] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(:port, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-14 08:16:31,521 [78444348] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-13 18:35:38,200 [29191027] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending set. > 2021-03-13 18:36:08,143 [29220970] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending set. > 2021-03-13 18:36:08,143 [29220970] - WARN [Block report > processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for > DN , because the lease has expired. > 2021-03-13 18:36:08,145 [29220972] - WARN [Block report > processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for > DN , because the lease has expired. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
[ https://issues.apache.org/jira/browse/HDFS-15901?focusedWorklogId=568792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568792 ] ASF GitHub Bot logged work on HDFS-15901: - Author: ASF GitHub Bot Created on: 19/Mar/21 05:39 Start Date: 19/Mar/21 05:39 Worklog Time Spent: 10m Work Description: jojochuang commented on a change in pull request #2782: URL: https://github.com/apache/hadoop/pull/2782#discussion_r597418628
## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
## @@ -2603,6 +2603,24 @@ public long requestBlockReportLeaseId(DatanodeRegistration nodeReg) {
       LOG.warn("Failed to find datanode {}", nodeReg);
       return 0;
     }
+
+    // During safemode, DataNodes are only allowed to report all data once.
+    if (namesystem.isInStartupSafeMode()) {
+      boolean allReported = true;
+      for (DatanodeStorageInfo storageInfo : node.getStorageInfos()) {
+        if (storageInfo.getBlockReportCount() < 1) {
+          allReported = false;
+          break;
+        }
+      }
+
+      if (allReported) {
+        LOG.info("The DataNode has reported all blocks and does not need " +
Review comment: nit: printing the datanode id/ip/address would be helpful here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568792) Time Spent: 0.5h (was: 20m) > Solve the problem of DN repeated block reports occupying too many RPCs during > Safemode > -- > > Key: HDFS-15901 > URL: https://issues.apache.org/jira/browse/HDFS-15901 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When the cluster exceeds thousands of nodes, we want to restart the NameNode > service, and all DataNodes send a full Block action to the NameNode. During > SafeMode, some DataNodes may send blocks to NameNode multiple times, which > will take up too much RPC. In fact, this is unnecessary. > In this case, some block report leases will fail or time out, and in extreme > cases, the NameNode will always stay in Safe Mode. > 2021-03-14 08:16:25,873 [78438700] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(:port, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-14 08:16:31,521 [78444348] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-13 18:35:38,200 [29191027] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending set. > 2021-03-13 18:36:08,143 [29220970] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending set. 
> 2021-03-13 18:36:08,143 [29220970] - WARN [Block report > processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for > DN , because the lease has expired. > 2021-03-13 18:36:08,145 [29220972] - WARN [Block report > processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for > DN , because the lease has expired. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
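The review thread above centers on the new safemode gate in requestBlockReportLeaseId. As a minimal standalone sketch of that logic (the class and method names below are illustrative, not Hadoop's actual API), the idea is: during startup safemode, refuse a new block report lease once every storage on the DataNode has completed at least one full block report (FBR), so the DN cannot resend its FBR and consume NameNode RPC handlers.

```java
// Sketch of the safemode gating logic from the HDFS-15901 patch, reduced to
// plain Java so it can run standalone. Storage state is modeled as an array of
// per-storage block report counts; all names here are hypothetical.
class SafeModeFbrCheck {
    // A lease id of 0 means "no lease granted", mirroring requestBlockReportLeaseId.
    static final long NO_LEASE = 0L;

    // True when every storage on the DataNode has completed at least one FBR.
    static boolean allStoragesReported(int[] blockReportCounts) {
        for (int count : blockReportCounts) {
            if (count < 1) {
                return false;
            }
        }
        return true;
    }

    // During startup safemode, refuse a new lease for a fully reported node so
    // the DN does not resend its full block report and burn NameNode RPC slots.
    static long requestLease(boolean inStartupSafeMode,
                             int[] blockReportCounts,
                             long nextLeaseId) {
        if (inStartupSafeMode && allStoragesReported(blockReportCounts)) {
            return NO_LEASE;
        }
        return nextLeaseId;
    }

    public static void main(String[] args) {
        // A DN with two storages, both already reported, gets no new lease in safemode.
        System.out.println(requestLease(true, new int[]{1, 2}, 42L));  // 0
        // The same DN outside safemode is granted a lease as usual.
        System.out.println(requestLease(false, new int[]{1, 2}, 42L)); // 42
    }
}
```

In the real patch the per-storage count comes from DatanodeStorageInfo.getBlockReportCount() and the lease id from the BlockReportLeaseManager; here both are reduced to plain values so the control flow stays visible.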
[jira] [Work logged] (HDFS-15879) Exclude slow nodes when choose targets for blocks
[ https://issues.apache.org/jira/browse/HDFS-15879?focusedWorklogId=568790=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568790 ] ASF GitHub Bot logged work on HDFS-15879: - Author: ASF GitHub Bot Created on: 19/Mar/21 05:35 Start Date: 19/Mar/21 05:35 Worklog Time Spent: 10m Work Description: tasanuma commented on a change in pull request #2748: URL: https://github.com/apache/hadoop/pull/2748#discussion_r597417442 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestReplicationPolicyExcludeSlowNodes.java ## @@ -0,0 +1,125 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.hadoop.hdfs.server.blockmanagement;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hdfs.DFSConfigKeys;
+import org.apache.hadoop.hdfs.DFSTestUtil;
+import org.apache.hadoop.hdfs.TestBlockStoragePolicy;
+import org.apache.hadoop.hdfs.server.namenode.NameNode;
+import org.apache.hadoop.net.Node;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Set;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+@RunWith(Parameterized.class)
+public class TestReplicationPolicyExcludeSlowNodes
+    extends BaseReplicationPolicyTest {
+
+  public TestReplicationPolicyExcludeSlowNodes(String blockPlacementPolicy) {
+    this.blockPlacementPolicy = blockPlacementPolicy;
+  }
+
+  @Parameterized.Parameters
+  public static Iterable data() {
+    return Arrays.asList(new Object[][] {
+        { BlockPlacementPolicyDefault.class.getName() },
+        { BlockPlacementPolicyWithUpgradeDomain.class.getName() } });
+  }
+
+  @Override
+  DatanodeDescriptor[] getDatanodeDescriptors(Configuration conf) {
+    conf.setBoolean(DFSConfigKeys
+        .DFS_DATANODE_PEER_STATS_ENABLED_KEY,
+        true);
+    conf.setStrings(DFSConfigKeys
+        .DFS_NAMENODE_SLOWPEER_COLLECT_INTERVAL_KEY,
+        "1s");
+    conf.setBoolean(DFSConfigKeys
+        .DFS_NAMENODE_BLOCKPLACEMENTPOLICY_EXCLUDE_SLOW_NODES_ENABLED_KEY,
+        true);
+    final String[] racks = {
+        "/rack1",
+        "/rack1",
+        "/rack2",
+        "/rack2",
+        "/rack3",
+        "/rack3"};
+    storages = DFSTestUtil.createDatanodeStorageInfos(racks);
+    return DFSTestUtil.toDatanodeDescriptor(storages);
+  }
+
+  /**
+   * Tests that chooseTarget when excludeSlowNodesEnabled set to true
+   */
+  @Test
+  public void testChooseTargetExcludeSlowNodes() throws IOException {
+    namenode.getNamesystem().writeLock();
+    try {
+      // add nodes
+      for (int i = 0; i < dataNodes.length; i++) {
+        dnManager.addDatanode(dataNodes[i]);
+      }
+
+      // mock slow nodes
+      SlowPeerTracker tracker = dnManager.getSlowPeerTracker();
+      tracker.addReport(dataNodes[0].getInfoAddr(), dataNodes[3].getInfoAddr());
+      tracker.addReport(dataNodes[0].getInfoAddr(), dataNodes[4].getInfoAddr());
+      tracker.addReport(dataNodes[1].getInfoAddr(), dataNodes[4].getInfoAddr());
+      tracker.addReport(dataNodes[1].getInfoAddr(), dataNodes[5].getInfoAddr());
+      tracker.addReport(dataNodes[2].getInfoAddr(), dataNodes[3].getInfoAddr());
+      tracker.addReport(dataNodes[2].getInfoAddr(), dataNodes[5].getInfoAddr());
+
+      // fetch slow nodes
+      Set slowPeers = dnManager.getSlowPeers();
+
+      // assert slow nodes
+      assertEquals(3, slowPeers.size());
+      for (int i = 0; i < slowPeers.size(); i++) {
+        assertTrue(slowPeers.contains(dataNodes[i]));
+      }
+
+      // mock writer
+      DatanodeDescriptor writerDn = dataNodes[0];
+
+      // Call chooseTarget()
+      DatanodeStorageInfo[] targets = namenode.getNamesystem().getBlockManager()
+          .getBlockPlacementPolicy().chooseTarget("testFile.txt", 3,
+              writerDn, new ArrayList(), false, null,
+              1024, TestBlockStoragePolicy.DEFAULT_STORAGE_POLICY, null);
+
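The (truncated) test above drives DatanodeManager's SlowPeerTracker. As a rough standalone sketch of what such a tracker boils down to (the class and method names here are hypothetical stand-ins, not Hadoop's API), it is a map from the node reported as slow to the set of peers that reported it:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal slow-peer bookkeeping sketch: who was reported slow, and by whom.
class SlowPeerSketch {
    // slow node -> set of distinct nodes that reported it as slow
    private final Map<String, Set<String>> reports = new HashMap<>();

    void addReport(String slowNode, String reportingNode) {
        reports.computeIfAbsent(slowNode, k -> new HashSet<>()).add(reportingNode);
    }

    // Nodes reported slow by at least minReporters distinct peers.
    Set<String> getSlowPeers(int minReporters) {
        Set<String> slow = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : reports.entrySet()) {
            if (e.getValue().size() >= minReporters) {
                slow.add(e.getKey());
            }
        }
        return slow;
    }
}
```

Mirroring the test's report pattern — dataNodes[0..2] each reported slow by two distinct peers — yields exactly three slow peers, matching the `assertEquals(3, slowPeers.size())` assertion above.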
[jira] [Commented] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304642#comment-17304642 ] Mingliang Liu commented on HDFS-15904: -- Approved the PR and left some minor comments. Not sure about HBase, but in Hadoop, before merging we only need to set target versions for a JIRA. When committing, the committer will set the "Fixed Versions" to indicate which branch this patch eventually goes into. Thanks, > Flaky test TestBalancer#testBalancerWithSortTopNodes() > -- > > Key: HDFS-15904 > URL: https://issues.apache.org/jira/browse/HDFS-15904 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > TestBalancer#testBalancerWithSortTopNodes shows some flakes in around ~10 > runs or so. It's reproducible locally also. Basically, balancing either moves > 2 blocks of size 100+100 bytes or it moves 3 blocks of size 100+100+50 bytes > (2nd case causes flakies). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304642#comment-17304642 ] Mingliang Liu edited comment on HDFS-15904 at 3/19/21, 5:11 AM: Approved the PR and left some minor comments. Not sure about HBase, but in Hadoop, before merging we only need to set target versions for a JIRA. When committing, the committer will set the "Fixed Versions" to indicate which branch this patch eventually goes into. Thanks, was (Author: liuml07): Approved PR and left some minor message. Not sure about HBase, but in Hadoop, before merging we only need to set target versions for a JIRA. When committing, the commuter will set the "Fixed Versions" to indicate which branch this patch eventually goes into. Thanks, > Flaky test TestBalancer#testBalancerWithSortTopNodes() > -- > > Key: HDFS-15904 > URL: https://issues.apache.org/jira/browse/HDFS-15904 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > TestBalancer#testBalancerWithSortTopNodes shows some flakes in around ~10 > runs or so. It's reproducible locally also. Basically, balancing either moves > 2 blocks of size 100+100 bytes or it moves 3 blocks of size 100+100+50 bytes > (2nd case causes flakies). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-15904: - Fix Version/s: (was: 3.4.0) Status: Patch Available (was: Open) > Flaky test TestBalancer#testBalancerWithSortTopNodes() > -- > > Key: HDFS-15904 > URL: https://issues.apache.org/jira/browse/HDFS-15904 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > TestBalancer#testBalancerWithSortTopNodes shows some flakes in around ~10 > runs or so. It's reproducible locally also. Basically, balancing either moves > 2 blocks of size 100+100 bytes or it moves 3 blocks of size 100+100+50 bytes > (2nd case causes flakies). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568785&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568785 ] ASF GitHub Bot logged work on HDFS-15904: - Author: ASF GitHub Bot Created on: 19/Mar/21 05:09 Start Date: 19/Mar/21 05:09 Worklog Time Spent: 10m Work Description: liuml07 commented on a change in pull request #2785: URL: https://github.com/apache/hadoop/pull/2785#discussion_r597410126
## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer.java
## @@ -2297,7 +2297,9 @@ public void testBalancerWithSortTopNodes() throws Exception {
       maxUsage = Math.max(maxUsage, datanodeReport[i].getDfsUsed());
     }
-    assertEquals(200, balancerResult.bytesAlreadyMoved);
+    // Either 2 blocks of 100+100 bytes or 3 blocks of 100+100+50 bytes
Review comment: Could you add some explanation of why this happens? The 95%-usage DN will have 9 blocks of 100 bytes and 1 block of 50 bytes, all for the same file. The HDFS balancer will choose a block to move from this node randomly; more likely it will be a 100B block. Since that is greater than `DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY`, which is 99L (see above settings), it will stop there. Total bytes moved from this 95% DN will be 1 block and hence 100B. However, chances are the first block to move from this 95% DN is the 50B block. After that block is moved, the total moved size of 50B is smaller than `DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY`, so it will try to move another block. The second block will always be 100 bytes, so total bytes moved from this 95% DN will be 2 blocks and hence 150B (100B + 50B). Please rephrase this as a comment before the assertion so readers have the context without having to rederive it. Thanks, -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568785) Time Spent: 50m (was: 40m) > Flaky test TestBalancer#testBalancerWithSortTopNodes() > -- > > Key: HDFS-15904 > URL: https://issues.apache.org/jira/browse/HDFS-15904 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h > > TestBalancer#testBalancerWithSortTopNodes shows some flakes in around ~10 > runs or so. It's reproducible locally also. Basically, balancing either moves > 2 blocks of size 100+100 bytes or it moves 3 blocks of size 100+100+50 bytes > (2nd case causes flakies). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
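The scenario liuml07 describes can be simulated directly. Below is a hedged sketch (not the balancer's actual code) of the random source-block selection: nine 100-byte blocks plus one 50-byte block, moved one at a time until the cumulative size reaches the 99-byte max-size-to-move. The only possible outcomes for the over-utilized DN are 100B (one block) or 150B (50B + 100B), which is exactly the two totals the relaxed assertion now accepts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of why TestBalancer#testBalancerWithSortTopNodes is bimodal:
// random block choice under a 99-byte max-size-to-move limit.
class BalancerMoveSketch {
    // Blocks on the over-utilized DN: nine 100-byte blocks and one 50-byte block.
    static List<Long> sourceBlocks() {
        List<Long> blocks = new ArrayList<>();
        for (int i = 0; i < 9; i++) {
            blocks.add(100L);
        }
        blocks.add(50L);
        return blocks;
    }

    // Move randomly chosen blocks until the cumulative moved size reaches
    // the max-size-to-move limit; the seed makes a run reproducible.
    static long bytesMoved(long maxSizeToMove, long seed) {
        List<Long> blocks = sourceBlocks();
        Collections.shuffle(blocks, new Random(seed));
        long moved = 0;
        for (long size : blocks) {
            if (moved >= maxSizeToMove) {
                break; // limit reached: stop scheduling moves from this DN
            }
            moved += size;
        }
        return moved;
    }

    public static void main(String[] args) {
        for (long seed = 0; seed < 5; seed++) {
            System.out.println("seed " + seed + ": moved " + bytesMoved(99L, seed) + " bytes");
        }
    }
}
```

If the first random pick is a 100B block, the limit is already exceeded and the run stops at 100B; if it is the 50B block (a 1-in-10 chance), one more 100B block is moved for 150B, which is the flaky case.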
[jira] [Commented] (HDFS-15894) Trace Time-consuming RPC response of certain threshold.
[ https://issues.apache.org/jira/browse/HDFS-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304633#comment-17304633 ] Renukaprasad C commented on HDFS-15894: --- Uploaded patch HDFS-15894.003.patch with all the static check fixes. The failed tests are not related to the code changes. [~surendralilhore], can you please help review the changes? > Trace Time-consuming RPC response of certain threshold. > --- > > Key: HDFS-15894 > URL: https://issues.apache.org/jira/browse/HDFS-15894 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Renukaprasad C >Assignee: Renukaprasad C >Priority: Major > Attachments: HDFS-15894.001.patch, HDFS-15894.002.patch, > HDFS-15894.003.patch > > > Monitor & Trace Time-consuming RPC requests. > Sometimes RPC Requests gets delayed, which impacts the system performance. > Currently, there is no track for delayed RPC request. > We can log such delayed RPC calls which exceeds certain threshold. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
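The feature proposed in HDFS-15894 — log RPC responses whose latency exceeds a configured threshold — can be sketched as follows. This is an illustrative standalone class, not the API the patch actually adds to the Hadoop RPC server.

```java
// Sketch of threshold-based slow-RPC logging (HDFS-15894 idea).
// All names are hypothetical; timing in the real server would come from the
// RPC processing path, not from caller-supplied timestamps.
class SlowRpcLogger {
    private final long thresholdMs;

    SlowRpcLogger(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    // Returns the warning line to emit, or null when the call was fast enough.
    String check(String method, long startMs, long endMs) {
        long elapsed = endMs - startMs;
        if (elapsed > thresholdMs) {
            return String.format(
                "Slow RPC: %s took %d ms (threshold %d ms)",
                method, elapsed, thresholdMs);
        }
        return null;
    }
}
```

A 300 ms threshold, for example, would flag a 500 ms getBlockLocations call while letting a 100 ms call pass silently, so only the outliers reach the log.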
[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568765=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568765 ] ASF GitHub Bot logged work on HDFS-15904: - Author: ASF GitHub Bot Created on: 19/Mar/21 03:55 Start Date: 19/Mar/21 03:55 Worklog Time Spent: 10m Work Description: virajjasani commented on pull request #2785: URL: https://github.com/apache/hadoop/pull/2785#issuecomment-802532273 Could you please take a look @liuml07 @tasanuma ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568765) Time Spent: 40m (was: 0.5h) > Flaky test TestBalancer#testBalancerWithSortTopNodes() > -- > > Key: HDFS-15904 > URL: https://issues.apache.org/jira/browse/HDFS-15904 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > TestBalancer#testBalancerWithSortTopNodes shows some flakes in around ~10 > runs or so. It's reproducible locally also. Basically, balancing either moves > 2 blocks of size 100+100 bytes or it moves 3 blocks of size 100+100+50 bytes > (2nd case causes flakies). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?focusedWorklogId=568755&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568755 ] ASF GitHub Bot logged work on HDFS-15900: - Author: ASF GitHub Bot Created on: 19/Mar/21 03:40 Start Date: 19/Mar/21 03:40 Worklog Time Spent: 10m Work Description: goiri commented on a change in pull request #2787: URL: https://github.com/apache/hadoop/pull/2787#discussion_r597386195
## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/resolver/FederationNamespaceInfo.java
## @@ -75,4 +76,27 @@ public String getBlockPoolId() {
   public String toString() {
     return this.nameserviceId + "->" + this.blockPoolId + ":" + this.clusterId;
   }
-}
\ No newline at end of file
+
+  @Override
+  public boolean equals(Object obj) {
+    if (this == obj) {
+      return true;
+    }
+    if (obj instanceof FederationNamespaceInfo) {
Review comment: There is also EqualsBuilder in commons.
## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java
## @@ -213,12 +213,14 @@ public boolean loadCache(boolean force) throws IOException {
         nnRegistrations.put(nnId, nnRegistration);
       }
       nnRegistration.add(membership);
-      String bpId = membership.getBlockPoolId();
-      String cId = membership.getClusterId();
-      String nsId = membership.getNameserviceId();
-      FederationNamespaceInfo nsInfo =
-          new FederationNamespaceInfo(bpId, cId, nsId);
-      this.activeNamespaces.add(nsInfo);
+      if (membership.getState() != FederationNamenodeServiceState.UNAVAILABLE) {
+        String bpId = membership.getBlockPoolId();
Review comment: Is there any test we can do for this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568755) Time Spent: 40m (was: 0.5h) > RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode > --- > > Key: HDFS-15900 > URL: https://issues.apache.org/jira/browse/HDFS-15900 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Harunobu Daikoku >Assignee: Harunobu Daikoku >Priority: Major > Labels: pull-request-available > Attachments: image.png > > Time Spent: 40m > Remaining Estimate: 0h > > We observed that when a NameNode becomes UNAVAILABLE, the corresponding > blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter > unintentionally sets to empty, its initial value. > !image.png|height=250! > As a result of this, concat operations through dfsrouter fail with the > following error as it cannot resolve the block id in the recognized active > namespaces. > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): > Cannot locate a nameservice for block pool BP-... > {noformat} > A possible fix is to ignore UNAVAILABLE NameNode registrations, and set > proper namespace information obtained from available NameNode registrations > when constructing the cache of active namespaces. > > [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
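The fix under review can be sketched outside of Hadoop. The sketch below (types and names are hypothetical stand-ins for MembershipState and FederationNamespaceInfo) shows the core idea: skip UNAVAILABLE registrations when rebuilding the active-namespace cache, so a stale entry with an empty block pool id never reaches it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the HDFS-15900 fix: ignore UNAVAILABLE NameNode registrations
// when constructing the router's active-namespace cache.
class ActiveNamespaceFilter {
    enum State { ACTIVE, STANDBY, UNAVAILABLE }

    static class Membership {
        final String nameserviceId;
        final String blockPoolId;
        final State state;

        Membership(String nameserviceId, String blockPoolId, State state) {
            this.nameserviceId = nameserviceId;
            this.blockPoolId = blockPoolId;
            this.state = state;
        }
    }

    // Keep only namespaces backed by a reachable NameNode. An UNAVAILABLE
    // registration carries an empty (default-initialized) block pool id and
    // must not clobber the cache used to resolve "BP-..." to a nameservice.
    static List<String> activeBlockPools(List<Membership> registrations) {
        List<String> pools = new ArrayList<>();
        for (Membership m : registrations) {
            if (m.state != State.UNAVAILABLE) {
                pools.add(m.blockPoolId);
            }
        }
        return pools;
    }
}
```

With this filter in place, a concat routed by block pool id keeps resolving against the ACTIVE registration's pool id even while a sibling NameNode registration is UNAVAILABLE.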
[jira] [Updated] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-15900: --- Status: Patch Available (was: Open) > RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode > --- > > Key: HDFS-15900 > URL: https://issues.apache.org/jira/browse/HDFS-15900 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Harunobu Daikoku >Assignee: Harunobu Daikoku >Priority: Major > Labels: pull-request-available > Attachments: image.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > We observed that when a NameNode becomes UNAVAILABLE, the corresponding > blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter > unintentionally sets to empty, its initial value. > !image.png|height=250! > As a result of this, concat operations through dfsrouter fail with the > following error as it cannot resolve the block id in the recognized active > namespaces. > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): > Cannot locate a nameservice for block pool BP-... > {noformat} > A possible fix is to ignore UNAVAILABLE NameNode registrations, and set > proper namespace information obtained from available NameNode registrations > when constructing the cache of active namespaces. > > [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?focusedWorklogId=568744=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568744 ] ASF GitHub Bot logged work on HDFS-15900: - Author: ASF GitHub Bot Created on: 19/Mar/21 03:15 Start Date: 19/Mar/21 03:15 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2787: URL: https://github.com/apache/hadoop/pull/2787#issuecomment-802513718 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 37s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 32m 35s | | trunk passed | | +1 :green_heart: | compile | 0m 42s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 0m 36s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 28s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 42s | | trunk passed | | +1 :green_heart: | javadoc | 0m 42s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 1m 17s | | trunk passed | | +1 :green_heart: | shadedclient | 14m 13s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 33s | | the patch passed | | +1 :green_heart: | compile | 0m 33s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 0m 33s | | the patch passed | | +1 :green_heart: | compile | 0m 29s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 0m 29s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 18s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2787/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) | | +1 :green_heart: | mvnsite | 0m 31s | | the patch passed | | +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 1m 15s | | the patch passed | | +1 :green_heart: | shadedclient | 14m 5s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 17m 37s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2787/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch passed. | | +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. 
| | | | 91m 38s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.federation.router.TestRouterRpcMultiDestination | | | hadoop.hdfs.server.federation.router.TestRouterRpc | | | hadoop.hdfs.server.federation.resolver.TestFederationNamespaceInfo | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2787/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2787 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell | | uname | Linux 52e780355c42 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 8983b9f309e260d69c4fa932e38aaf418da515c2 | | Default
[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
[ https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304607#comment-17304607 ] JiangHua Zhu commented on HDFS-15901: - [~kihwal], thank you very much for your message. I think the FBR lease mechanism is still needed, because it reduces the pressure on the NN. It is just that while the NN is restarting, once a DN has completed its FBR, the NN should not allow that DN to perform a new FBR. [~weichiu], do you have any other thoughts? > Solve the problem of DN repeated block reports occupying too many RPCs during > Safemode > -- > > Key: HDFS-15901 > URL: https://issues.apache.org/jira/browse/HDFS-15901 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > When the cluster exceeds thousands of nodes, we want to restart the NameNode > service, and all DataNodes send a full Block action to the NameNode. During > SafeMode, some DataNodes may send blocks to NameNode multiple times, which > will take up too much RPC. In fact, this is unnecessary. > In this case, some block report leases will fail or time out, and in extreme > cases, the NameNode will always stay in Safe Mode. 
> 2021-03-14 08:16:25,873 [78438700] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(:port, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-14 08:16:31,521 [78444348] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-13 18:35:38,200 [29191027] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending set. > 2021-03-13 18:36:08,143 [29220970] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending set. > 2021-03-13 18:36:08,143 [29220970] - WARN [Block report > processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for > DN , because the lease has expired. > 2021-03-13 18:36:08,145 [29220972] - WARN [Block report > processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for > DN , because the lease has expired. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
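The guard JiangHua proposes (process at most one FBR per DataNode while the NameNode is still in startup safe mode) can be sketched as below. This is an illustrative sketch only, not the actual `BlockReportLeaseManager` or `BlockManager` API; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of a startup-time FBR guard: during NameNode
 * startup, accept at most one full block report per DataNode, so
 * retransmitted reports do not occupy extra RPC handler time.
 */
public class StartupFbrGuard {
    private final Set<String> reportedDuringStartup = new HashSet<>();
    private boolean inStartupSafeMode = true;

    /** Returns true if this DN's full block report should be processed. */
    boolean shouldProcess(String datanodeUuid) {
        if (!inStartupSafeMode) {
            return true;                        // normal operation: no guard
        }
        // Set.add() returns false when the DN already completed an FBR,
        // so repeated startup reports are rejected with a cheap lookup.
        return reportedDuringStartup.add(datanodeUuid);
    }

    public static void main(String[] args) {
        StartupFbrGuard guard = new StartupFbrGuard();
        System.out.println(guard.shouldProcess("dn-1")); // first report
        System.out.println(guard.shouldProcess("dn-1")); // duplicate report
    }
}
```

In a real implementation the set would need to be cleared when the NameNode leaves safe mode, and re-registration of a DN would have to invalidate its entry.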
[jira] [Commented] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304580#comment-17304580 ] Harunobu Daikoku commented on HDFS-15900: - Thanks for the explanation. I have made the two fixes above and submitted the patch. > RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode > --- > > Key: HDFS-15900 > URL: https://issues.apache.org/jira/browse/HDFS-15900 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Harunobu Daikoku >Assignee: Harunobu Daikoku >Priority: Major > Labels: pull-request-available > Attachments: image.png > > Time Spent: 20m > Remaining Estimate: 0h > > We observed that when a NameNode becomes UNAVAILABLE, the corresponding > blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter > unintentionally sets to empty, its initial value. > !image.png|height=250! > As a result of this, concat operations through dfsrouter fail with the > following error as it cannot resolve the block id in the recognized active > namespaces. > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): > Cannot locate a nameservice for block pool BP-... > {noformat} > A possible fix is to ignore UNAVAILABLE NameNode registrations, and set > proper namespace information obtained from available NameNode registrations > when constructing the cache of active namespaces. > > [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]
[jira] [Work logged] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?focusedWorklogId=568728=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568728 ] ASF GitHub Bot logged work on HDFS-15900: - Author: ASF GitHub Bot Created on: 19/Mar/21 01:46 Start Date: 19/Mar/21 01:46 Worklog Time Spent: 10m Work Description: hdaikoku commented on a change in pull request #2787: URL: https://github.com/apache/hadoop/pull/2787#discussion_r597352892 ## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/resolver/TestFederationNamespaceInfo.java ## @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hdfs.server.federation.resolver; + +import org.junit.Test; + +import java.util.Set; +import java.util.TreeSet; + +import static org.assertj.core.api.Assertions.assertThat; + +public class TestFederationNamespaceInfo { + /** + * Regression test for HDFS-15900. Review comment: Provided courtesy of @aajisaka -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568728) Time Spent: 20m (was: 10m) > RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode > --- > > Key: HDFS-15900 > URL: https://issues.apache.org/jira/browse/HDFS-15900 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Harunobu Daikoku >Assignee: Harunobu Daikoku >Priority: Major > Labels: pull-request-available > Attachments: image.png > > Time Spent: 20m > Remaining Estimate: 0h > > We observed that when a NameNode becomes UNAVAILABLE, the corresponding > blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter > unintentionally sets to empty, its initial value. > !image.png|height=250! > As a result of this, concat operations through dfsrouter fail with the > following error as it cannot resolve the block id in the recognized active > namespaces. > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): > Cannot locate a nameservice for block pool BP-... > {noformat} > A possible fix is to ignore UNAVAILABLE NameNode registrations, and set > proper namespace information obtained from available NameNode registrations > when constructing the cache of active namespaces. > > [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?focusedWorklogId=568726=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568726 ] ASF GitHub Bot logged work on HDFS-15900: - Author: ASF GitHub Bot Created on: 19/Mar/21 01:42 Start Date: 19/Mar/21 01:42 Worklog Time Spent: 10m Work Description: hdaikoku opened a new pull request #2787: URL: https://github.com/apache/hadoop/pull/2787 https://issues.apache.org/jira/browse/HDFS-15900 ## NOTICE Please create an issue in ASF JIRA before opening a pull request, and you need to set the title of the pull request which starts with the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.) For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568726) Remaining Estimate: 0h Time Spent: 10m > RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode > --- > > Key: HDFS-15900 > URL: https://issues.apache.org/jira/browse/HDFS-15900 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Harunobu Daikoku >Assignee: Harunobu Daikoku >Priority: Major > Attachments: image.png > > Time Spent: 10m > Remaining Estimate: 0h > > We observed that when a NameNode becomes UNAVAILABLE, the corresponding > blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter > unintentionally sets to empty, its initial value. > !image.png|height=250! > As a result of this, concat operations through dfsrouter fail with the > following error as it cannot resolve the block id in the recognized active > namespaces. 
> {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): > Cannot locate a nameservice for block pool BP-... > {noformat} > A possible fix is to ignore UNAVAILABLE NameNode registrations, and set > proper namespace information obtained from available NameNode registrations > when constructing the cache of active namespaces. > > [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]
[jira] [Updated] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-15900: -- Labels: pull-request-available (was: ) > RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode > --- > > Key: HDFS-15900 > URL: https://issues.apache.org/jira/browse/HDFS-15900 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Harunobu Daikoku >Assignee: Harunobu Daikoku >Priority: Major > Labels: pull-request-available > Attachments: image.png > > Time Spent: 10m > Remaining Estimate: 0h > > We observed that when a NameNode becomes UNAVAILABLE, the corresponding > blockpool id in MembershipStoreImpl#activeNamespaces on dfsrouter > unintentionally sets to empty, its initial value. > !image.png|height=250! > As a result of this, concat operations through dfsrouter fail with the > following error as it cannot resolve the block id in the recognized active > namespaces. > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): > Cannot locate a nameservice for block pool BP-... > {noformat} > A possible fix is to ignore UNAVAILABLE NameNode registrations, and set > proper namespace information obtained from available NameNode registrations > when constructing the cache of active namespaces. > > [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568723=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568723 ] ASF GitHub Bot logged work on HDFS-15759: - Author: ASF GitHub Bot Created on: 19/Mar/21 01:39 Start Date: 19/Mar/21 01:39 Worklog Time Spent: 10m Work Description: runitao commented on a change in pull request #2585: URL: https://github.com/apache/hadoop/pull/2585#discussion_r597350941 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/erasurecode/StripedBlockReconstructor.java ## @@ -126,12 +128,26 @@ private void reconstructTargets(int toReconstructLen) throws IOException { int[] erasedIndices = stripedWriter.getRealTargetIndices(); ByteBuffer[] outputs = stripedWriter.getRealTargetBuffers(toReconstructLen); +if (isValidationEnabled()) { + markBuffers(inputs); + decode(inputs, erasedIndices, outputs); + resetBuffers(inputs); + + DataNodeFaultInjector.get().badDecoding(outputs); + getValidator().validate(inputs, erasedIndices, outputs); Review comment: +1,I have no other suggestion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568723) Time Spent: 5h 40m (was: 5.5h) > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Time Spent: 5h 40m > Remaining Estimate: 0h > > EC reconstruction on DataNode has caused data corruption: HDFS-14768, > HDFS-15186 and HDFS-15240. 
Those issues occur under specific conditions and > the corruption is neither detected nor auto-healed by HDFS. It is obviously > hard for users to monitor data integrity by themselves, and even if they find > corrupted data, it is difficult or sometimes impossible to recover them. > To prevent further data corruption issues, this feature proposes a simple and > effective way to verify EC reconstruction correctness on DataNode at each > reconstruction process. > It verifies correctness of outputs decoded from inputs as follows: > 1. Decoding an input with the outputs; > 2. Compare the decoded input with the original input. > For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs > [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from > [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0. > When an EC reconstruction task goes wrong, the comparison will fail with high > probability. > Then the task will also fail and be retried by NameNode. > The next reconstruction will succeed if the condition triggered the failure > is gone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
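The validate-by-redecoding idea described above (decode one of the inputs back from the freshly decoded outputs and compare it with the original) can be illustrated with a toy single-parity scheme. Note this sketch uses plain XOR parity instead of Hadoop's real Reed-Solomon `RawErasureDecoder`, and none of the names below are Hadoop APIs.

```java
import java.util.Arrays;

/**
 * Toy illustration of EC-reconstruction validation via redecoding,
 * using one XOR parity unit (a simplified stand-in for Reed-Solomon).
 */
public class ValidateByRedecoding {

    /** XOR "decode": any missing unit equals the XOR of the other two. */
    static byte[] xorDecode(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3};
        byte[] d1 = {4, 5, 6};
        byte[] p0 = xorDecode(d0, d1);          // parity = d0 ^ d1

        // Reconstruction: d1 was lost; rebuild it from [d0, p0].
        byte[] rebuiltD1 = xorDecode(d0, p0);

        // Validation step: decode one of the *inputs* (d0) back from the
        // freshly decoded output and the remaining input, then compare
        // against the original input we still hold in memory.
        byte[] redecodedD0 = xorDecode(rebuiltD1, p0);
        if (!Arrays.equals(redecodedD0, d0)) {
            throw new IllegalStateException("EC reconstruction corrupted data");
        }
        System.out.println("validation passed");
    }
}
```

If the decoder had produced a corrupted `rebuiltD1`, the redecoded `d0` would differ from the original with high probability and the task would fail instead of persisting bad data, matching the behavior the feature describes for RS-6-3.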
[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568724=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568724 ] ASF GitHub Bot logged work on HDFS-15759: - Author: ASF GitHub Bot Created on: 19/Mar/21 01:39 Start Date: 19/Mar/21 01:39 Worklog Time Spent: 10m Work Description: runitao commented on a change in pull request #2585: URL: https://github.com/apache/hadoop/pull/2585#discussion_r597351035 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestReconstructStripedFileWithValidator.java ## @@ -0,0 +1,98 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hdfs; + +import org.apache.hadoop.hdfs.server.datanode.DataNodeFaultInjector; +import org.junit.Test; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.nio.ByteBuffer; +import java.util.concurrent.atomic.AtomicBoolean; + +/** + * This test extends {@link TestReconstructStripedFile} to test + * ec reconstruction validation. 
+ */ +public class TestReconstructStripedFileWithValidator +extends TestReconstructStripedFile { + private static final Logger LOG = + LoggerFactory.getLogger(TestReconstructStripedFileWithValidator.class); + + public TestReconstructStripedFileWithValidator() { +LOG.info("run {} with validator.", +TestReconstructStripedFileWithValidator.class.getSuperclass() +.getSimpleName()); + } + + /** + * This test injects data pollution into decoded outputs once. + * When validation enabled, the first reconstruction task should fail + * in the validation, but the data will be recovered correctly + * by the next task. + * On the other hand, when validation disabled, the first reconstruction task + * will succeed and then lead to data corruption. + */ + @Test(timeout = 12) + public void testValidatorWithBadDecoding() + throws Exception { +DataNodeFaultInjector oldInjector = DataNodeFaultInjector.get(); +DataNodeFaultInjector badDecodingInjector = new DataNodeFaultInjector() { + private final AtomicBoolean flag = new AtomicBoolean(false); + + @Override + public void badDecoding(ByteBuffer[] outputs) { +if (!flag.get()) { + for (ByteBuffer output : outputs) { +output.mark(); +output.put((byte) (output.get(output.position()) + 1)); +output.reset(); + } +} +flag.set(true); + } +}; +DataNodeFaultInjector.set(badDecodingInjector); +int fileLen = +(getEcPolicy().getNumDataUnits() + getEcPolicy().getNumParityUnits()) +* getBlockSize() + getBlockSize() / 10; +try { + assertFileBlocksReconstruction( + "/testValidatorWithBadDecoding", + fileLen, + ReconstructionType.DataOnly, + getEcPolicy().getNumParityUnits()); Review comment: +1 too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568724) Time Spent: 5h 50m (was: 5h 40m) > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Time Spent: 5h 50m > Remaining Estimate: 0h > > EC reconstruction on DataNode has caused data corruption: HDFS-14768, > HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and > the corruption is neither detected nor auto-healed by HDFS. It is obviously >
[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
[ https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304503#comment-17304503 ] Kihwal Lee edited comment on HDFS-15901 at 3/18/21, 10:32 PM: -- The block report lease feature is supposed to improve this, but in our experience it ended up causing more problems. One of the main reasons for duplicate reporting is the lack of ability to retransmit a single report on RPC timeout. On startup, the NN's call queue can easily be overwhelmed since FBR processing is relatively slow. It is common to see the processing of a single storage taking 100s of milliseconds, so a half dozen storage reports can take up a whole second. You can easily imagine more than 60 seconds' worth of reports waiting in the call queue, which will cause a timeout for some of the reports. Unfortunately, the datanode's full block reporting does not retransmit only the affected report. It regenerates the whole thing and starts all over again. Even if only the last storage FBR had trouble, it will retransmit everything. The reason it sometimes gets stuck in safe mode is likely the curse of the block report lease: when an FBR is retransmitted, the feature makes the NN drop the reports. We have seen this happening in big clusters. If the block report lease weren't there, it wouldn't have gotten stuck in safe mode. We recently gutted the FBR lease feature internally and implemented a new block report flow control system, designed by [~daryn]. It hasn't been fully tested yet, so we haven't shared it with the community. was (Author: kihwal): The block report lease feature is supposed to improve this, but it ended up causing more problems in our experiences. One of the main reasons of duplicate reporting is lack of ability to retransmit single report on rpc timeout. On startup, the NN's call queue can be easily overwhelmed since the FBR processing relatively slow. 
It is common to see a processing of a single storage taking 100s of milliseconds. A half dozen storage reports can take up a while second. If you have enough in the call queue, the queue time can easily exceed the 60 second timeout for some of the nodes. Unfortunately, datanode's full block reporting does not retransmit the affected report only. It regenerates the whole thing and start all over again. Even if only the last storage FBR had a trouble, it will retransmit everything again. The reason why it sometimes stuck in safe mode is likely the curse of the block report lease. When FBR is retransmitted, the feature will make the NN to drop the reports. We have seen this happening in big clusters. If the block report lease wasn't there, it wouldn't have stuck in safe mode. We have recently gut out the FBR lease feature internally and implemented a new block report flow control system. It was designed by [~daryn]. It hasn't been tested fully yet, so we haven't shared it with the community. > Solve the problem of DN repeated block reports occupying too many RPCs during > Safemode > -- > > Key: HDFS-15901 > URL: https://issues.apache.org/jira/browse/HDFS-15901 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > When the cluster exceeds thousands of nodes, we want to restart the NameNode > service, and all DataNodes send a full Block action to the NameNode. During > SafeMode, some DataNodes may send blocks to NameNode multiple times, which > will take up too much RPC. In fact, this is unnecessary. > In this case, some block report leases will fail or time out, and in extreme > cases, the NameNode will always stay in Safe Mode. 
> 2021-03-14 08:16:25,873 [78438700] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(:port, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-14 08:16:31,521 [78444348] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-13 18:35:38,200 [29191027] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending
[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
[ https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304503#comment-17304503 ] Kihwal Lee commented on HDFS-15901: --- The block report lease feature is supposed to improve this, but in our experience it ended up causing more problems. One of the main reasons for duplicate reporting is the lack of ability to retransmit a single report on RPC timeout. On startup, the NN's call queue can easily be overwhelmed since FBR processing is relatively slow. It is common to see the processing of a single storage taking 100s of milliseconds; a half dozen storage reports can take up a whole second. If you have enough in the call queue, the queue time can easily exceed the 60-second timeout for some of the nodes. Unfortunately, the datanode's full block reporting does not retransmit only the affected report. It regenerates the whole thing and starts all over again. Even if only the last storage FBR had trouble, it will retransmit everything. The reason it sometimes gets stuck in safe mode is likely the curse of the block report lease: when an FBR is retransmitted, the feature makes the NN drop the reports. We have seen this happening in big clusters. If the block report lease weren't there, it wouldn't have gotten stuck in safe mode. We recently gutted the FBR lease feature internally and implemented a new block report flow control system, designed by [~daryn]. It hasn't been fully tested yet, so we haven't shared it with the community. 
> Solve the problem of DN repeated block reports occupying too many RPCs during > Safemode > -- > > Key: HDFS-15901 > URL: https://issues.apache.org/jira/browse/HDFS-15901 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > When the cluster exceeds thousands of nodes, we want to restart the NameNode > service, and all DataNodes send a full Block action to the NameNode. During > SafeMode, some DataNodes may send blocks to NameNode multiple times, which > will take up too much RPC. In fact, this is unnecessary. > In this case, some block report leases will fail or time out, and in extreme > cases, the NameNode will always stay in Safe Mode. > 2021-03-14 08:16:25,873 [78438700] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(:port, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-14 08:16:31,521 [78444348] - INFO [Block report > processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded > non-initial block report from DatanodeRegistration(, > datanodeUuid=, infoPort=, infoSecurePort=, > ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode > still in startup phase > 2021-03-13 18:35:38,200 [29191027] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending set. > 2021-03-13 18:36:08,143 [29220970] - WARN [Block report > processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for > DN , because the DN is not in the pending set. > 2021-03-13 18:36:08,143 [29220970] - WARN [Block report > processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for > DN , because the lease has expired. 
> 2021-03-13 18:36:08,145 [29220972] - WARN [Block report > processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for > DN , because the lease has expired.
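The retransmission behavior Kihwal describes — the DataNode regenerates and resends every storage's report even when only the last one timed out — suggests the alternative of retrying only the failed storage report. The sketch below is purely illustrative (it is not Hadoop's DataNode code, and the RPC stub and names are hypothetical); it just shows the RPC savings of per-storage retry.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/**
 * Illustrative sketch: retry only the storage report whose RPC timed
 * out, instead of regenerating and resending the whole full block
 * report from scratch.
 */
public class PerStorageRetrySketch {

    /** Stand-in for the NameNode RPC; storage "s3" times out once. */
    static int failuresLeft = 1;
    static boolean sendStorageReport(String storageId) {
        if (storageId.equals("s3") && failuresLeft-- > 0) {
            return false;   // simulated RPC timeout
        }
        return true;
    }

    public static void main(String[] args) {
        Deque<String> pending = new ArrayDeque<>(
            Arrays.asList("s1", "s2", "s3", "s4"));
        int rpcCalls = 0;
        while (!pending.isEmpty()) {
            String storage = pending.peekFirst();
            rpcCalls++;
            if (sendStorageReport(storage)) {
                pending.removeFirst();   // success: move to next storage
            }                            // failure: retry this storage only
        }
        // 4 storages + 1 retry = 5 RPCs; regenerating and resending the
        // whole FBR after the timeout would repeat the already-accepted
        // reports as well, which is the duplicate load described above.
        System.out.println("rpcCalls=" + rpcCalls);
    }
}
```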
[jira] [Commented] (HDFS-15905) RBF: Improve Router performance with router redirection
[ https://issues.apache.org/jira/browse/HDFS-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1730#comment-1730 ] Íñigo Goiri commented on HDFS-15905: If I understand correctly, the proposal is to extend the client to query the Router and then contact the subcluster directly? To be honest, this is very similar to ViewFs; you could potentially extend ViewFs to request the mount table from the Router. I'll let others chime in on the token aspect. Regarding the performance issues, how many Routers are you using for how many namenodes? > RBF: Improve Router performance with router redirection > --- > > Key: HDFS-15905 > URL: https://issues.apache.org/jira/browse/HDFS-15905 > Project: Hadoop HDFS > Issue Type: New Feature > Components: rbf >Affects Versions: 3.1.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > > Router implementation currently takes the proxy approach to handle the client > requests: the routers receive the requests from the clients and send the > requests to the target clusters on behalf of the clients. > This approach works well, while after moving more clusters on top of > routers, we are seeing that routers are becoming the bottleneck since e.g., > without RBF, the clients themselves manage the connections for themselves, > while with RBF, the limited routers manage much more connections for the > clients; we also keep idle connections to boost the connection performance. > We have done some work to tune connection management but it doesn't help much. > We are proposing to reduce the functionality on the router side and use them > as actual router instead of proxy: the clients talk to routers to resolve > target cluster info given a path and get router delegation token; the clients > directly send the requests to target cluster. > A big challenge here is the token authentication against target cluster with > router token only. 
One approach: we can ask router to return target cluster > token along with router token so the clients can authenticate against target > cluster. Second approach: similar to block token mechanism, the router > exchanges secret keys with target clusters through heart-beats so the clients > can authenticate with target cluster with that router token. > I would like to know your feedback.
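The redirection flow proposed in HDFS-15905 (one cheap resolution RPC to the Router, then direct RPCs to the target subcluster) can be sketched conceptually as below. The class and method names are hypothetical, and the longest-prefix lookup is a simplified, ViewFs-style mount-table resolution, not the actual RBF resolver.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Conceptual sketch of the proposed redirection flow: the client asks
 * the Router only to resolve a path to a subcluster, then contacts
 * that subcluster directly instead of proxying every request.
 */
public class RouterRedirectSketch {

    /**
     * Longest-prefix mount-table lookup. Note this simple startsWith
     * check is not path-component-aware; a real resolver must match on
     * whole path components.
     */
    static String resolveSubcluster(TreeMap<String, String> mountTable,
                                    String path) {
        Map.Entry<String, String> best = null;
        for (Map.Entry<String, String> e : mountTable.entrySet()) {
            if (path.startsWith(e.getKey())
                && (best == null
                    || e.getKey().length() > best.getKey().length())) {
                best = e;
            }
        }
        return best == null ? null : best.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> mountTable = new TreeMap<>();
        mountTable.put("/data", "ns1");
        mountTable.put("/data/warm", "ns2");

        // Step 1: one lightweight RPC to the Router resolves the path.
        String ns = resolveSubcluster(mountTable, "/data/warm/logs/part-0");
        // Step 2 (not shown): the client sends the actual read/write RPCs
        // straight to that namespace, bypassing the Router proxy path.
        System.out.println("target=" + ns);
    }
}
```

The open problem the comment raises remains authentication: the client would still need a token valid at the target cluster, obtained by either of the two approaches described above.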
[jira] [Updated] (HDFS-15905) RBF: Improve Router performance with router redirection
[ https://issues.apache.org/jira/browse/HDFS-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-15905: --- Summary: RBF: Improve Router performance with router redirection (was: Improve Router performance with router redirection) > RBF: Improve Router performance with router redirection > --- > > Key: HDFS-15905 > URL: https://issues.apache.org/jira/browse/HDFS-15905 > Project: Hadoop HDFS > Issue Type: New Feature > Components: rbf >Affects Versions: 3.1.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > > Router implementation currently takes the proxy approach to handle the client > requests: the routers receive the requests from the clients and send the > requests to the target clusters on behalf of the clients. > This approach works well, while after moving more clusters on top of > routers, we are seeing that routers are becoming the bottleneck since e.g., > without RBF, the clients themselves manage the connections for themselves, > while with RBF, the limited routers manage much more connections for the > clients; we also keep idle connections to boost the connection performance. > We have done some work to tune connection management but it doesn't help much. > We are proposing to reduce the functionality on the router side and use them > as actual router instead of proxy: the clients talk to routers to resolve > target cluster info given a path and get router delegation token; the clients > directly send the requests to target cluster. > A big challenge here is the token authentication against target cluster with > router token only. One approach: we can ask router to return target cluster > token along with router token so the clients can authenticate against target > cluster. Second approach: similar to block token mechanism, the router > exchanges secret keys with target clusters through heart-beats so the clients > can authenticate with target cluster with that router token. 
> I would like to know your feedback. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568603=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568603 ] ASF GitHub Bot logged work on HDFS-15904: - Author: ASF GitHub Bot Created on: 18/Mar/21 19:13 Start Date: 18/Mar/21 19:13 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2785: URL: https://github.com/apache/hadoop/pull/2785#issuecomment-802218023 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 59s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 2s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 34m 46s | | trunk passed | | +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 1m 11s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 3s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 21s | | trunk passed | | +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 23s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 17s | | trunk passed | | +1 :green_heart: | shadedclient | 18m 40s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 13s | | the patch passed | | +1 :green_heart: | compile | 1m 15s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 1m 15s | | the patch passed | | +1 :green_heart: | compile | 1m 9s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 1m 9s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 55s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 272 unchanged - 4 fixed = 272 total (was 276) | | +1 :green_heart: | mvnsite | 1m 14s | | the patch passed | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 16s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 20s | | the patch passed | | +1 :green_heart: | shadedclient | 19m 7s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 363m 53s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2785/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 36s | | The patch does not generate ASF License warnings. 
| | | | 457m 5s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby | | | hadoop.hdfs.TestDFSShell | | | hadoop.hdfs.server.namenode.ha.TestEditLogTailer | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList | | | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots | | | hadoop.hdfs.server.datanode.TestBlockScanner | | | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.hdfs.TestPersistBlocks | | | hadoop.hdfs.TestViewDistributedFileSystem | | | hadoop.hdfs.server.namenode.TestDecommissioningStatus | | | hadoop.hdfs.TestViewDistributedFileSystemContract | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2785/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2785 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle
[jira] [Work logged] (HDFS-15868) Possible Resource Leak in EditLogFileOutputStream
[ https://issues.apache.org/jira/browse/HDFS-15868?focusedWorklogId=568591&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568591 ] ASF GitHub Bot logged work on HDFS-15868: - Author: ASF GitHub Bot Created on: 18/Mar/21 18:54 Start Date: 18/Mar/21 18:54 Worklog Time Spent: 10m Work Description: Nargeshdb commented on pull request #2736: URL: https://github.com/apache/hadoop/pull/2736#issuecomment-802206225 Should we expect [these tests](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2736/5/testReport/) to pass or not? I checked the test failures in [this](https://github.com/apache/hadoop/pull/2784) PR that was made yesterday and found 8 failures common to the two PRs. I was wondering if you could confirm that these failures are expected. @Hexiaoqiao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568591) Time Spent: 2h 20m (was: 2h 10m) > Possible Resource Leak in EditLogFileOutputStream > - > > Key: HDFS-15868 > URL: https://issues.apache.org/jira/browse/HDFS-15868 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Narges Shadab >Assignee: Narges Shadab >Priority: Major > Labels: pull-request-available > Time Spent: 2h 20m > Remaining Estimate: 0h > > We noticed a possible resource leak > [here|https://github.com/apache/hadoop/blob/1f1a1ef52df896a2b66b16f5bbc17aa39b1a1dd7/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java#L91]. > If an I/O error occurs at line 91, rp remains open since the exception isn't > caught locally, and there is no way for any caller to close the > RandomAccessFile. > I'll submit a pull request to fix it. 
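The leak described above is the standard two-step-open hazard: the first resource is acquired, then a second open over its descriptor throws, and nothing closes the first one. The usual fix is to guard the second step and close the first resource before rethrowing. A minimal, self-contained sketch of that pattern — the class and method names here are illustrative, not the actual EditLogFileOutputStream code:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class LeakSafeOpen {
    // Mirrors the two-step open in EditLogFileOutputStream: first a
    // RandomAccessFile, then a FileOutputStream over its descriptor.
    // If the second step throws, the RandomAccessFile is closed before
    // the exception propagates, so its file descriptor cannot leak.
    static FileOutputStream open(File name) throws IOException {
        RandomAccessFile rp = new RandomAccessFile(name, "rw");
        try {
            return new FileOutputStream(rp.getFD());
        } catch (IOException | RuntimeException e) {
            rp.close(); // release the descriptor before rethrowing
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("edits", ".tmp");
        try (FileOutputStream fp = open(f)) {
            fp.write(1);
        }
        f.delete();
        System.out.println("ok");
    }
}
```

Note that try-with-resources alone cannot express this, because the RandomAccessFile must intentionally stay open on success (the returned stream shares its descriptor); only the failure path should close it.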
[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568589=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568589 ] ASF GitHub Bot logged work on HDFS-15904: - Author: ASF GitHub Bot Created on: 18/Mar/21 18:50 Start Date: 18/Mar/21 18:50 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2785: URL: https://github.com/apache/hadoop/pull/2785#issuecomment-802203762 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 50s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 2s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 34m 57s | | trunk passed | | +1 :green_heart: | compile | 1m 21s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 1m 12s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 59s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 21s | | trunk passed | | +1 :green_heart: | javadoc | 0m 53s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 22s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 16s | | trunk passed | | +1 :green_heart: | shadedclient | 18m 31s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 12s | | the patch passed | | +1 :green_heart: | compile | 1m 15s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 1m 15s | | the patch passed | | +1 :green_heart: | compile | 1m 6s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 1m 6s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 55s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 272 unchanged - 4 fixed = 272 total (was 276) | | +1 :green_heart: | mvnsite | 1m 13s | | the patch passed | | +1 :green_heart: | javadoc | 0m 49s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 19s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 22s | | the patch passed | | +1 :green_heart: | shadedclient | 18m 37s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 342m 45s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2785/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. 
| | | | 435m 13s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS | | | hadoop.hdfs.server.datanode.TestIncrementalBrVariations | | | hadoop.hdfs.server.namenode.ha.TestEditLogTailer | | | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots | | | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby | | | hadoop.hdfs.TestDFSShell | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList | | | hadoop.hdfs.server.datanode.TestBlockScanner | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.hdfs.TestPersistBlocks | | | hadoop.hdfs.server.namenode.TestDecommissioningStatus | | | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2785/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2785 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs
[jira] [Work logged] (HDFS-15868) Possible Resource Leak in EditLogFileOutputStream
[ https://issues.apache.org/jira/browse/HDFS-15868?focusedWorklogId=568528=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568528 ] ASF GitHub Bot logged work on HDFS-15868: - Author: ASF GitHub Bot Created on: 18/Mar/21 17:43 Start Date: 18/Mar/21 17:43 Worklog Time Spent: 10m Work Description: Nargeshdb commented on pull request #2736: URL: https://github.com/apache/hadoop/pull/2736#issuecomment-802156536 @Hexiaoqiao We are investigating the test failures. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568528) Time Spent: 2h 10m (was: 2h) > Possible Resource Leak in EditLogFileOutputStream > - > > Key: HDFS-15868 > URL: https://issues.apache.org/jira/browse/HDFS-15868 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Narges Shadab >Assignee: Narges Shadab >Priority: Major > Labels: pull-request-available > Time Spent: 2h 10m > Remaining Estimate: 0h > > We noticed a possible resource leak > [here|https://github.com/apache/hadoop/blob/1f1a1ef52df896a2b66b16f5bbc17aa39b1a1dd7/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java#L91]. > If an I/O error occurs at line 91, rp remains open since the exception isn't > caught locally, and there is no way for any caller to close the > RandomAccessFile. > I'll submit a pull request to fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.
[ https://issues.apache.org/jira/browse/HDFS-15874?focusedWorklogId=568521=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568521 ] ASF GitHub Bot logged work on HDFS-15874: - Author: ASF GitHub Bot Created on: 18/Mar/21 17:36 Start Date: 18/Mar/21 17:36 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2744: URL: https://github.com/apache/hadoop/pull/2744#issuecomment-802151603 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 36s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 20s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 20m 7s | | trunk passed | | +1 :green_heart: | compile | 4m 51s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 4m 27s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 19s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 2s | | trunk passed | | +1 :green_heart: | javadoc | 1m 35s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 2m 24s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 4m 20s | | trunk passed | | +1 :green_heart: | shadedclient | 13m 59s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 27s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 41s | | the patch passed | | +1 :green_heart: | compile | 4m 45s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 4m 45s | | the patch passed | | +1 :green_heart: | compile | 4m 22s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 4m 22s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 11s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2744/5/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 14 new + 697 unchanged - 1 fixed = 711 total (was 698) | | +1 :green_heart: | mvnsite | 1m 45s | | the patch passed | | +1 :green_heart: | xml | 0m 1s | | The patch has no ill-formed XML file. | | +1 :green_heart: | javadoc | 1m 17s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 2m 13s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 4m 22s | | the patch passed | | +1 :green_heart: | shadedclient | 14m 3s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 227m 18s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2744/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | unit | 17m 41s | | hadoop-hdfs-rbf in the patch passed. | | +1 :green_heart: | asflicense | 0m 44s | | The patch does not generate ASF License warnings. 
| | | | 353m 55s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2744/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2744 | | JIRA Issue | HDFS-15874 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell xml | | uname | Linux 52dbd6584158 4.15.0-58-generic
[jira] [Commented] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.
[ https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304307#comment-17304307 ] Hadoop QA commented on HDFS-15874: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 36s{color} | | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} codespell {color} | {color:blue} 0m 0s{color} | | {color:blue} codespell was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | | {color:green} The patch appears to include 3 new or modified test files. 
{color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 14m 20s{color} | | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 7s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 51s{color} | | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 27s{color} | | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 19s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 2s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s{color} | | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 24s{color} | | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 4m 20s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 59s{color} | | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 27s{color} | | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 41s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 45s{color} | | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 45s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 22s{color} | | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 22s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} blanks {color} | {color:green} 0m 0s{color} | | {color:green} The patch has no blanks issues. {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 11s{color} | [/results-checkstyle-hadoop-hdfs-project.txt|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2744/5/artifact/out/results-checkstyle-hadoop-hdfs-project.txt] | {color:orange} hadoop-hdfs-project: The patch generated 14 new + 697 unchanged - 1 fixed = 711 total (was 698) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 45s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 17s{color} | | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 13s{color} | | {color:green} the patch passed with JDK Private
[jira] [Commented] (HDFS-15905) Improve Router performance with router redirection
[ https://issues.apache.org/jira/browse/HDFS-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304294#comment-17304294 ] Aihua Xu commented on HDFS-15905: - [~elgoiri], [~jingzhao], [~fengnanli] Can you provide any feedback/suggestion? Thanks a lot. > Improve Router performance with router redirection > -- > > Key: HDFS-15905 > URL: https://issues.apache.org/jira/browse/HDFS-15905 > Project: Hadoop HDFS > Issue Type: New Feature > Components: rbf >Affects Versions: 3.1.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > > Router implementation currently takes the proxy approach to handle the client > requests: the routers receive the requests from the clients and send the > requests to the target clusters on behalf of the clients. > This approach works well, while after moving more clusters on top of > routers, we are seeing that routers are becoming the bottleneck since e.g., > without RBF, the clients themselves manage the connections for themselves, > while with RBF, the limited routers manage much more connections for the > clients; we also keep idle connections to boost the connection performance. > We have done some work to tune connection management but it doesn't help much. > We are proposing to reduce the functionality on the router side and use them > as actual router instead of proxy: the clients talk to routers to resolve > target cluster info given a path and get router delegation token; the clients > directly send the requests to target cluster. > A big challenge here is the token authentication against target cluster with > router token only. One approach: we can ask router to return target cluster > token along with router token so the clients can authenticate against target > cluster. Second approach: similar to block token mechanism, the router > exchanges secret keys with target clusters through heart-beats so the clients > can authenticate with target cluster with that router token. 
> I would like to know your feedback. 
[jira] [Created] (HDFS-15905) Improve Router performance with router redirection
Aihua Xu created HDFS-15905: --- Summary: Improve Router performance with router redirection Key: HDFS-15905 URL: https://issues.apache.org/jira/browse/HDFS-15905 Project: Hadoop HDFS Issue Type: New Feature Components: rbf Affects Versions: 3.1.0 Reporter: Aihua Xu Assignee: Aihua Xu

The Router implementation currently takes a proxy approach to handling client requests: the routers receive requests from the clients and forward them to the target clusters on the clients' behalf.

This approach works well, but after moving more clusters on top of routers we are seeing the routers become the bottleneck: without RBF, each client manages its own connections, while with RBF a limited number of routers manage far more connections on behalf of all the clients; we also keep idle connections alive to boost connection performance. We have done some work to tune connection management, but it doesn't help much.

We propose to reduce the functionality on the router side and use the routers as actual routers instead of proxies: a client talks to a router to resolve the target cluster for a given path and to obtain a router delegation token, then sends its requests directly to the target cluster.

A big challenge here is token authentication against the target cluster with only a router token. One approach: the router returns the target-cluster token along with the router token, so the client can authenticate against the target cluster. A second approach: similar to the block token mechanism, the router exchanges secret keys with the target clusters through heartbeats, so the client can authenticate with the target cluster using the router token.

I would like to know your feedback.
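The proposed resolve-then-direct-connect flow can be sketched roughly as follows. Everything in this sketch is hypothetical — `resolve`, `ResolveResult`, the mount-table map, and the token strings are illustrative stand-ins, not an existing RBF API. The point is only the shape of the design: one router round trip returns routing metadata plus a credential (approach one above, where the router hands back the target-cluster token), and the client then contacts the target cluster itself instead of proxying every request through the router:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RedirectionSketch {
    // Hypothetical result of the single router RPC in the proposed design:
    // which cluster owns the path, plus a token usable against that cluster.
    static final class ResolveResult {
        final String targetCluster;
        final String targetToken;
        ResolveResult(String c, String t) { targetCluster = c; targetToken = t; }
    }

    // Stand-in for the router side: longest-prefix match over a mount table,
    // returning the owning nameservice and a token for it.
    static ResolveResult resolve(Map<String, String> mountTable, String path) {
        String best = null;
        for (String mount : mountTable.keySet()) {
            if (path.startsWith(mount) && (best == null || mount.length() > best.length())) {
                best = mount;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no mount for " + path);
        }
        String cluster = mountTable.get(best);
        return new ResolveResult(cluster, "token-for-" + cluster);
    }

    public static void main(String[] args) {
        Map<String, String> mounts = new LinkedHashMap<>();
        mounts.put("/user", "ns1");
        mounts.put("/data", "ns2");
        ResolveResult r = resolve(mounts, "/data/logs/part-0");
        // After this single round trip, the client would open its own
        // connection to r.targetCluster and authenticate with r.targetToken,
        // so the heavy read/write traffic never touches the router.
        System.out.println(r.targetCluster + " " + r.targetToken);
    }
}
```

Under this flow the router handles one small metadata RPC per path resolution rather than holding a proxied connection for the life of every client operation, which is the bottleneck the proposal aims to remove.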
[jira] [Commented] (HDFS-15894) Trace Time-consuming RPC response of certain threshold.
[ https://issues.apache.org/jira/browse/HDFS-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304280#comment-17304280 ] Hadoop QA commented on HDFS-15894: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 2m 12s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. 
{color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 59s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 33s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 23s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 19m 34s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 5s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 34s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 25m 44s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 3m 32s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 25s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 29s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 29s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 18s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 18s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 8s{color} | {color:green}{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 512 unchanged - 1 fixed = 512 total (was 513) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 26s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green}{color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 32s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color}
[jira] [Work logged] (HDFS-15903) Refactor X-Platform library
[ https://issues.apache.org/jira/browse/HDFS-15903?focusedWorklogId=568479&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568479 ] ASF GitHub Bot logged work on HDFS-15903: - Author: ASF GitHub Bot Created on: 18/Mar/21 16:40 Start Date: 18/Mar/21 16:40 Worklog Time Spent: 10m Work Description: GauthamBanasandra commented on pull request #2783: URL: https://github.com/apache/hadoop/pull/2783#issuecomment-802097264 @aajisaka could you also please review my PR? Issue Time Tracking --- Worklog Id: (was: 568479) Time Spent: 50m (was: 40m) > Refactor X-Platform library > --- > > Key: HDFS-15903 > URL: https://issues.apache.org/jira/browse/HDFS-15903 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs++ >Affects Versions: 3.2.2 >Reporter: Gautham Banasandra >Assignee: Gautham Banasandra >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > X-Platform started out as a utility to help in writing cross-platform code in > Hadoop. As its scope expands to cover various scenarios, it is necessary to > refactor it at an early stage to provide for proper organization and growth of the > X-Platform library. 
[jira] [Commented] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
[ https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304175#comment-17304175 ] Hadoop QA commented on HDFS-15899: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 23s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. 
{color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 40s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 56s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 4s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 43s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 12s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 11s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 32s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 56s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 27m 32s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 5m 39s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 42s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 42s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 45s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 45s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 8s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 0s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green}{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient
[jira] [Commented] (HDFS-13975) TestBalancer#testMaxIterationTime fails sporadically
[ https://issues.apache.org/jira/browse/HDFS-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304146#comment-17304146 ] Toshihiko Uchida commented on HDFS-13975: - [~aajisaka] Thanks for your review and commit, too! > TestBalancer#testMaxIterationTime fails sporadically > > > Key: HDFS-13975 > URL: https://issues.apache.org/jira/browse/HDFS-13975 > Project: Hadoop HDFS > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Jason Darrell Lowe >Assignee: Toshihiko Uchida >Priority: Major > Labels: flaky-test, pull-request-available > Fix For: 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3 > > Time Spent: 40m > Remaining Estimate: 0h > > A number of precommit builds have seen this test fail like this: > {noformat} > java.lang.AssertionError: Unexpected iteration runtime: 4021ms > 3.5s > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.hdfs.server.balancer.TestBalancer.testMaxIterationTime(TestBalancer.java:1649) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568299=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568299 ] ASF GitHub Bot logged work on HDFS-15759: - Author: ASF GitHub Bot Created on: 18/Mar/21 12:24 Start Date: 18/Mar/21 12:24 Worklog Time Spent: 10m Work Description: touchida commented on a change in pull request #2585: URL: https://github.com/apache/hadoop/pull/2585#discussion_r596818639 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/erasurecode/StripedBlockReconstructor.java ## @@ -126,12 +128,26 @@ private void reconstructTargets(int toReconstructLen) throws IOException { int[] erasedIndices = stripedWriter.getRealTargetIndices(); ByteBuffer[] outputs = stripedWriter.getRealTargetBuffers(toReconstructLen); +if (isValidationEnabled()) { + markBuffers(inputs); + decode(inputs, erasedIndices, outputs); + resetBuffers(inputs); + + DataNodeFaultInjector.get().badDecoding(outputs); + getValidator().validate(inputs, erasedIndices, outputs); Review comment: @runitao Thanks for your comment! How about adding a metric for the exception like `EcInvalidReconstructionTasks`? (I saw your deleted comment.) As for logging, I think it's better to output more messages through the entire EC reconstruction process, and so I'd like to handle it in another issue. Are you suggesting anything else? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568299) Time Spent: 5h (was: 4h 50m) > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Time Spent: 5h > Remaining Estimate: 0h > > EC reconstruction on DataNode has caused data corruption: HDFS-14768, > HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and > the corruption is neither detected nor auto-healed by HDFS. It is obviously > hard for users to monitor data integrity by themselves, and even if they find > corrupted data, it is difficult or sometimes impossible to recover it. > To prevent further data corruption issues, this feature proposes a simple and > effective way to verify EC reconstruction correctness on DataNode at each > reconstruction process. > It verifies correctness of outputs decoded from inputs as follows: > 1. Decode an input back from the outputs; > 2. Compare the decoded input with the original input. > For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs > [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from > [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0. > When an EC reconstruction task goes wrong, the comparison will fail with high > probability. > Then the task will also fail and be retried by NameNode. > The next reconstruction will succeed if the condition that triggered the failure > is gone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
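The decode-then-validate scheme described in the issue can be illustrated with a small self-contained sketch. This uses single-parity XOR coding as a simplified stand-in for RS-6-3, so the class and method names here are hypothetical and not Hadoop's RawErasureDecoder/DecodingValidator API; only the idea (reconstruct a lost block, then decode a known-good block back from the result and compare) matches the feature.

```java
import java.util.Arrays;

public class DecodeValidateSketch {

    // XOR all blocks together. With single-parity XOR coding, any one lost
    // block equals the XOR of all surviving blocks, so this doubles as both
    // the "decode" and the parity computation.
    static byte[] xorDecode(byte[][] blocks) {
        byte[] out = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            for (int i = 0; i < out.length; i++) {
                out[i] ^= block[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3}, d1 = {4, 5, 6}, d2 = {7, 8, 9};
        byte[] p = xorDecode(new byte[][]{d0, d1, d2}); // parity = d0^d1^d2

        // Reconstruction: d1 is lost, rebuild it from the survivors.
        byte[] rebuiltD1 = xorDecode(new byte[][]{d0, d2, p});

        // Validation: decode a known-good input (d0) back from the rebuilt
        // output plus the other survivors, and compare with the original.
        // A wrong rebuiltD1 would make this comparison fail.
        byte[] checkD0 = xorDecode(new byte[][]{rebuiltD1, d2, p});
        System.out.println(Arrays.equals(checkD0, d0)
            ? "validation passed" : "validation failed");
    }
}
```

As in the RS-6-3 example from the description, the validation reuses the decoder itself, so a systematic decoding bug could in principle cancel out; the feature relies on corruption being unlikely to survive a second, differently-shaped decode.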
[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568307=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568307 ] ASF GitHub Bot logged work on HDFS-15759: - Author: ASF GitHub Bot Created on: 18/Mar/21 12:31 Start Date: 18/Mar/21 12:31 Worklog Time Spent: 10m Work Description: touchida commented on a change in pull request #2585: URL: https://github.com/apache/hadoop/pull/2585#discussion_r596823600 ## File path: hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/erasurecode/rawcoder/DecodingValidator.java ## @@ -0,0 +1,189 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.io.erasurecode.rawcoder; + +import org.apache.hadoop.classification.InterfaceAudience; +import org.apache.hadoop.thirdparty.com.google.common.annotations.VisibleForTesting; +import org.apache.hadoop.io.erasurecode.ECChunk; + +import java.io.IOException; +import java.nio.ByteBuffer; + +/** + * A utility class to validate decoding. 
+ */ +@InterfaceAudience.Private +public class DecodingValidator { + + private final RawErasureDecoder decoder; + private ByteBuffer buffer; + private int[] newValidIndexes; + private int newErasedIndex; + + public DecodingValidator(RawErasureDecoder decoder) { +this.decoder = decoder; + } + + /** + * Validate outputs decoded from inputs, by decoding an input back from + * the outputs and comparing it with the original one. + * + * For instance, in RS (6, 3), let (d0, d1, d2, d3, d4, d5) be sources + * and (p0, p1, p2) be parities, and assume + * inputs = [d0, null (d1), d2, d3, d4, d5, null (p0), p1, null (p2)]; + * erasedIndexes = [1, 6]; + * outputs = [d1, p1]. + * Then + * 1. Create new inputs, erasedIndexes and outputs for validation so that + * the inputs could contain the decoded outputs, and decode them: + * newInputs = [d1, d2, d3, d4, d5, p1] + * newErasedIndexes = [0] + * newOutputs = [d0'] + * 2. Compare d0 and d0'. The comparison will fail with high probability + * when the initial outputs are wrong. + * + * Note that the input buffers' positions must be the ones where data are + * read: If the input buffers have been processed by a decoder, the buffers' + * positions must be reset before being passed into this method. + * + * This method does not change outputs and erasedIndexes. + * + * @param inputs input buffers used for decoding. The buffers' position + * are moved to the end after this method. 
+ * @param erasedIndexes indexes of erased units used for decoding + * @param outputs decoded output buffers, which are ready to be read after + *the call + * @throws IOException + */ + public void validate(ByteBuffer[] inputs, int[] erasedIndexes, + ByteBuffer[] outputs) throws IOException { +markBuffers(outputs); + +try { + ByteBuffer validInput = CoderUtil.findFirstValidInput(inputs); + boolean isDirect = validInput.isDirect(); + int capacity = validInput.capacity(); + int remaining = validInput.remaining(); + + // Init buffer + if (buffer == null) { +buffer = allocateBuffer(isDirect, capacity); + } else if (buffer.isDirect() != isDirect + || buffer.capacity() < remaining) { +buffer = allocateBuffer(isDirect, capacity); + } Review comment: @aajisaka Thanks for your review and suggestion! It simplifies the code. I'll fix it up. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568307) Time Spent: 5.5h (was: 5h 20m) > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 >
[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568301=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568301 ] ASF GitHub Bot logged work on HDFS-15759: - Author: ASF GitHub Bot Created on: 18/Mar/21 12:26 Start Date: 18/Mar/21 12:26 Worklog Time Spent: 10m Work Description: touchida commented on a change in pull request #2585: URL: https://github.com/apache/hadoop/pull/2585#discussion_r596818639 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/erasurecode/StripedBlockReconstructor.java ## @@ -126,12 +128,26 @@ private void reconstructTargets(int toReconstructLen) throws IOException { int[] erasedIndices = stripedWriter.getRealTargetIndices(); ByteBuffer[] outputs = stripedWriter.getRealTargetBuffers(toReconstructLen); +if (isValidationEnabled()) { + markBuffers(inputs); + decode(inputs, erasedIndices, outputs); + resetBuffers(inputs); + + DataNodeFaultInjector.get().badDecoding(outputs); + getValidator().validate(inputs, erasedIndices, outputs); Review comment: @runitao Thanks for your comment! How about adding a metric for the exception like `EcInvalidReconstructionTasks`? (I saw your deleted comment.) As for logging, I think it's better to output more messages through the entire EC reconstruction process, and so I'd like to handle it in another issue. Are you suggesting anything else? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568301) Time Spent: 5h 10m (was: 5h) > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Time Spent: 5h 10m > Remaining Estimate: 0h > > EC reconstruction on DataNode has caused data corruption: HDFS-14768, > HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and > the corruption is neither detected nor auto-healed by HDFS. It is obviously > hard for users to monitor data integrity by themselves, and even if they find > corrupted data, it is difficult or sometimes impossible to recover it. > To prevent further data corruption issues, this feature proposes a simple and > effective way to verify EC reconstruction correctness on DataNode at each > reconstruction process. > It verifies correctness of outputs decoded from inputs as follows: > 1. Decode an input back from the outputs; > 2. Compare the decoded input with the original input. > For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs > [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from > [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0. > When an EC reconstruction task goes wrong, the comparison will fail with high > probability. > Then the task will also fail and be retried by NameNode. > The next reconstruction will succeed if the condition that triggered the failure > is gone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?focusedWorklogId=568303=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568303 ] ASF GitHub Bot logged work on HDFS-15759: - Author: ASF GitHub Bot Created on: 18/Mar/21 12:26 Start Date: 18/Mar/21 12:26 Worklog Time Spent: 10m Work Description: touchida commented on a change in pull request #2585: URL: https://github.com/apache/hadoop/pull/2585#discussion_r596820273 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestReconstructStripedFileWithValidator.java ## @@ -0,0 +1,98 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hdfs; + +import org.apache.hadoop.hdfs.server.datanode.DataNodeFaultInjector; +import org.junit.Test; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.nio.ByteBuffer; +import java.util.concurrent.atomic.AtomicBoolean; + +/** + * This test extends {@link TestReconstructStripedFile} to test + * ec reconstruction validation. 
+ */ +public class TestReconstructStripedFileWithValidator +extends TestReconstructStripedFile { + private static final Logger LOG = + LoggerFactory.getLogger(TestReconstructStripedFileWithValidator.class); + + public TestReconstructStripedFileWithValidator() { +LOG.info("run {} with validator.", +TestReconstructStripedFileWithValidator.class.getSuperclass() +.getSimpleName()); + } + + /** + * This test injects data pollution into decoded outputs once. + * When validation enabled, the first reconstruction task should fail + * in the validation, but the data will be recovered correctly + * by the next task. + * On the other hand, when validation disabled, the first reconstruction task + * will succeed and then lead to data corruption. + */ + @Test(timeout = 12) + public void testValidatorWithBadDecoding() + throws Exception { +DataNodeFaultInjector oldInjector = DataNodeFaultInjector.get(); +DataNodeFaultInjector badDecodingInjector = new DataNodeFaultInjector() { + private final AtomicBoolean flag = new AtomicBoolean(false); + + @Override + public void badDecoding(ByteBuffer[] outputs) { +if (!flag.get()) { + for (ByteBuffer output : outputs) { +output.mark(); +output.put((byte) (output.get(output.position()) + 1)); +output.reset(); + } +} +flag.set(true); + } +}; +DataNodeFaultInjector.set(badDecodingInjector); +int fileLen = +(getEcPolicy().getNumDataUnits() + getEcPolicy().getNumParityUnits()) +* getBlockSize() + getBlockSize() / 10; +try { + assertFileBlocksReconstruction( + "/testValidatorWithBadDecoding", + fileLen, + ReconstructionType.DataOnly, + getEcPolicy().getNumParityUnits()); Review comment: @runitao Agree! I'm now considering to check the metric that I mentioned at https://github.com/apache/hadoop/pull/2585#discussion_r596818639. If you know an another idea, please let me know. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568303) Time Spent: 5h 20m (was: 5h 10m) > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Time Spent: 5h 20m >
[jira] [Updated] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-15904: -- Labels: pull-request-available (was: ) > Flaky test TestBalancer#testBalancerWithSortTopNodes() > -- > > Key: HDFS-15904 > URL: https://issues.apache.org/jira/browse/HDFS-15904 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > TestBalancer#testBalancerWithSortTopNodes shows some flakes in around ~10 > runs or so. It's reproducible locally also. Basically, balancing either moves > 2 blocks of size 100+100 bytes or it moves 3 blocks of size 100+100+50 bytes > (2nd case causes flakies). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
[ https://issues.apache.org/jira/browse/HDFS-15904?focusedWorklogId=568267=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568267 ] ASF GitHub Bot logged work on HDFS-15904: - Author: ASF GitHub Bot Created on: 18/Mar/21 11:34 Start Date: 18/Mar/21 11:34 Worklog Time Spent: 10m Work Description: virajjasani opened a new pull request #2785: URL: https://github.com/apache/hadoop/pull/2785 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 568267) Remaining Estimate: 0h Time Spent: 10m > Flaky test TestBalancer#testBalancerWithSortTopNodes() > -- > > Key: HDFS-15904 > URL: https://issues.apache.org/jira/browse/HDFS-15904 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Fix For: 3.4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > TestBalancer#testBalancerWithSortTopNodes shows some flakes in around ~10 > runs or so. It's reproducible locally also. Basically, balancing either moves > 2 blocks of size 100+100 bytes or it moves 3 blocks of size 100+100+50 bytes > (2nd case causes flakies). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15904) Flaky test TestBalancer#testBalancerWithSortTopNodes()
Viraj Jasani created HDFS-15904: --- Summary: Flaky test TestBalancer#testBalancerWithSortTopNodes() Key: HDFS-15904 URL: https://issues.apache.org/jira/browse/HDFS-15904 Project: Hadoop HDFS Issue Type: Test Reporter: Viraj Jasani Assignee: Viraj Jasani Fix For: 3.4.0 TestBalancer#testBalancerWithSortTopNodes shows some flakes in around ~10 runs or so. It's reproducible locally also. Basically, balancing either moves 2 blocks of size 100+100 bytes or it moves 3 blocks of size 100+100+50 bytes (2nd case causes flakies). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
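The flake described above (sometimes two blocks of 100+100 bytes move, sometimes three of 100+100+50) is the typical behavior of a greedy scheduler that stops only once the target is met, so the total moved depends on block selection order. A minimal sketch of that effect, with a hypothetical 200-byte target and the test's block sizes (this is an illustration of the symptom, not the Balancer's actual scheduling code):

```java
public class BalancerOvershootSketch {

    // Greedily schedule block moves until at least `target` bytes are
    // scheduled; the last pick can overshoot.
    static long schedule(long target, long[] blockSizes) {
        long moved = 0;
        for (long size : blockSizes) {
            if (moved >= target) {
                break;
            }
            moved += size;
        }
        return moved;
    }

    public static void main(String[] args) {
        // Same sizes, different selection order => different totals.
        System.out.println(schedule(200, new long[]{100, 100, 50})); // 200
        System.out.println(schedule(200, new long[]{100, 50, 100})); // 250
    }
}
```

An assertion in the test that expects exactly 200 bytes moved would then fail whenever the 100+100+50 ordering is picked, which matches the "2nd case causes flakies" note.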
[jira] [Commented] (HDFS-15895) DFSAdmin#printOpenFiles has redundant String#format usage
[ https://issues.apache.org/jira/browse/HDFS-15895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304057#comment-17304057 ] Viraj Jasani commented on HDFS-15895: - Thanks [~tasanuma] > DFSAdmin#printOpenFiles has redundant String#format usage > - > > Key: HDFS-15895 > URL: https://issues.apache.org/jira/browse/HDFS-15895 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
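The Jira above carries no diff in this digest, so the following is only a generic illustration of the pattern the title names: calling String#format where no formatting actually happens, which adds parsing overhead for no benefit. The variable names are made up for the example.

```java
public class RedundantFormatSketch {
    public static void main(String[] args) {
        String path = "/tmp/file0";
        // Redundant: String.format is invoked although "%s" alone does no
        // real formatting work.
        String formatted = String.format("%s", path);
        // Equivalent without the Formatter machinery:
        String direct = path;
        System.out.println(formatted.equals(direct)); // prints "true"
    }
}
```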
[jira] [Work logged] (HDFS-15850) Superuser actions should be reported to external enforcers
[ https://issues.apache.org/jira/browse/HDFS-15850?focusedWorklogId=568217=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568217 ] ASF GitHub Bot logged work on HDFS-15850: - Author: ASF GitHub Bot Created on: 18/Mar/21 10:04 Start Date: 18/Mar/21 10:04 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2784: URL: https://github.com/apache/hadoop/pull/2784#issuecomment-801791798 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 7s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 38s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 25m 51s | | trunk passed | | +1 :green_heart: | compile | 5m 44s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 5m 22s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 25s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 14s | | trunk passed | | +1 :green_heart: | javadoc | 1m 49s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 2m 34s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 3s | | trunk passed | | +1 :green_heart: | shadedclient | 19m 49s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 23s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 5s | | the patch passed | | +1 :green_heart: | compile | 5m 58s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 5m 58s | | the patch passed | | +1 :green_heart: | compile | 5m 28s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 5m 28s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 19s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/2/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 3 new + 495 unchanged - 1 fixed = 498 total (was 496) | | +1 :green_heart: | mvnsite | 2m 5s | | the patch passed | | +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 2m 23s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 11s | | the patch passed | | +1 :green_heart: | shadedclient | 18m 41s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 373m 12s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | unit | 23m 22s | | hadoop-hdfs-rbf in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 529m 50s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestStateAlignmentContextWithHA | | | hadoop.hdfs.server.datanode.TestBlockScanner | | | hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks | | | hadoop.hdfs.server.namenode.TestFileTruncate | | | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby | | | hadoop.hdfs.TestPersistBlocks | | | hadoop.hdfs.server.namenode.ha.TestEditLogTailer | | | hadoop.hdfs.TestDFSShell | | | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots | | | hadoop.hdfs.TestSnapshotCommands | | |
[jira] [Updated] (HDFS-15894) Trace Time-consuming RPC response of certain threshold.
[ https://issues.apache.org/jira/browse/HDFS-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renukaprasad C updated HDFS-15894:
----------------------------------
    Attachment: HDFS-15894.003.patch

> Trace Time-consuming RPC response of certain threshold.
> --------------------------------------------------------
>
>                 Key: HDFS-15894
>                 URL: https://issues.apache.org/jira/browse/HDFS-15894
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Renukaprasad C
>            Assignee: Renukaprasad C
>            Priority: Major
>         Attachments: HDFS-15894.001.patch, HDFS-15894.002.patch, HDFS-15894.003.patch
>
> Monitor & trace time-consuming RPC requests.
> Sometimes RPC requests get delayed, which impacts system performance. Currently, there is no tracking of delayed RPC requests.
> We can log such delayed RPC calls when they exceed a certain threshold.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
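The description above proposes logging RPC calls whose handling time exceeds a configurable threshold. A minimal sketch of that idea follows; the class and method names (`SlowCallLogger`, `logIfSlow`) are hypothetical illustrations, not the contents of the attached patches:

```java
// Hedged sketch, not the actual HDFS-15894 patch: record the elapsed time of
// each RPC and emit a warning only when it crosses a configured threshold.
public class SlowCallLogger {
    private final long thresholdMs;

    public SlowCallLogger(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /**
     * Logs the call and returns true when its elapsed time exceeds the
     * threshold; fast calls produce no log traffic at all.
     */
    public boolean logIfSlow(String method, long elapsedMs) {
        if (elapsedMs > thresholdMs) {
            System.err.println("Slow RPC: " + method + " took " + elapsedMs
                + " ms (threshold " + thresholdMs + " ms)");
            return true;
        }
        return false;
    }
}
```

Keeping the check to a single comparison per call means the tracing adds essentially no overhead on the common (fast) path, which matters on a busy NameNode.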
[jira] [Commented] (HDFS-14731) [FGL] Remove redundant locking on NameNode.
[ https://issues.apache.org/jira/browse/HDFS-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303916#comment-17303916 ]

Jeffrey(Xilang) Yan commented on HDFS-14731:
--------------------------------------------
Is it possible to backport this PR to Hadoop 2?

> [FGL] Remove redundant locking on NameNode.
> -------------------------------------------
>
>                 Key: HDFS-14731
>                 URL: https://issues.apache.org/jira/browse/HDFS-14731
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>            Priority: Major
>             Fix For: 3.3.0, 3.1.4, 3.2.2
>
>         Attachments: HDFS-14731.001.patch
>
> Currently the NameNode has two global locks: FSNamesystemLock and FSDirectoryLock. An analysis shows that the single FSNamesystemLock is sufficient to guarantee consistency of the NameNode state. FSDirectoryLock can be removed.
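The redundancy the issue describes can be shown with a toy example; this is an illustration of the general pattern, not the actual NameNode code. When every path that acquires the inner directory lock already holds the namesystem lock, the inner lock adds overhead without adding protection:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy illustration of the two-global-locks pattern discussed above.
public class NestedLockDemo {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    private final ReentrantReadWriteLock dirLock = new ReentrantReadWriteLock();
    private int edits = 0;

    // Before: both global locks taken for a directory mutation.
    public void mutateWithBothLocks() {
        fsLock.writeLock().lock();
        dirLock.writeLock().lock(); // redundant: fsLock already excludes other writers
        try {
            edits++;
        } finally {
            dirLock.writeLock().unlock();
            fsLock.writeLock().unlock();
        }
    }

    // After: the single namesystem-level lock is sufficient.
    public void mutateWithSingleLock() {
        fsLock.writeLock().lock();
        try {
            edits++;
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    public int getEdits() {
        fsLock.readLock().lock();
        try {
            return edits;
        } finally {
            fsLock.readLock().unlock();
        }
    }
}
```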
[jira] [Commented] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303885#comment-17303885 ]

Akira Ajisaka commented on HDFS-15900:
--------------------------------------
bq. if it's ok to have several records sharing the same nameserviceId in activeNamespaces

IMO, there may be multiple active NameNodes if RBF supports Observer NameNodes in the future, so it's okay to have several records sharing the same nameserviceId in activeNamespaces. However, it's not okay to have UNAVAILABLE NameNode registrations in activeNamespaces. (I used "we expect" because the source code is written that way; sorry for the confusion.)

> RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
> -------------------------------------------------------------------
>
>                 Key: HDFS-15900
>                 URL: https://issues.apache.org/jira/browse/HDFS-15900
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: rbf
>    Affects Versions: 3.3.0
>            Reporter: Harunobu Daikoku
>            Assignee: Harunobu Daikoku
>            Priority: Major
>         Attachments: image.png
>
> We observed that when a NameNode becomes UNAVAILABLE, the corresponding blockpool id in MembershipStoreImpl#activeNamespaces on the dfsrouter is unintentionally set to empty, its initial value.
> !image.png|height=250!
> As a result, concat operations through the dfsrouter fail with the following error, as the block id cannot be resolved in the recognized active namespaces.
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): Cannot locate a nameservice for block pool BP-...
> {noformat}
> A possible fix is to ignore UNAVAILABLE NameNode registrations, and set proper namespace information obtained from available NameNode registrations when constructing the cache of active namespaces.
>
> [https://github.com/apache/hadoop/blob/rel/release-3.3.0/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/store/impl/MembershipStoreImpl.java#L207-L221]
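The proposed fix — skipping UNAVAILABLE registrations when rebuilding the active-namespace cache — can be sketched as follows. The types here (`Registration`, `State`, `build`) are simplified stand-ins for the Router's MembershipState machinery, not the actual RBF classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of the HDFS-15900 fix idea: an empty blockpool id from a
// down NameNode must never overwrite the real one in the cache.
public class ActiveNamespaceCache {
    enum State { ACTIVE, STANDBY, UNAVAILABLE }

    record Registration(String nameserviceId, String blockPoolId, State state) {}

    /** Map each nameservice to a non-empty blockpool id from an available NN. */
    static Map<String, String> build(List<Registration> regs) {
        Map<String, String> active = new HashMap<>();
        for (Registration r : regs) {
            if (r.state() == State.UNAVAILABLE) {
                continue; // proposed fix: ignore UNAVAILABLE registrations
            }
            if (!r.blockPoolId().isEmpty()) {
                active.put(r.nameserviceId(), r.blockPoolId());
            }
        }
        return active;
    }
}
```

With this filter, an UNAVAILABLE registration carrying the initial (empty) blockpool id can no longer mask the valid id supplied by a healthy NameNode of the same nameservice.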
[jira] [Assigned] (HDFS-15900) RBF: empty blockpool id on dfsrouter caused by UNAVAILABLE NameNode
[ https://issues.apache.org/jira/browse/HDFS-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira Ajisaka reassigned HDFS-15900:
------------------------------------
    Assignee: Harunobu Daikoku
[jira] [Commented] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
[ https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303875#comment-17303875 ]

Jinglun commented on HDFS-15899:
--------------------------------
Submitted v02 to fix checkstyle. The failed unit tests are not related.

> Remove rpcThreadPool from DeadNodeDetector.
> -------------------------------------------
>
>                 Key: HDFS-15899
>                 URL: https://issues.apache.org/jira/browse/HDFS-15899
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Jinglun
>            Assignee: Jinglun
>            Priority: Major
>         Attachments: HDFS-15899.001.patch, HDFS-15899.002.patch
>
> The DeadNodeDetector uses a thread pool to make all the probe RPC calls. The purpose is to use the thread pool timeout to monitor the probe timeout. But the RPC client already has a timeout, so we can use the RPC client timeout instead of the thread pool timeout and remove the rpcThreadPool.
> The rpcThreadPool introduces additional complexity for probing the DataNode: a probe task waiting in a busy rpcThreadPool might exceed the configured timeout, and the task will then be marked as failed even though it was never scheduled.
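The failure mode described above — a probe declared timed out while still queued — can be demonstrated with a small sketch. The `Probe` interface is a stand-in for the real DataNode probe RPC, and the two methods merely contrast the pool-bounded and client-bounded styles; this is not the actual DeadNodeDetector code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hedged sketch of the two probing styles discussed in HDFS-15899.
public class ProbeStyles {
    interface Probe { boolean call() throws Exception; }

    // Before: Future.get(timeout) counts time spent queued in a busy pool,
    // so a probe can "fail" without ever running.
    static boolean probeViaPool(ExecutorService pool, Probe p, long timeoutMs) {
        Future<Boolean> f = pool.submit(p::call);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            f.cancel(true);
            return false;
        }
    }

    // After: call directly and let the RPC client's own configured timeout
    // bound the wait, removing the pool from the timing path.
    static boolean probeDirect(Probe p) {
        try {
            return p.call();
        } catch (Exception e) {
            return false;
        }
    }
}
```

The direct style is what the issue argues for: since the RPC layer already enforces a timeout, the pool's second timeout only adds the queueing-time false-failure case.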
[jira] [Updated] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
[ https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinglun updated HDFS-15899:
---------------------------
    Attachment: HDFS-15899.002.patch
[jira] [Work logged] (HDFS-15850) Superuser actions should be reported to external enforcers
[ https://issues.apache.org/jira/browse/HDFS-15850?focusedWorklogId=568100=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-568100 ]

ASF GitHub Bot logged work on HDFS-15850:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 18/Mar/21 06:12
    Start Date: 18/Mar/21 06:12
    Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #2784:
URL: https://github.com/apache/hadoop/pull/2784#issuecomment-801659165

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 52s | | Docker mode activated. |
| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 13m 57s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 23m 3s | | trunk passed |
| +1 :green_heart: | compile | 5m 22s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | compile | 4m 37s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 16s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 58s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 30s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javadoc | 2m 16s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | spotbugs | 4m 25s | | trunk passed |
| +1 :green_heart: | shadedclient | 16m 48s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 44s | | the patch passed |
| +1 :green_heart: | compile | 5m 0s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javac | 5m 0s | | the patch passed |
| +1 :green_heart: | compile | 4m 34s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | javac | 4m 34s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 9s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/1/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 2 new + 317 unchanged - 1 fixed = 319 total (was 318) |
| +1 :green_heart: | mvnsite | 1m 47s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 16s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javadoc | 2m 4s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | spotbugs | 4m 38s | | the patch passed |
| +1 :green_heart: | shadedclient | 16m 50s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ |
| -1 :x: | unit | 371m 21s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| -1 :x: | unit | 25m 48s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2784/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 47s | | The patch does not generate ASF License warnings. |
| | | | 515m 17s | | |

| Reason | Tests |
|---:|:--|
| Failed junit tests | hadoop.hdfs.server.datanode.TestBlockScanner |
| | hadoop.fs.viewfs.TestViewFileSystemOverloadSchemeWithHdfsScheme |
| | hadoop.hdfs.TestViewDistributedFileSystemWithMountLinks |
| | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
| | hadoop.hdfs.TestPersistBlocks |
| | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
| | hadoop.hdfs.TestDFSShell |