[jira] [Updated] (HDFS-17272) NNThroughputBenchmark should support specifying the base directory for multi-client test
[ https://issues.apache.org/jira/browse/HDFS-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-17272:
---
    Status: Patch Available  (was: In Progress)
> NNThroughputBenchmark should support specifying the base directory for multi-client test
>
> Key: HDFS-17272
> URL: https://issues.apache.org/jira/browse/HDFS-17272
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
>
> Currently, NNThroughputBenchmark does not support specifying the base directory, and therefore does not support multiple clients running stress tests at the same time. However, on a high-performance NameNode machine, a single client submitting a stress test cannot drive NameNode RPC load to its bottleneck. Multiple clients must therefore test in parallel to bring the NameNode pressure up to the level of a large-scale production cluster.
> So I add a -baseDirName parameter to specify the base directory, which allows multiple clients to submit stress tests at the same time.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17272) NNThroughputBenchmark should support specifying the base directory for multi-client test
[ https://issues.apache.org/jira/browse/HDFS-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-17272:
---
    Status: In Progress  (was: Patch Available)
[jira] [Updated] (HDFS-17272) NNThroughputBenchmark should support specifying the base directory for multi-client test
[ https://issues.apache.org/jira/browse/HDFS-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-17272:
---
    Status: Patch Available  (was: Open)
[jira] [Created] (HDFS-17272) NNThroughputBenchmark should support specifying the base directory for multi-client test
caozhiqiang created HDFS-17272:
---
Summary: NNThroughputBenchmark should support specifying the base directory for multi-client test
Key: HDFS-17272
URL: https://issues.apache.org/jira/browse/HDFS-17272
Project: Hadoop HDFS
Issue Type: Improvement
Components: namenode
Affects Versions: 3.4.0
Reporter: caozhiqiang
Assignee: caozhiqiang

Currently, NNThroughputBenchmark does not support specifying the base directory, and therefore does not support multiple clients running stress tests at the same time. However, on a high-performance NameNode machine, a single client submitting a stress test cannot drive NameNode RPC load to its bottleneck. Multiple clients must therefore test in parallel to bring the NameNode pressure up to the level of a large-scale production cluster.
So I add a -baseDirName parameter to specify the base directory, which allows multiple clients to submit stress tests at the same time.
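As a rough illustration of the multi-client setup described in this issue, the sketch below builds one NNThroughputBenchmark argument list per client, each with its own base directory. Only the -baseDirName flag comes from this issue. The class and method names, the NameNode URI, the directory naming scheme, and the choice of the create operation with -threads and -files are assumptions made for the example; nothing is executed against a cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds per-client argument lists for NNThroughputBenchmark. */
final class MultiClientBenchmarkArgs {

  /**
   * Returns one argument list per client. Each client gets a distinct
   * -baseDirName so that concurrent runs do not share a working directory.
   */
  static List<List<String>> buildArgs(String nameNodeUri, int clients,
                                      int threads, int files) {
    List<List<String>> all = new ArrayList<>();
    for (int i = 0; i < clients; i++) {
      // Assumed naming scheme for illustration: one directory per client index.
      String baseDir = "/nnThroughputBenchmark-client" + i;
      all.add(List.of(
          "-fs", nameNodeUri,
          "-op", "create",
          "-threads", String.valueOf(threads),
          "-files", String.valueOf(files),
          "-baseDirName", baseDir));
    }
    return all;
  }

  public static void main(String[] args) {
    // Print the command line each client would run (the URI is a placeholder).
    for (List<String> a : buildArgs("hdfs://nn.example.com:8020", 3, 8, 1000)) {
      System.out.println(
          "hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark "
              + String.join(" ", a));
    }
  }
}
```

Each printed command would be started on a different client host at roughly the same time, so that the combined RPC load lands on one NameNode while the clients write under separate directories.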
[jira] [Updated] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-15869:
---
    Attachment: 2.png
> Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
>
> Key: HDFS-15869
> URL: https://issues.apache.org/jira/browse/HDFS-15869
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: fs async, namenode
> Affects Versions: 3.2.2
> Reporter: Haoze Wu
> Assignee: Haoze Wu
> Priority: Major
> Labels: pull-request-available
> Attachments: 1.png, 2.png
> Time Spent: 6.5h
> Remaining Estimate: 0h
>
> We were doing some testing of the latest Hadoop stable release 3.2.2 and found that some network issues can cause the namenode to hang even with the async edit logging (FSEditLogAsync).
> The workflow of the FSEditLogAsync thread is basically:
> # get EditLog from a queue (line 229)
> # do the transaction (line 232)
> # sync the log if doSync (line 243)
> # do logSyncNotify (line 248)
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();                      // line 229
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();                      // line 232
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());              // line 243
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);                 // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
> {code}
> In step 4, FSEditLogAsync$RpcEdit.logSyncNotify is essentially doing a network write (line 365).
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();                          // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>     // ...
>   }
> {code}
> If the sendResponse operation in line 365 gets stuck, then the whole FSEditLogAsync thread is not able to proceed. In this case, the critical logSync (line 243) can’t be executed for the incoming transactions, and the namenode hangs. This is undesirable because FSEditLogAsync’s key feature is asynchronous edit logging that is supposed to tolerate slow I/O.
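To make the failure mode concrete: in the quoted loop, step 4 runs on the same thread as steps 1 to 3, so a blocked notify stalls every later sync. The following is a minimal sketch of one possible way to decouple the two, handing the notify step to a separate executor. It is not the HDFS-15869 patch and not Hadoop code. `DecoupledNotifySketch`, `processEdits`, `demo`, and the `Runnable` standing in for `logSyncNotify` are hypothetical names, and the sync step is modelled by a simple counter.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of an edit loop whose notify step runs off-thread. */
final class DecoupledNotifySketch {

  /**
   * Runs a simplified edit loop. The notify step, which may block on
   * network I/O, is submitted to 'notifier' instead of being run inline.
   * Returns the number of edits synced.
   */
  static int processEdits(int edits, ExecutorService notifier, Runnable notify) {
    int synced = 0;
    for (int i = 0; i < edits; i++) {
      synced++;                  // stands in for logEdit() + logSync()
      notifier.execute(notify);  // stands in for edit.logSyncNotify(...)
    }
    return synced;
  }

  /**
   * Demo: every notify blocks until 'gate' opens, simulating a stuck
   * network write. Returns {synced while blocked, notified while blocked,
   * notified after release}.
   */
  static int[] demo() {
    CountDownLatch gate = new CountDownLatch(1);
    AtomicInteger notified = new AtomicInteger();
    ExecutorService notifier = Executors.newSingleThreadExecutor();
    Runnable stuckNotify = () -> {
      try {
        gate.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      notified.incrementAndGet();
    };
    int synced = processEdits(5, notifier, stuckNotify);
    int notifiedWhileBlocked = notified.get();
    gate.countDown();            // the simulated network recovers
    notifier.shutdown();
    try {
      notifier.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return new int[] {synced, notifiedWhileBlocked, notified.get()};
  }

  public static void main(String[] args) {
    int[] r = demo();
    System.out.println("synced=" + r[0] + " notifiedWhileBlocked=" + r[1]
        + " notifiedAfterRelease=" + r[2]);
  }
}
```

With a single-threaded executor backed by an unbounded queue, `execute` returns immediately, so the loop keeps syncing while a notify is stuck. The cost is an extra hand-off per edit and a queue that can grow while responses are blocked, so any design along these lines would need to be measured under load.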
> To see why the sendResponse operation in line 365 may get stuck, here is the stack trace:
> {code:java}
> '(org.apache.hadoop.ipc.Server,channelWrite,3593)',
> '(org.apache.hadoop.ipc.Server,access$1700,139)',
> '(org.apache.hadoop.ipc.Server$Responder,processResponse,1657)',
> '(org.apache.hadoop.ipc.Server$Responder,doRespond,1727)',
> '(org.apache.hadoop.ipc.Server$Connection,sendResponse,2828)',
> '(org.apache.hadoop.ipc.Server$Connection,access$300,1799)',
> '(org.apache.hadoop.ipc.Server$RpcCall,doResponse,)',
> '(org.apache.hadoop.ipc.Server$Call,doResponse,903)',
> '(org.apache.hadoop.ipc.Server$Call,sendResponse,889)',
> '(org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$RpcEdit,logSyncNotify,365)',
> '(org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync,run,248)',
> '(java.lang.Thread,run,748)'
> {code}
> The `channelWrite` function is defined as follows:
> {code:java}
> //hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
>   private int channelWrite(WritableByteChannel channel,
>                            ByteBuffer
[jira] [Commented] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753092#comment-17753092 ]
caozhiqiang commented on HDFS-15869:
---
After applying this patch, I found that NameNode performance degraded noticeably when measured with the NNThroughputBenchmark tool. The flame graph produced by async-profiler shows that most of the CPU time is spent on lock contention.
!1.png|width=568,height=276!
!2.png|width=717,height=184!
[jira] [Updated] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-15869:
---
    Attachment: 1.png
[jira] [Updated] (HDFS-16983) Whether checking path access permissions should be decided by dfs.permissions.enabled in concat operation
[ https://issues.apache.org/jira/browse/HDFS-16983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-16983:
---
    Description:
In the concat RPC, FSDirConcatOp::verifySrcFiles() is called to check the source files, and this function performs a permission check on the sources. Whether to perform the permission check should be decided by the dfs.permissions.enabled configuration. However, the 'pc' parameter is never null, so the check always runs.
So we should change 'if (pc != null)' to 'if (fsd.isPermissionEnabled())'.
{code:java}
// permission check for srcs
if (pc != null) {
  fsd.checkPathAccess(pc, iip, FsAction.READ); // read the file
  fsd.checkParentAccess(pc, iip, FsAction.WRITE); // for delete
}
{code}
  was:
In the concat RPC, FSDirConcatOp::verifySrcFiles() is called to check the source files, and this function performs a permission check on the sources. Whether to perform the permission check should be decided by the dfs.permissions.enabled configuration. However, the 'pc' parameter is never null, so the check always runs.
{code:java}
// permission check for srcs
if (pc != null) {
  fsd.checkPathAccess(pc, iip, FsAction.READ); // read the file
  fsd.checkParentAccess(pc, iip, FsAction.WRITE); // for delete
}
{code}
> Whether checking path access permissions should be decided by dfs.permissions.enabled in concat operation
>
> Key: HDFS-16983
> URL: https://issues.apache.org/jira/browse/HDFS-16983
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
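The difference between the two guards can be shown in isolation. The sketch below uses hypothetical stand-in types (`Checker`, `Directory`) rather than the real HDFS permission checker and directory classes, and models only the condition that decides whether the check runs. The method names `verifyOld` and `verifyNew` are likewise invented for the example.

```java
/** Hypothetical stand-ins for illustration; not the real HDFS classes. */
final class ConcatPermissionSketch {

  /** Stand-in for the permission checker; counts how often it is consulted. */
  static final class Checker {
    int checks = 0;

    void checkAccess(String path) {
      checks++;
    }
  }

  /** Stand-in for the directory, carrying the dfs.permissions.enabled flag. */
  static final class Directory {
    private final boolean permissionEnabled;

    Directory(boolean permissionEnabled) {
      this.permissionEnabled = permissionEnabled;
    }

    boolean isPermissionEnabled() {
      return permissionEnabled;
    }
  }

  /** Current guard: runs whenever a checker exists, which the issue says is always. */
  static void verifyOld(Directory fsd, Checker pc, String src) {
    if (pc != null) {
      pc.checkAccess(src);
    }
  }

  /** Proposed guard: runs only when permissions are enabled. */
  static void verifyNew(Directory fsd, Checker pc, String src) {
    if (fsd.isPermissionEnabled()) {
      pc.checkAccess(src);
    }
  }

  public static void main(String[] args) {
    Directory disabled = new Directory(false);
    Checker oldPc = new Checker();
    Checker newPc = new Checker();
    verifyOld(disabled, oldPc, "/src/part-0");
    verifyNew(disabled, newPc, "/src/part-0");
    System.out.println("old guard checks=" + oldPc.checks
        + ", new guard checks=" + newPc.checks);
  }
}
```

With permissions disabled, the old guard still consults the checker once while the proposed guard skips it, which is the behavior change the issue asks for.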
[jira] [Updated] (HDFS-16983) Whether checking path access permissions should be decided by dfs.permissions.enabled in concat operation
[ https://issues.apache.org/jira/browse/HDFS-16983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-16983:
---
    Status: Patch Available  (was: In Progress)
[jira] [Created] (HDFS-16983) Whether checking path access permissions should be decided by dfs.permissions.enabled in concat operation
caozhiqiang created HDFS-16983:
---
Summary: Whether checking path access permissions should be decided by dfs.permissions.enabled in concat operation
Key: HDFS-16983
URL: https://issues.apache.org/jira/browse/HDFS-16983
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.4.0
Reporter: caozhiqiang
Assignee: caozhiqiang

In the concat RPC, FSDirConcatOp::verifySrcFiles() is called to check the source files, and this function performs a permission check on the sources. Whether to perform the permission check should be decided by the dfs.permissions.enabled configuration. However, the 'pc' parameter is never null, so the check always runs.
{code:java}
// permission check for srcs
if (pc != null) {
  fsd.checkPathAccess(pc, iip, FsAction.READ); // read the file
  fsd.checkParentAccess(pc, iip, FsAction.WRITE); // for delete
}
{code}
[jira] [Commented] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635400#comment-17635400 ]
caozhiqiang commented on HDFS-16613:
---
[~tasanuma], OK, I have created an issue, [HDFS-16846|https://issues.apache.org/jira/browse/HDFS-16846]. Please help review it.
> EC: Improve performance of decommissioning dn with many ec blocks
>
> Key: HDFS-16613
> URL: https://issues.apache.org/jira/browse/HDFS-16613
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, erasure-coding, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
> Attachments: image-2022-06-07-11-46-42-389.png, image-2022-06-07-17-42-16-075.png, image-2022-06-07-17-45-45-316.png, image-2022-06-07-17-51-04-876.png, image-2022-06-07-17-55-40-203.png, image-2022-06-08-11-38-29-664.png, image-2022-06-08-11-41-11-127.png
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> In an HDFS cluster with a lot of EC blocks, decommissioning a DN is very slow. The reason is that, unlike replicated blocks, which can be copied from any DN holding a replica of the same block, EC blocks have to be copied from the decommissioning DN itself.
> The configurations dfs.namenode.replication.max-streams and dfs.namenode.replication.max-streams-hard-limit limit the replication speed, but increasing them puts the whole cluster's network at risk. So a new configuration should be added to limit the decommissioning DN, distinct from the cluster-wide max-streams limit.
[jira] [Updated] (HDFS-16846) EC: Only EC blocks should be effected by max-streams-hard-limit configuration
[ https://issues.apache.org/jira/browse/HDFS-16846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-16846:
---
    Status: Patch Available  (was: In Progress)
> EC: Only EC blocks should be effected by max-streams-hard-limit configuration
>
> Key: HDFS-16846
> URL: https://issues.apache.org/jira/browse/HDFS-16846
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
>
> In [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613], the dfs.namenode.replication.max-streams-hard-limit configuration affects only decommissioning DataNodes, but it does not distinguish between replicated blocks and EC blocks. Even if DataNodes have only replicated files, they will still generate high network traffic. So this configuration should affect only EC blocks.
[jira] [Updated] (HDFS-16846) EC: Only EC blocks should be effected by max-streams-hard-limit configuration
[ https://issues.apache.org/jira/browse/HDFS-16846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-16846:
---
    Component/s: ec
[jira] [Updated] (HDFS-16846) EC: Only EC blocks should be effected by max-streams-hard-limit configuration
[ https://issues.apache.org/jira/browse/HDFS-16846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
caozhiqiang updated HDFS-16846:
---
    Summary: EC: Only EC blocks should be effected by max-streams-hard-limit configuration  (was: EC: Only EC blocks shoud be effect by max-streams-hard-limit configuration)
[jira] [Created] (HDFS-16846) EC: Only EC blocks shoud be effect by max-streams-hard-limit configuration
caozhiqiang created HDFS-16846:
---
Summary: EC: Only EC blocks shoud be effect by max-streams-hard-limit configuration
Key: HDFS-16846
URL: https://issues.apache.org/jira/browse/HDFS-16846
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 3.4.0
Reporter: caozhiqiang
Assignee: caozhiqiang

In [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613], the dfs.namenode.replication.max-streams-hard-limit configuration affects only decommissioning DataNodes, but it does not distinguish between replicated blocks and EC blocks. Even if DataNodes have only replicated files, they will still generate high network traffic. So this configuration should affect only EC blocks.
[jira] [Commented] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634860#comment-17634860 ]
caozhiqiang commented on HDFS-16613:
---
[~tasanuma], Yes, with this change, the dfs.namenode.replication.max-streams-hard-limit configuration affects only decommissioning DataNodes, but it does not distinguish between replicated blocks and EC blocks. If you think this configuration should affect only EC blocks, we can change the code in DatanodeManager as below:
{code:java}
int maxReplicaTransfers = blockManager.getMaxReplicationStreams()
    - xmitsInProgress;
int maxEcTransfers;
if (nodeinfo.isDecommissionInProgress()) {
  maxEcTransfers = blockManager.getReplicationStreamsHardLimit()
      - xmitsInProgress;
} else {
  maxEcTransfers = blockManager.getMaxReplicationStreams()
      - xmitsInProgress;
}
int numReplicationTasks = (int) Math.ceil(
    (double) (totalReplicateBlocks * maxReplicaTransfers) / totalBlocks);
int numECTasks = (int) Math.ceil(
    (double) (totalECBlocks * maxEcTransfers) / totalBlocks);
{code}
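To see what the snippet in the comment above computes, here is a self-contained sketch of the same arithmetic with plain integers. The class and method names are hypothetical, the inputs are made-up example values, and `totalBlocks` is assumed to be the sum of the replicated and EC block counts, which the comment does not state.

```java
/** Hypothetical worked example of the task-split arithmetic; not Hadoop code. */
final class TransferBudgetSketch {

  /**
   * Returns {numReplicationTasks, numECTasks} for one DataNode.
   * Replicated blocks always use the normal max-streams budget; EC blocks
   * use the hard limit only while the node is decommissioning.
   */
  static int[] taskCounts(boolean decommissionInProgress, int maxStreams,
                          int hardLimit, int xmitsInProgress,
                          int totalReplicateBlocks, int totalECBlocks) {
    // Assumption: the denominator is the combined number of pending blocks.
    int totalBlocks = totalReplicateBlocks + totalECBlocks;
    if (totalBlocks == 0) {
      return new int[] {0, 0};
    }
    int maxReplicaTransfers = maxStreams - xmitsInProgress;
    int maxEcTransfers = decommissionInProgress
        ? hardLimit - xmitsInProgress
        : maxStreams - xmitsInProgress;
    int numReplicationTasks = (int) Math.ceil(
        (double) (totalReplicateBlocks * maxReplicaTransfers) / totalBlocks);
    int numECTasks = (int) Math.ceil(
        (double) (totalECBlocks * maxEcTransfers) / totalBlocks);
    return new int[] {numReplicationTasks, numECTasks};
  }

  public static void main(String[] args) {
    // Example values: max-streams 2, hard limit 8, 2 replicated and 6 EC blocks.
    int[] decom = taskCounts(true, 2, 8, 0, 2, 6);
    int[] normal = taskCounts(false, 2, 8, 0, 2, 6);
    System.out.println("decommissioning: replication=" + decom[0] + " ec=" + decom[1]);
    System.out.println("normal:          replication=" + normal[0] + " ec=" + normal[1]);
  }
}
```

With these example inputs the replication share stays at 1 task in both cases, while the EC share rises from 2 tasks on a normal node to 6 on a decommissioning node, so only EC work benefits from the hard limit.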
[jira] [Commented] (HDFS-16663) Allow block reconstruction pending timeout refreshable to increase decommission performance
[ https://issues.apache.org/jira/browse/HDFS-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598800#comment-17598800 ] caozhiqiang commented on HDFS-16663: [~tasanuma] [~hexiaoqiao] [~weichiu] [~haiyang Hu] [~hadachi] , Could you help to continue reviewing this patch? Thanks. > Allow block reconstruction pending timeout refreshable to increase > decommission performance > --- > > Key: HDFS-16663 > URL: https://issues.apache.org/jira/browse/HDFS-16663 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > In HDFS-16613, increase the value of > dfs.namenode.replication.max-streams-hard-limit would maximize the IO > performance of the decommissioning DN, which has a lot of EC blocks. Besides > this, we also need to decrease the value of > dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to > shorten the interval time for checking pendingReconstructions. Or the > decommissioning node would be idle to wait for copy tasks in most of this 5 > minutes. > In decommission progress, we may need to reconfigure these 2 parameters > several times. In HDFS-14560, the > dfs.namenode.replication.max-streams-hard-limit can already be reconfigured > dynamically without namenode restart. And the > dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be > reconfigured dynamically. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16663) Allow block reconstruction pending timeout refreshable to increase decommission performance
[ https://issues.apache.org/jira/browse/HDFS-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570594#comment-17570594 ] caozhiqiang commented on HDFS-16663: [~hadachi] , Thank you for your review! > Allow block reconstruction pending timeout refreshable to increase > decommission performance > --- > > Key: HDFS-16663 > URL: https://issues.apache.org/jira/browse/HDFS-16663 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > In HDFS-16613, increase the value of > dfs.namenode.replication.max-streams-hard-limit would maximize the IO > performance of the decommissioning DN, which has a lot of EC blocks. Besides > this, we also need to decrease the value of > dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to > shorten the interval time for checking pendingReconstructions. Or the > decommissioning node would be idle to wait for copy tasks in most of this 5 > minutes. > In decommission progress, we may need to reconfigure these 2 parameters > several times. In HDFS-14560, the > dfs.namenode.replication.max-streams-hard-limit can already be reconfigured > dynamically without namenode restart. And the > dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be > reconfigured dynamically. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16663) Allow block reconstruction pending timeout refreshable to increase decommission performance
[ https://issues.apache.org/jira/browse/HDFS-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568417#comment-17568417 ]

caozhiqiang commented on HDFS-16663:
------------------------------------

[~hadachi] [~tasanuma], would you help review this patch if you have time? This is related to [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613]. Thank you.

> Allow block reconstruction pending timeout refreshable to increase decommission performance
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16663
>                 URL: https://issues.apache.org/jira/browse/HDFS-16663
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In HDFS-16613, increase the value of dfs.namenode.replication.max-streams-hard-limit would maximize the IO performance of the decommissioning DN, which has a lot of EC blocks. Besides this, we also need to decrease the value of dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to shorten the interval time for checking pendingReconstructions. Or the decommissioning node would be idle to wait for copy tasks in most of this 5 minutes.
> In decommission progress, we may need to reconfigure these 2 parameters several times. In HDFS-14560, the dfs.namenode.replication.max-streams-hard-limit can already be reconfigured dynamically without namenode restart. And the dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be reconfigured dynamically.
[jira] [Updated] (HDFS-16663) Allow block reconstruction pending timeout refreshable to increase decommission performance
[ https://issues.apache.org/jira/browse/HDFS-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16663: --- Description: In HDFS-16613, increase the value of dfs.namenode.replication.max-streams-hard-limit would maximize the IO performance of the decommissioning DN, which has a lot of EC blocks. Besides this, we also need to decrease the value of dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to shorten the interval time for checking pendingReconstructions. Or the decommissioning node would be idle to wait for copy tasks in most of this 5 minutes. In decommission progress, we may need to reconfigure these 2 parameters several times. In HDFS-14560, the dfs.namenode.replication.max-streams-hard-limit can already be reconfigured dynamically without namenode restart. And the dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be reconfigured dynamically. was: In HDFS-16613, increase the value of dfs.namenode.replication.max-streams-hard-limit would maximize the IO performance of the decommissioning DN, which has a lot of EC blocks. Besides this, we also need to decrease the value of dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to shorten the interval time for checking pendingReconstructions. Or the decommissioning node would be idle to wait for copy tasks in much time of this 5 minutes. In decommission progress, we may need to reconfigure these 2 parameters several times. In HDFS-14560, the dfs.namenode.replication.max-streams-hard-limit can already be reconfigured dynamically without namenode restart. And the dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be reconfigured dynamically. 
> Allow block reconstruction pending timeout refreshable to increase > decommission performance > --- > > Key: HDFS-16663 > URL: https://issues.apache.org/jira/browse/HDFS-16663 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > In HDFS-16613, increase the value of > dfs.namenode.replication.max-streams-hard-limit would maximize the IO > performance of the decommissioning DN, which has a lot of EC blocks. Besides > this, we also need to decrease the value of > dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to > shorten the interval time for checking pendingReconstructions. Or the > decommissioning node would be idle to wait for copy tasks in most of this 5 > minutes. > In decommission progress, we may need to reconfigure these 2 parameters > several times. In HDFS-14560, the > dfs.namenode.replication.max-streams-hard-limit can already be reconfigured > dynamically without namenode restart. And the > dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be > reconfigured dynamically. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16663) Allow block reconstruction pending timeout refreshable to increase decommission performance
[ https://issues.apache.org/jira/browse/HDFS-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16663: --- Description: In HDFS-16613, increase the value of dfs.namenode.replication.max-streams-hard-limit would maximize the IO performance of the decommissioning DN, which has a lot of EC blocks. Besides this, we also need to decrease the value of dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to shorten the interval time for checking pendingReconstructions. Or the decommissioning node would be idle to wait for copy tasks in much time of this 5 minutes. In decommission progress, we may need to reconfigure these 2 parameters several times. In HDFS-14560, the dfs.namenode.replication.max-streams-hard-limit can already be reconfigured dynamically without namenode restart. And the dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be reconfigured dynamically. was: In [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613], increase the value of dfs.namenode.replication.max-streams-hard-limit would maximize the IO performance of the decommissioning DN, witch has a lot of EC blocks. Besides this, we also need to decrease the value of dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to shorten the interval time for checking pendingReconstructions. Or the decommissioning node would be idle to wait for copy tasks in much time of this 5 minutes. In decommission progress, we may need to reconfigure these 2 parameters several times. In [HDFS-14560|https://issues.apache.org/jira/browse/HDFS-14560], the dfs.namenode.replication.max-streams-hard-limit can already be reconfigured dynamically without namenode restart. And the dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be reconfigured dynamically. 
> Allow block reconstruction pending timeout refreshable to increase > decommission performance > --- > > Key: HDFS-16663 > URL: https://issues.apache.org/jira/browse/HDFS-16663 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > In HDFS-16613, increase the value of > dfs.namenode.replication.max-streams-hard-limit would maximize the IO > performance of the decommissioning DN, which has a lot of EC blocks. Besides > this, we also need to decrease the value of > dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to > shorten the interval time for checking pendingReconstructions. Or the > decommissioning node would be idle to wait for copy tasks in much time of > this 5 minutes. > In decommission progress, we may need to reconfigure these 2 parameters > several times. In HDFS-14560, the > dfs.namenode.replication.max-streams-hard-limit can already be reconfigured > dynamically without namenode restart. And the > dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be > reconfigured dynamically. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16663) Allow block reconstruction pending timeout refreshable to increase decommission performance
[ https://issues.apache.org/jira/browse/HDFS-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16663 started by caozhiqiang. -- > Allow block reconstruction pending timeout refreshable to increase > decommission performance > --- > > Key: HDFS-16663 > URL: https://issues.apache.org/jira/browse/HDFS-16663 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613], increase > the value of dfs.namenode.replication.max-streams-hard-limit would maximize > the IO performance of the decommissioning DN, witch has a lot of EC blocks. > Besides this, we also need to decrease the value of > dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to > shorten the interval time for checking pendingReconstructions. Or the > decommissioning node would be idle to wait for copy tasks in much time of > this 5 minutes. > In decommission progress, we may need to reconfigure these 2 parameters > several times. In > [HDFS-14560|https://issues.apache.org/jira/browse/HDFS-14560], the > dfs.namenode.replication.max-streams-hard-limit can already be reconfigured > dynamically without namenode restart. And the > dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be > reconfigured dynamically. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16663) Allow block reconstruction pending timeout refreshable to increase decommission performance
[ https://issues.apache.org/jira/browse/HDFS-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16663: --- Status: Patch Available (was: In Progress) > Allow block reconstruction pending timeout refreshable to increase > decommission performance > --- > > Key: HDFS-16663 > URL: https://issues.apache.org/jira/browse/HDFS-16663 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613], increase > the value of dfs.namenode.replication.max-streams-hard-limit would maximize > the IO performance of the decommissioning DN, witch has a lot of EC blocks. > Besides this, we also need to decrease the value of > dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to > shorten the interval time for checking pendingReconstructions. Or the > decommissioning node would be idle to wait for copy tasks in much time of > this 5 minutes. > In decommission progress, we may need to reconfigure these 2 parameters > several times. In > [HDFS-14560|https://issues.apache.org/jira/browse/HDFS-14560], the > dfs.namenode.replication.max-streams-hard-limit can already be reconfigured > dynamically without namenode restart. And the > dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be > reconfigured dynamically. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16663) Allow block reconstruction pending timeout refreshable to increase decommission performance
[ https://issues.apache.org/jira/browse/HDFS-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16663: --- Summary: Allow block reconstruction pending timeout refreshable to increase decommission performance (was: Allow block reconstruction pending timeout to be refreshable) > Allow block reconstruction pending timeout refreshable to increase > decommission performance > --- > > Key: HDFS-16663 > URL: https://issues.apache.org/jira/browse/HDFS-16663 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > > In [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613], increase > the value of dfs.namenode.replication.max-streams-hard-limit would maximize > the IO performance of the decommissioning DN, witch has a lot of EC blocks. > Besides this, we also need to decrease the value of > dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to > shorten the interval time for checking pendingReconstructions. Or the > decommissioning node would be idle to wait for copy tasks in much time of > this 5 minutes. > In decommission progress, we may need to reconfigure these 2 parameters > several times. In > [HDFS-14560|https://issues.apache.org/jira/browse/HDFS-14560], the > dfs.namenode.replication.max-streams-hard-limit can already be reconfigured > dynamically without namenode restart. And the > dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be > reconfigured dynamically. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16663) Allow block reconstruction pending timeout to be refreshable
caozhiqiang created HDFS-16663: -- Summary: Allow block reconstruction pending timeout to be refreshable Key: HDFS-16663 URL: https://issues.apache.org/jira/browse/HDFS-16663 Project: Hadoop HDFS Issue Type: Improvement Components: ec, namenode Affects Versions: 3.4.0 Reporter: caozhiqiang Assignee: caozhiqiang In [HDFS-16613|https://issues.apache.org/jira/browse/HDFS-16613], increase the value of dfs.namenode.replication.max-streams-hard-limit would maximize the IO performance of the decommissioning DN, witch has a lot of EC blocks. Besides this, we also need to decrease the value of dfs.namenode.reconstruction.pending.timeout-sec, default is 5 minutes, to shorten the interval time for checking pendingReconstructions. Or the decommissioning node would be idle to wait for copy tasks in much time of this 5 minutes. In decommission progress, we may need to reconfigure these 2 parameters several times. In [HDFS-14560|https://issues.apache.org/jira/browse/HDFS-14560], the dfs.namenode.replication.max-streams-hard-limit can already be reconfigured dynamically without namenode restart. And the dfs.namenode.reconstruction.pending.timeout-sec parameter also need to be reconfigured dynamically. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
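Editor's note: the idea of a pending timeout that can be changed without a restart can be illustrated with a minimal sketch. This is not the patch attached to HDFS-16663; the class, its methods, and the validation rules are hypothetical, and only the property name and the 5 minute default come from the issue text.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a timeout that can be changed while a monitor thread is running. */
public class RefreshableTimeoutSketch {
    static final String KEY = "dfs.namenode.reconstruction.pending.timeout-sec";

    // volatile so that a monitor thread sees a new value without taking a lock
    private volatile long timeoutMs;

    RefreshableTimeoutSketch(long timeoutSec) {
        this.timeoutMs = TimeUnit.SECONDS.toMillis(timeoutSec);
    }

    /** Applies a new value for KEY; rejects other keys and non-positive values. */
    void reconfigure(String property, String newValue) {
        if (!KEY.equals(property)) {
            throw new IllegalArgumentException("not reconfigurable here: " + property);
        }
        long sec = Long.parseLong(newValue);
        if (sec <= 0) {
            throw new IllegalArgumentException(KEY + " must be positive, got " + sec);
        }
        timeoutMs = TimeUnit.SECONDS.toMillis(sec);
    }

    /** True if a request issued at startMs has been pending longer than the current timeout. */
    boolean isTimedOut(long startMs, long nowMs) {
        return nowMs - startMs > timeoutMs;
    }

    public static void main(String[] args) {
        RefreshableTimeoutSketch t = new RefreshableTimeoutSketch(300); // 5 minutes, as in the issue
        System.out.println("after 60s, 300s timeout: " + t.isTimedOut(0, 60_000));
        t.reconfigure(KEY, "30"); // shortened during a decommission
        System.out.println("after 60s, 30s timeout:  " + t.isTimedOut(0, 60_000));
    }
}
```

The same pending request flips from "still waiting" to "timed out" as soon as the value is lowered, which is why a shorter timeout keeps a decommissioning node from sitting idle.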
[jira] [Commented] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553020#comment-17553020 ] caozhiqiang commented on HDFS-16613: [~hadachi] , thank you. Could you help to review this PR [GitHub Pull Request #4398|https://github.com/apache/hadoop/pull/4398] if this approach works? > EC: Improve performance of decommissioning dn with many ec blocks > - > > Key: HDFS-16613 > URL: https://issues.apache.org/jira/browse/HDFS-16613 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, erasure-coding, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Attachments: image-2022-06-07-11-46-42-389.png, > image-2022-06-07-17-42-16-075.png, image-2022-06-07-17-45-45-316.png, > image-2022-06-07-17-51-04-876.png, image-2022-06-07-17-55-40-203.png, > image-2022-06-08-11-38-29-664.png, image-2022-06-08-11-41-11-127.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > In a hdfs cluster with a lot of EC blocks, decommission a dn is very slow. > The reason is unlike replication blocks can be replicated from any dn which > has the same block replication, the ec block have to be replicated from the > decommissioning dn. > The configurations dfs.namenode.replication.max-streams and > dfs.namenode.replication.max-streams-hard-limit will limit the replication > speed, but increase these configurations will create risk to the whole > cluster's network. So it should add a new configuration to limit the > decommissioning dn, distinguished from the cluster wide max-streams limit. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16626) Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
[ https://issues.apache.org/jira/browse/HDFS-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16626:
-------------------------------
    Issue Type: Bug  (was: Improvement)

> Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-16626
>                 URL: https://issues.apache.org/jira/browse/HDFS-16626
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namanode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-06-08-18-30-13-757.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In the output of command 'hdfs dfsadmin -report', the value of Under replicated blocks and ec Low redundancy block groups only contains the block number in BlockManager::neededReconstruction. It should also contain the block number in BlockManager::pendingReconstruction, include the timeout items. Specially, in some scenario, for example, decommission a dn with a lot of ec blocks, there would be a lot blocks in pendingReconstruction at a long time but neededReconstruction's size may be 0. That will confuse user and they can't access the real decommissioning progress.
> {code:java}
> Configured Capacity: 1036741707829248 (942.91 TB)
> Present Capacity: 983872491622400 (894.83 TB)
> DFS Remaining: 974247450424426 (886.07 TB)
> DFS Used: 9625041197974 (8.75 TB)
> DFS Used%: 0.98%
> Replicated Blocks:
>     Under replicated blocks: 0
>     Blocks with corrupt replicas: 0
>     Missing blocks: 0
>     Missing blocks (with replication factor 1): 0
>     Low redundancy blocks with highest priority to recover: 0
>     Pending deletion blocks: 0
> Erasure Coded Block Groups:
>     Low redundancy block groups: 3481
>     Block groups with corrupt internal blocks: 0
>     Missing block groups: 0
>     Low redundancy blocks with highest priority to recover: 0
>     Pending deletion blocks: 245
> {code}
> The below graph show the metrics monitor of under_replicated_blocks and pending_replicated_blocks in decommissioning a datanode process. The value of pending_replicated_blocks would not be included in dfsadmin report.
> !image-2022-06-08-18-30-13-757.png|width=836,height=157!
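Editor's note: the counting change requested in HDFS-16626 amounts to simple arithmetic, sketched below. The class and method names are hypothetical and do not exist in BlockManager, and the numbers in main are invented for illustration rather than read from the report quoted in the issue.

```java
/** Hypothetical sketch of the counting change requested in HDFS-16626. */
public class LowRedundancyCountSketch {

    /** Current behaviour as described in the issue: only neededReconstruction is counted. */
    static long reportedNow(long needed) {
        return needed;
    }

    /** Requested behaviour: also count pendingReconstruction, including its timed-out items. */
    static long reportedProposed(long needed, long pending, long timedOut) {
        return needed + pending + timedOut;
    }

    public static void main(String[] args) {
        // Made-up snapshot of a decommission in progress: nothing queued, much in flight.
        long needed = 0, pending = 120, timedOut = 15;
        System.out.println("reported now:      " + reportedNow(needed));
        System.out.println("reported proposed: " + reportedProposed(needed, pending, timedOut));
    }
}
```

A report that shows 0 while work is still in flight is the confusing case the issue describes; the proposed sum keeps the number above zero until the pending and timed-out blocks drain.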
[jira] [Updated] (HDFS-16626) Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
[ https://issues.apache.org/jira/browse/HDFS-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16626: --- Description: In the output of command 'hdfs dfsadmin -report', the value of Under replicated blocks and ec Low redundancy block groups only contains the block number in BlockManager::neededReconstruction. It should also contain the block number in BlockManager::pendingReconstruction, include the timeout items. Specially, in some scenario, for example, decommission a dn with a lot of ec blocks, there would be a lot blocks in pendingReconstruction at a long time but neededReconstruction's size may be 0. That will confuse user and they can't access the real decommissioning progress. {code:java} Configured Capacity: 1036741707829248 (942.91 TB) Present Capacity: 983872491622400 (894.83 TB) DFS Remaining: 974247450424426 (886.07 TB) DFS Used: 9625041197974 (8.75 TB) DFS Used%: 0.98% Replicated Blocks: Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0 Missing blocks (with replication factor 1): 0 Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 0 Erasure Coded Block Groups: Low redundancy block groups: 3481 Block groups with corrupt internal blocks: 0 Missing block groups: 0 Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 245 {code} The below graph show the metrics monitor of under_replicated_blocks and pending_replicated_blocks in decommissioning a datanode process. The value of pending_replicated_blocks would not be included in dfsadmin report. !image-2022-06-08-18-30-13-757.png|width=836,height=157! was: In the output of command 'hdfs dfsadmin -report', the value of Under replicated blocks and ec Low redundancy block groups only contains the block number in BlockManager::neededReconstruction. It should also contain the block number in BlockManager::pendingReconstruction, include the timeout items. 
Specially, in some scenario, for example, decommission a dn with a lot of ec blocks, there would be a lot blocks in pendingReconstruction at a long time but neededReconstruction's size may be 0. That will confuse user and they can't access the real decommissioning progress. {code:java} Configured Capacity: 1036741707829248 (942.91 TB) Present Capacity: 983872491622400 (894.83 TB) DFS Remaining: 974247450424426 (886.07 TB) DFS Used: 9625041197974 (8.75 TB) DFS Used%: 0.98% Replicated Blocks: Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0 Missing blocks (with replication factor 1): 0 Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 0 Erasure Coded Block Groups: Low redundancy block groups: 3481 Block groups with corrupt internal blocks: 0 Missing block groups: 0 Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 245 {code} The below graph show the metrics monitor of under_replicated_blocks and pending_replicated_blocks in decommissioning a datanode process. The value of pending_replicated_blocks would not be included in dfsadmin report. !image-2022-06-08-11-38-29-664.png|width=1319,height=248! > Under replicated blocks in dfsadmin report should contain > pendingReconstruction‘s blocks > > > Key: HDFS-16626 > URL: https://issues.apache.org/jira/browse/HDFS-16626 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namanode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Attachments: image-2022-06-08-18-30-13-757.png > > Time Spent: 10m > Remaining Estimate: 0h > > In the output of command 'hdfs dfsadmin -report', the value of Under > replicated blocks and ec Low redundancy block groups only contains the block > number in BlockManager::neededReconstruction. It should also contain the > block number in BlockManager::pendingReconstruction, include the timeout > items. 
Specially, in some scenario, for example, decommission a dn with a lot > of ec blocks, there would be a lot blocks in pendingReconstruction at a long > time but neededReconstruction's size may be 0. That will confuse user and > they can't access the real decommissioning progress. > {code:java} > Configured Capacity: 1036741707829248 (942.91 TB) > Present Capacity: 983872491622400 (894.83 TB) > DFS Remaining: 974247450424426 (886.07 TB) > DFS Used: 9625041197974 (8.75 TB) > DFS Used%: 0.98% > Replicated Blocks: > Under replicated blocks: 0 > Blocks with corrupt replicas: 0 > Missing blocks: 0 > Missing blocks (with replication factor
[jira] [Updated] (HDFS-16626) Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
[ https://issues.apache.org/jira/browse/HDFS-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16626: --- Attachment: image-2022-06-08-18-30-13-757.png > Under replicated blocks in dfsadmin report should contain > pendingReconstruction‘s blocks > > > Key: HDFS-16626 > URL: https://issues.apache.org/jira/browse/HDFS-16626 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, namanode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Labels: pull-request-available > Attachments: image-2022-06-08-18-30-13-757.png > > Time Spent: 10m > Remaining Estimate: 0h > > In the output of command 'hdfs dfsadmin -report', the value of Under > replicated blocks and ec Low redundancy block groups only contains the block > number in BlockManager::neededReconstruction. It should also contain the > block number in BlockManager::pendingReconstruction, include the timeout > items. Specially, in some scenario, for example, decommission a dn with a lot > of ec blocks, there would be a lot blocks in pendingReconstruction at a long > time but neededReconstruction's size may be 0. That will confuse user and > they can't access the real decommissioning progress. 
> {code:java} > Configured Capacity: 1036741707829248 (942.91 TB) > Present Capacity: 983872491622400 (894.83 TB) > DFS Remaining: 974247450424426 (886.07 TB) > DFS Used: 9625041197974 (8.75 TB) > DFS Used%: 0.98% > Replicated Blocks: > Under replicated blocks: 0 > Blocks with corrupt replicas: 0 > Missing blocks: 0 > Missing blocks (with replication factor 1): 0 > Low redundancy blocks with highest priority to recover: 0 > Pending deletion blocks: 0 > Erasure Coded Block Groups: > Low redundancy block groups: 3481 > Block groups with corrupt internal blocks: 0 > Missing block groups: 0 > Low redundancy blocks with highest priority to recover: 0 > Pending deletion blocks: 245 {code} > The below graph show the metrics monitor of under_replicated_blocks and > pending_replicated_blocks in decommissioning a datanode process. The value of > pending_replicated_blocks would not be included in dfsadmin report. > !image-2022-06-08-11-38-29-664.png|width=1319,height=248! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16626) Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
[ https://issues.apache.org/jira/browse/HDFS-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16626:
---
Description:
In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.
{code:java}
Configured Capacity: 1036741707829248 (942.91 TB)
Present Capacity: 983872491622400 (894.83 TB)
DFS Remaining: 974247450424426 (886.07 TB)
DFS Used: 9625041197974 (8.75 TB)
DFS Used%: 0.98%
Replicated Blocks:
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    Missing blocks (with replication factor 1): 0
    Low redundancy blocks with highest priority to recover: 0
    Pending deletion blocks: 0
Erasure Coded Block Groups:
    Low redundancy block groups: 3481
    Block groups with corrupt internal blocks: 0
    Missing block groups: 0
    Low redundancy blocks with highest priority to recover: 0
    Pending deletion blocks: 245
{code}
The graph below shows the under_replicated_blocks and pending_replicated_blocks metrics monitored while decommissioning a datanode. The value of pending_replicated_blocks is not included in the dfsadmin report.

!image-2022-06-08-11-38-29-664.png|width=1319,height=248!

was:
In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.
{code:java}
Configured Capacity: 1036741707829248 (942.91 TB)
Present Capacity: 983872491622400 (894.83 TB)
DFS Remaining: 974247450424426 (886.07 TB)
DFS Used: 9625041197974 (8.75 TB)
DFS Used%: 0.98%
Replicated Blocks:
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    Missing blocks (with replication factor 1): 0
    Low redundancy blocks with highest priority to recover: 0
    Pending deletion blocks: 0
Erasure Coded Block Groups:
    Low redundancy block groups: 3481
    Block groups with corrupt internal blocks: 0
    Missing block groups: 0
    Low redundancy blocks with highest priority to recover: 0
    Pending deletion blocks: 245
{code}

> Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
>
> Key: HDFS-16626
> URL: https://issues.apache.org/jira/browse/HDFS-16626
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, namanode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.
> {code:java} > Configured Capacity: 1036741707829248 (942.91 TB) > Present Capacity: 983872491622400 (894.83 TB) > DFS Remaining: 974247450424426 (886.07 TB) > DFS Used: 9625041197974 (8.75 TB) > DFS Used%: 0.98% > Replicated Blocks: > Under replicated blocks: 0 > Blocks with corrupt replicas: 0 > Missing blocks: 0 > Missing blocks (with replication factor 1): 0 > Low redundancy blocks with highest priority to recover: 0 > Pending deletion blocks: 0 > Erasure Coded Block Groups: > Low redundancy block groups: 3481 > Block groups with corrupt internal blocks: 0 > Missing block groups: 0 > Low redundancy blocks with highest priority to recover: 0 > Pending
[jira] [Work started] (HDFS-16626) Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
[ https://issues.apache.org/jira/browse/HDFS-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HDFS-16626 started by caozhiqiang.
--
> Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
>
> Key: HDFS-16626
> URL: https://issues.apache.org/jira/browse/HDFS-16626
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, namanode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.
> {code:java}
> Configured Capacity: 1036741707829248 (942.91 TB)
> Present Capacity: 983872491622400 (894.83 TB)
> DFS Remaining: 974247450424426 (886.07 TB)
> DFS Used: 9625041197974 (8.75 TB)
> DFS Used%: 0.98%
> Replicated Blocks:
>     Under replicated blocks: 0
>     Blocks with corrupt replicas: 0
>     Missing blocks: 0
>     Missing blocks (with replication factor 1): 0
>     Low redundancy blocks with highest priority to recover: 0
>     Pending deletion blocks: 0
> Erasure Coded Block Groups:
>     Low redundancy block groups: 3481
>     Block groups with corrupt internal blocks: 0
>     Missing block groups: 0
>     Low redundancy blocks with highest priority to recover: 0
>     Pending deletion blocks: 245
> {code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16626) Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
[ https://issues.apache.org/jira/browse/HDFS-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16626:
---
Status: Patch Available (was: In Progress)

> Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
>
> Key: HDFS-16626
> URL: https://issues.apache.org/jira/browse/HDFS-16626
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, namanode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.
> {code:java}
> Configured Capacity: 1036741707829248 (942.91 TB)
> Present Capacity: 983872491622400 (894.83 TB)
> DFS Remaining: 974247450424426 (886.07 TB)
> DFS Used: 9625041197974 (8.75 TB)
> DFS Used%: 0.98%
> Replicated Blocks:
>     Under replicated blocks: 0
>     Blocks with corrupt replicas: 0
>     Missing blocks: 0
>     Missing blocks (with replication factor 1): 0
>     Low redundancy blocks with highest priority to recover: 0
>     Pending deletion blocks: 0
> Erasure Coded Block Groups:
>     Low redundancy block groups: 3481
>     Block groups with corrupt internal blocks: 0
>     Missing block groups: 0
>     Low redundancy blocks with highest priority to recover: 0
>     Pending deletion blocks: 245
> {code}
[jira] [Created] (HDFS-16626) Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
caozhiqiang created HDFS-16626:
--
Summary: Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
Key: HDFS-16626
URL: https://issues.apache.org/jira/browse/HDFS-16626
Project: Hadoop HDFS
Issue Type: Improvement
Components: ec, namanode
Affects Versions: 3.4.0
Reporter: caozhiqiang
Assignee: caozhiqiang

In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.
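The counting change requested in this issue can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class LowRedundancyReport, its fields and its two methods are hypothetical and are not the real BlockManager API, and the numbers in main are made up. It only shows that the proposed figure is the sum of the blocks in neededReconstruction, the blocks in pendingReconstruction, and the timed-out pending items.

```java
// Hypothetical model of the counters discussed in HDFS-16626.
// The names below are illustrative only; they are NOT the real BlockManager API.
final class LowRedundancyReport {
    private final long needed;    // blocks queued in neededReconstruction
    private final long pending;   // blocks scheduled and tracked in pendingReconstruction
    private final long timedOut;  // pending items that timed out and await re-queueing

    LowRedundancyReport(long needed, long pending, long timedOut) {
        this.needed = needed;
        this.pending = pending;
        this.timedOut = timedOut;
    }

    // Behaviour described in the issue: only neededReconstruction is counted.
    long reportedToday() {
        return needed;
    }

    // Proposed behaviour: also count pending blocks, including timed-out items.
    long reportedAsProposed() {
        return needed + pending + timedOut;
    }

    public static void main(String[] args) {
        // Made-up numbers for a node that is still being decommissioned.
        LowRedundancyReport r = new LowRedundancyReport(0, 120, 30);
        System.out.println("reported today:    " + r.reportedToday());
        System.out.println("reported proposed: " + r.reportedAsProposed());
    }
}
```

With neededReconstruction empty, the first figure stays at zero even though work is still in flight, which is the confusing case the issue describes.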
[jira] [Updated] (HDFS-16626) Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
[ https://issues.apache.org/jira/browse/HDFS-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16626:
---
Description:
In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.
{code:java}
Configured Capacity: 1036741707829248 (942.91 TB)
Present Capacity: 983872491622400 (894.83 TB)
DFS Remaining: 974247450424426 (886.07 TB)
DFS Used: 9625041197974 (8.75 TB)
DFS Used%: 0.98%
Replicated Blocks:
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    Missing blocks (with replication factor 1): 0
    Low redundancy blocks with highest priority to recover: 0
    Pending deletion blocks: 0
Erasure Coded Block Groups:
    Low redundancy block groups: 3481
    Block groups with corrupt internal blocks: 0
    Missing block groups: 0
    Low redundancy blocks with highest priority to recover: 0
    Pending deletion blocks: 245
{code}

was:
In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.

> Under replicated blocks in dfsadmin report should contain pendingReconstruction‘s blocks
>
> Key: HDFS-16626
> URL: https://issues.apache.org/jira/browse/HDFS-16626
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, namanode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
>
> In the output of the command 'hdfs dfsadmin -report', the values of "Under replicated blocks" and of the EC "Low redundancy block groups" only count the blocks in BlockManager::neededReconstruction. They should also count the blocks in BlockManager::pendingReconstruction, including the timed-out items. In particular, in some scenarios, for example when decommissioning a dn with a lot of EC blocks, many blocks stay in pendingReconstruction for a long time while neededReconstruction's size may be 0. That confuses users, who cannot see the real decommissioning progress.
> {code:java}
> Configured Capacity: 1036741707829248 (942.91 TB)
> Present Capacity: 983872491622400 (894.83 TB)
> DFS Remaining: 974247450424426 (886.07 TB)
> DFS Used: 9625041197974 (8.75 TB)
> DFS Used%: 0.98%
> Replicated Blocks:
>     Under replicated blocks: 0
>     Blocks with corrupt replicas: 0
>     Missing blocks: 0
>     Missing blocks (with replication factor 1): 0
>     Low redundancy blocks with highest priority to recover: 0
>     Pending deletion blocks: 0
> Erasure Coded Block Groups:
>     Low redundancy block groups: 3481
>     Block groups with corrupt internal blocks: 0
>     Missing block groups: 0
>     Low redundancy blocks with highest priority to recover: 0
>     Pending deletion blocks: 245
> {code}
[jira] [Comment Edited] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551381#comment-17551381 ]

caozhiqiang edited comment on HDFS-16613 at 6/8/22 4:01 AM:

[~hadachi], in my cluster, dfs.namenode.replication.max-streams-hard-limit=512 and dfs.namenode.replication.work.multiplier.per.iteration=20. The data flow is as follows:
# Choose the blocks to be reconstructed from neededReconstruction. This process uses dfs.namenode.replication.work.multiplier.per.iteration to limit how many blocks are processed.
# *Choose the source datanode. This process uses dfs.namenode.replication.max-streams-hard-limit to limit how many blocks are processed.*
# Choose the target datanode.
# Add the task to the datanode.
# The blocks to be replicated are put into pendingReconstruction. If blocks in pendingReconstruction time out, they are put back into neededReconstruction and processed again. *This process uses dfs.namenode.reconstruction.pending.timeout-sec to limit the time interval.*
# *Send replication commands to the dn in the heartbeat response. Originally, dfs.namenode.decommission.max-streams is used to limit the number of tasks.*

First, process 1 is not a performance bottleneck, and it runs at an interval of 3 seconds. The performance bottlenecks are in processes 2, 5 and 6, so we should increase the value of dfs.namenode.replication.max-streams-hard-limit and decrease the value of dfs.namenode.reconstruction.pending.timeout-sec. For process 6, we should switch to using dfs.namenode.replication.max-streams-hard-limit to limit the number of tasks.
{code:java}
// DatanodeManager::handleHeartbeat
if (nodeinfo.isDecommissionInProgress()) {
  maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress;
} else {
  maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress;
}
{code}
*In other words, we should move blocks from pendingReconstruction back to neededReconstruction at a shorter interval (process 5).
And should seed more replication tasks to datanode(process 2 and 6).* The below graph with under_replicated_blocks and pending_replicated_blocks metrics monitor in namenode, which can show the performance bottleneck. A lot of blocks time out in pendingReconstruction and would be put back to neededReconstruction repeatedly. The first graph is before optimization and the second is after optimization. Please help to check this process, thank you. !image-2022-06-08-11-41-11-127.png|width=932,height=190! !image-2022-06-08-11-38-29-664.png|width=931,height=175! was (Author: caozhiqiang): [~hadachi] , in my cluster, dfs.namenode.replication.max-streams-hard-limit=512, dfs.namenode.replication.work.multiplier.per.iteration=20. The data process is below: # Choose the blocks to be reconstructed from neededReconstruction. This process use dfs.namenode.replication.work.multiplier.per.iteration to limit process number. # *Choose source datanode. This process use dfs.namenode.replication.max-streams-hard-limit to limit process number.* # Choose target datanode. # Add task to datanode. # The blocks to be replicated would put to pendingReconstruction. If blocks in pendingReconstruction timeout, they will be put back to neededReconstruction and continue process. *This process use dfs.namenode.reconstruction.pending.timeout-sec to limit time interval.* # *Send replication cmds to dn in heartbeat response. Use dfs.namenode.decommission.max-streams to limit task number original.* Firstly, the process 1 doesn't have performance bottleneck. And its process interval is 3 seconds. Performance bottleneck is in process 2, 5 and 6. So we should increase the value of dfs.namenode.replication.max-streams-hard-limit and decrease the value of dfs.namenode.reconstruction.pending.timeout-sec{*}.{*} With process 6, we should change to use dfs.namenode.replication.max-streams-hard-limit to limit the task number. 
{code:java} // DatanodeManager::handleHeartbeat if (nodeinfo.isDecommissionInProgress()) { maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress; } else { maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress; } {code} *In other words, we should get blocks from pendingReconstruction to neededReconstruction in shorter interval(process 5). And seed more replication tasks to datanode(process 2 and 6).* The below graph with under_replicated_blocks and pending_replicated_blocks metrics monitor in namenode, which can show the performance bottleneck. A lot of blocks time out in pendingReconstruction and would be put back to neededReconstruction repeatedly. The first graph is before optimization and the second is after optimization. Please help to check this process, thank you. !image-2022-06-08-11-41-11-127.png|width=932,height=190!
[jira] [Comment Edited] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551381#comment-17551381 ]

caozhiqiang edited comment on HDFS-16613 at 6/8/22 3:58 AM:

[~hadachi], in my cluster, dfs.namenode.replication.max-streams-hard-limit=512 and dfs.namenode.replication.work.multiplier.per.iteration=20. The data flow is as follows:
# Choose the blocks to be reconstructed from neededReconstruction. This process uses dfs.namenode.replication.work.multiplier.per.iteration to limit how many blocks are processed.
# *Choose the source datanode. This process uses dfs.namenode.replication.max-streams-hard-limit to limit how many blocks are processed.*
# Choose the target datanode.
# Add the task to the datanode.
# The blocks to be replicated are put into pendingReconstruction. If blocks in pendingReconstruction time out, they are put back into neededReconstruction and processed again. *This process uses dfs.namenode.reconstruction.pending.timeout-sec to limit the time interval.*
# *Send replication commands to the dn in the heartbeat response. Originally, dfs.namenode.decommission.max-streams is used to limit the number of tasks.*

First, process 1 is not a performance bottleneck, and it runs at an interval of 3 seconds. The performance bottlenecks are in processes 2, 5 and 6, so we should increase the value of dfs.namenode.replication.max-streams-hard-limit and decrease the value of dfs.namenode.reconstruction.pending.timeout-sec. For process 6, we should switch to using dfs.namenode.replication.max-streams-hard-limit to limit the number of tasks.
{code:java}
// DatanodeManager::handleHeartbeat
if (nodeinfo.isDecommissionInProgress()) {
  maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress;
} else {
  maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress;
}
{code}
*In other words, we should move blocks from pendingReconstruction back to neededReconstruction at a shorter interval (process 5).
And seed more replication tasks to datanode(process 2 and 6).* The below graph with under_replicated_blocks and pending_replicated_blocks metrics monitor in namenode, which can show the performance bottleneck. A lot of blocks time out in pendingReconstruction and would be put back to neededReconstruction repeatedly. The first graph is before optimization and the second is after optimization. Please help to check this process, thank you. !image-2022-06-08-11-41-11-127.png|width=932,height=190! !image-2022-06-08-11-38-29-664.png|width=931,height=175! was (Author: caozhiqiang): [~hadachi] , in my cluster, dfs.namenode.replication.max-streams-hard-limit=512, dfs.namenode.replication.work.multiplier.per.iteration=20. The data process is below: # Choose the blocks to be reconstructed from neededReconstruction. This process use dfs.namenode.replication.work.multiplier.per.iteration to limit process number. # *Choose source datanode. This process use dfs.namenode.replication.max-streams-hard-limit to limit process number.* # Choose target datanode. # Add task to datanode. # The blocks to be replicated would put to pendingReconstruction. If blocks in pendingReconstruction timeout, they will be put back to neededReconstruction and continue process. *This process use dfs.namenode.reconstruction.pending.timeout-sec to limit time interval.* # *Send cmd to dn in heartbeat response. Use dfs.namenode.decommission.max-streams to limit task number original.* Firstly, the process 1 doesn't have performance bottleneck. And its process interval is 3 seconds. Performance bottleneck is in process 2, 5 and 6. So we should increase the value of dfs.namenode.replication.max-streams-hard-limit and decrease the value of dfs.namenode.reconstruction.pending.timeout-sec{*}.{*} With process 6, we should use dfs.namenode.replication.max-streams-hard-limit to limit the task number. That mean we should take blocks from pendingReconstruction to neededReconstruction in shorten interval(process 5). 
And seed more replication tasks to datanode(process 2 and 6). {code:java} // DatanodeManager::handleHeartbeat if (nodeinfo.isDecommissionInProgress()) { maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress; } else { maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress; } {code} The below graph with under replicated blocks and pending replicated blocks metrics monitor, which can show the performance bottleneck. A lot of blocks time out in pendingReconstruction and were put back to neededReconstruction repeatedly. The first graph is before optimization and the second is after optimization. Please help to check this process, thank you. !image-2022-06-08-11-41-11-127.png|width=932,height=190! !image-2022-06-08-11-38-29-664.png|width=931,height=175! > EC: Improve
[jira] [Comment Edited] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551381#comment-17551381 ]

caozhiqiang edited comment on HDFS-16613 at 6/8/22 3:52 AM:

[~hadachi], in my cluster, dfs.namenode.replication.max-streams-hard-limit=512 and dfs.namenode.replication.work.multiplier.per.iteration=20. The data flow is as follows:
# Choose the blocks to be reconstructed from neededReconstruction. This process uses dfs.namenode.replication.work.multiplier.per.iteration to limit how many blocks are processed.
# *Choose the source datanode. This process uses dfs.namenode.replication.max-streams-hard-limit to limit how many blocks are processed.*
# Choose the target datanode.
# Add the task to the datanode.
# The blocks to be replicated are put into pendingReconstruction. If blocks in pendingReconstruction time out, they are put back into neededReconstruction and processed again. *This process uses dfs.namenode.reconstruction.pending.timeout-sec to limit the time interval.*
# *Send the command to the dn in the heartbeat response. Originally, dfs.namenode.decommission.max-streams is used to limit the number of tasks.*

First, process 1 is not a performance bottleneck, and it runs at an interval of 3 seconds. The performance bottlenecks are in processes 2, 5 and 6, so we should increase the value of dfs.namenode.replication.max-streams-hard-limit and decrease the value of dfs.namenode.reconstruction.pending.timeout-sec. For process 6, we should use dfs.namenode.replication.max-streams-hard-limit to limit the number of tasks. That means we should move blocks from pendingReconstruction back to neededReconstruction at a shorter interval (process 5), and send more replication tasks to the datanode (processes 2 and 6).
{code:java} // DatanodeManager::handleHeartbeat if (nodeinfo.isDecommissionInProgress()) { maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress; } else { maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress; } {code} The below graph with under replicated blocks and pending replicated blocks metrics monitor, which can show the performance bottleneck. A lot of blocks time out in pendingReconstruction and were put back to neededReconstruction repeatedly. The first graph is before optimization and the second is after optimization. Please help to check this process, thank you. !image-2022-06-08-11-41-11-127.png|width=932,height=190! !image-2022-06-08-11-38-29-664.png|width=931,height=175! was (Author: caozhiqiang): [~hadachi] , in my cluster, dfs.namenode.replication.max-streams-hard-limit=512, dfs.namenode.replication.work.multiplier.per.iteration=20. The data process is below: # Choose the blocks to be reconstructed from neededReconstruction. This process use dfs.namenode.replication.work.multiplier.per.iteration to limit process number. # *Choose source datanode. This process use dfs.namenode.replication.max-streams-hard-limit to limit process number.* # Choose target datanode. # Add task to datanode. # The blocks to be replicated would put to pendingReconstruction. If blocks in pendingReconstruction timeout, they will be put back to neededReconstruction and continue process. *This process use dfs.namenode.reconstruction.pending.timeout-sec to limit time interval.* # *Send cmd to dn in heartbeat response. Use dfs.namenode.decommission.max-streams to limit task number original.* Firstly, the process 1 doesn't have performance bottleneck. Performance bottleneck is in process 2, 5 and 6. 
So we should increase the value of dfs.namenode.replication.max-streams-hard-limit and decrease the value of dfs.namenode.reconstruction.pending.timeout-sec{*}.{*} With process 6, we should use dfs.namenode.replication.max-streams-hard-limit to limit the task number. {code:java} // DatanodeManager::handleHeartbeat if (nodeinfo.isDecommissionInProgress()) { maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress; } else { maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress; } {code} The below graph with under replicated blocks and pending replicated blocks metrics monitor, which can show the performance bottleneck. A lot of blocks time out in pendingReconstruction and were put back to neededReconstruction repeatedly. The first graph is before optimization and the second is after optimization. Please help to check this process, thank you. !image-2022-06-08-11-41-11-127.png|width=932,height=190! !image-2022-06-08-11-38-29-664.png|width=931,height=175! > EC: Improve performance of decommissioning dn with many ec blocks > - > > Key: HDFS-16613 > URL: https://issues.apache.org/jira/browse/HDFS-16613 > Project: Hadoop HDFS >
[jira] [Commented] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551381#comment-17551381 ]

caozhiqiang commented on HDFS-16613:

[~hadachi], in my cluster, dfs.namenode.replication.max-streams-hard-limit=512 and dfs.namenode.replication.work.multiplier.per.iteration=20. The data flow is as follows:
# Choose the blocks to be reconstructed from neededReconstruction. This process uses dfs.namenode.replication.work.multiplier.per.iteration to limit how many blocks are processed.
# *Choose the source datanode. This process uses dfs.namenode.replication.max-streams-hard-limit to limit how many blocks are processed.*
# Choose the target datanode.
# Add the task to the datanode.
# The blocks to be replicated are put into pendingReconstruction. If blocks in pendingReconstruction time out, they are put back into neededReconstruction and processed again. *This process uses dfs.namenode.reconstruction.pending.timeout-sec to limit the time interval.*
# *Send the command to the dn in the heartbeat response. Originally, dfs.namenode.decommission.max-streams is used to limit the number of tasks.*

First, process 1 is not a performance bottleneck. The performance bottlenecks are in processes 2, 5 and 6, so we should increase the value of dfs.namenode.replication.max-streams-hard-limit and decrease the value of dfs.namenode.reconstruction.pending.timeout-sec. For process 6, we should use dfs.namenode.replication.max-streams-hard-limit to limit the number of tasks.
{code:java}
// DatanodeManager::handleHeartbeat
if (nodeinfo.isDecommissionInProgress()) {
  maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress;
} else {
  maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress;
}
{code}
The graphs below show the monitored under replicated blocks and pending replicated blocks metrics, which reveal the performance bottleneck: a lot of blocks timed out in pendingReconstruction and were put back into neededReconstruction repeatedly. The first graph is before optimization and the second is after optimization. Please help check this process, thank you.

!image-2022-06-08-11-41-11-127.png|width=932,height=190!

!image-2022-06-08-11-38-29-664.png|width=931,height=175!

> EC: Improve performance of decommissioning dn with many ec blocks
>
> Key: HDFS-16613
> URL: https://issues.apache.org/jira/browse/HDFS-16613
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, erasure-coding, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-06-07-11-46-42-389.png, image-2022-06-07-17-42-16-075.png, image-2022-06-07-17-45-45-316.png, image-2022-06-07-17-51-04-876.png, image-2022-06-07-17-55-40-203.png, image-2022-06-08-11-38-29-664.png, image-2022-06-08-11-41-11-127.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> In an HDFS cluster with a lot of EC blocks, decommissioning a dn is very slow. The reason is that, unlike replicated blocks, which can be copied from any dn that holds a replica of the same block, EC blocks have to be replicated from the decommissioning dn itself.
> The configurations dfs.namenode.replication.max-streams and dfs.namenode.replication.max-streams-hard-limit limit the replication speed, but increasing them puts the whole cluster's network at risk. So a new configuration should be added to limit the decommissioning dn, distinct from the cluster-wide max-streams limit.
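The per-heartbeat budget from the handleHeartbeat snippet in the comment above can be tried out in isolation with a small, self-contained sketch. The class TransferBudget and its method are hypothetical stand-ins, not Hadoop classes. The value 512 for the hard limit comes from the comment, while the value 2 for max-streams is only an assumed example. Clamping the result at zero is an addition of this sketch and is not part of the quoted snippet.

```java
// Hypothetical, stand-alone model of the maxTransfers choice quoted from
// DatanodeManager::handleHeartbeat. It is NOT the real Hadoop implementation.
final class TransferBudget {
    private final int maxReplicationStreams;        // models dfs.namenode.replication.max-streams
    private final int replicationStreamsHardLimit;  // models dfs.namenode.replication.max-streams-hard-limit

    TransferBudget(int maxReplicationStreams, int replicationStreamsHardLimit) {
        this.maxReplicationStreams = maxReplicationStreams;
        this.replicationStreamsHardLimit = replicationStreamsHardLimit;
    }

    // Number of new transfers that may be handed to a datanode in one heartbeat.
    int maxTransfers(boolean decommissionInProgress, int xmitsInProgress) {
        // A decommissioning node is allowed up to the hard limit,
        // every other node only up to the normal stream limit.
        int limit = decommissionInProgress
            ? replicationStreamsHardLimit
            : maxReplicationStreams;
        // Assumption of this sketch: never report a negative budget.
        return Math.max(0, limit - xmitsInProgress);
    }

    public static void main(String[] args) {
        // 512 is the hard limit quoted in the comment; 2 is an assumed max-streams value.
        TransferBudget budget = new TransferBudget(2, 512);
        System.out.println("decommissioning node: " + budget.maxTransfers(true, 12));
        System.out.println("normal node:          " + budget.maxTransfers(false, 1));
    }
}
```

The point of the split is that only the node being drained gets the larger budget, so the cluster-wide stream limit can stay low.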
[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16613:
---
Attachment: image-2022-06-08-11-41-11-127.png

> EC: Improve performance of decommissioning dn with many ec blocks
>
> Key: HDFS-16613
> URL: https://issues.apache.org/jira/browse/HDFS-16613
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, erasure-coding, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-06-07-11-46-42-389.png, image-2022-06-07-17-42-16-075.png, image-2022-06-07-17-45-45-316.png, image-2022-06-07-17-51-04-876.png, image-2022-06-07-17-55-40-203.png, image-2022-06-08-11-38-29-664.png, image-2022-06-08-11-41-11-127.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> In an HDFS cluster with a lot of EC blocks, decommissioning a dn is very slow. The reason is that, unlike replicated blocks, which can be copied from any dn that holds a replica of the same block, EC blocks have to be replicated from the decommissioning dn itself.
> The configurations dfs.namenode.replication.max-streams and dfs.namenode.replication.max-streams-hard-limit limit the replication speed, but increasing them puts the whole cluster's network at risk. So a new configuration should be added to limit the decommissioning dn, distinct from the cluster-wide max-streams limit.
[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16613: --- Attachment: image-2022-06-08-11-38-29-664.png
[jira] [Comment Edited] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550907#comment-17550907 ] caozhiqiang edited comment on HDFS-16613 at 6/7/22 10:52 AM: [~hadachi], thank you for your review. Firstly, my hadoop branch already includes HDFS-14768. In my test, even when the decommissioning node is busy, the ec blocks are not reconstructed. In that round no ec task is sent to a datanode and the blocks are only held in BlockManager::pendingReconstruction. After the timeout, these blocks are put back into BlockManager::neededReconstruction and rescheduled next time. So all blocks are replicated from the decommissioning node rather than reconstructed. By the way, I decommission only one dn at a time. Secondly, there are 12 datanodes in my cluster, and each dn has 12 disks. There are 27217 ec block groups in my cluster and about 2 blocks in each datanode. Apart from the decommissioning node, the load on the other nodes is very low, including load average, cpu iowait and network. This also shows that the blocks are replicated from the decommissioning node to the other nodes. !image-2022-06-07-17-55-40-203.png|width=772,height=192! !image-2022-06-07-17-45-45-316.png|width=772,height=198! !image-2022-06-07-17-51-04-876.png|width=769,height=256!
[jira] [Comment Edited] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550907#comment-17550907 ] caozhiqiang edited comment on HDFS-16613 at 6/7/22 10:45 AM: [~hadachi], thank you for your review. Firstly, my hadoop branch already includes HDFS-14768. In my test, even when the decommissioning node is busy, the ec blocks are not reconstructed. In that round no ec task is sent to a datanode and the blocks are only held in BlockManager::pendingReconstruction. After the timeout, these blocks are put back into BlockManager::neededReconstruction and rescheduled next time. So all blocks are replicated from the decommissioning node rather than reconstructed. By the way, I decommission only one dn at a time. Secondly, there are 12 datanodes in my cluster, and each dn has 12 disks. There are 27217 ec block groups in my cluster and about 2 blocks in each datanode. Apart from the decommissioning node, the load on the other nodes is very low, including load average, cpu iowait and network. This also shows the blocks are replicated from the decommissioning node to the other nodes. !image-2022-06-07-17-55-40-203.png|width=772,height=192! !image-2022-06-07-17-45-45-316.png|width=772,height=198! !image-2022-06-07-17-51-04-876.png|width=769,height=256!
[jira] [Commented] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550907#comment-17550907 ] caozhiqiang commented on HDFS-16613: [~hadachi], thank you for your review. Firstly, my hadoop branch already includes HDFS-14768. In my test, even when the decommissioning node is busy, the ec blocks are not reconstructed. No ec task is sent to a datanode and the blocks are only held in BlockManager::pendingReconstruction. After the timeout, these blocks are put back into BlockManager::neededReconstruction and rescheduled next time. So all blocks are replicated from the decommissioning node rather than reconstructed. By the way, I decommission only one dn at a time. Secondly, there are 12 datanodes in my cluster, and each dn has 12 disks. There are 27217 ec block groups in my cluster and about 2 blocks in one datanode. Apart from the decommissioning node, the load on the other nodes is very low, including load average, cpu iowait and network. !image-2022-06-07-17-55-40-203.png|width=772,height=192! !image-2022-06-07-17-45-45-316.png|width=772,height=198! !image-2022-06-07-17-51-04-876.png|width=769,height=256!
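The pending/needed cycle described in the comment above can be pictured with a toy model. This is only a sketch under stated assumptions: the class `PendingRequeueModel` and all of its methods are hypothetical names invented for illustration, not the real BlockManager code, and the 300-second timeout used in `main` is an arbitrary example value.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical, not Hadoop code) of blocks moving between a
 *  "needed" queue and a "pending" set that expires after a timeout. */
final class PendingRequeueModel {
    private final Deque<String> needed = new ArrayDeque<>();
    // block id -> time (ms) at which the block entered the pending set
    private final Map<String, Long> pending = new LinkedHashMap<>();
    private final long timeoutMs;

    PendingRequeueModel(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    public void addNeeded(String block) {
        needed.addLast(block);
    }

    /** Moves up to {@code count} blocks from needed to pending at {@code nowMs}. */
    public List<String> schedule(int count, long nowMs) {
        List<String> moved = new ArrayList<>();
        while (moved.size() < count && !needed.isEmpty()) {
            String block = needed.pollFirst();
            pending.put(block, nowMs);
            moved.add(block);
        }
        return moved;
    }

    /** Marks a pending block as done; returns false if it was not pending. */
    public boolean complete(String block) {
        return pending.remove(block) != null;
    }

    /** Puts every block that has been pending for at least the timeout back
     *  into the needed queue and returns how many were requeued. */
    public int requeueTimedOut(long nowMs) {
        int requeued = 0;
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                it.remove();
                needed.addLast(entry.getKey());
                requeued++;
            }
        }
        return requeued;
    }

    public int neededSize() { return needed.size(); }

    public int pendingSize() { return pending.size(); }

    public static void main(String[] args) {
        PendingRequeueModel model = new PendingRequeueModel(300_000L); // example timeout
        model.addNeeded("blk_1");
        model.addNeeded("blk_2");
        model.schedule(2, 0L);
        model.complete("blk_1");                             // blk_1 finished in time
        System.out.println(model.requeueTimedOut(300_000L)); // blk_2 timed out
    }
}
```

In this model a block that is scheduled but never worked on simply waits out the timeout before it can be scheduled again, which is why a shorter timeout shortens the retry cycle.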
[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16613: --- Attachment: image-2022-06-07-17-55-40-203.png
[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16613: --- Attachment: image-2022-06-07-17-51-04-876.png
[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16613: --- Attachment: image-2022-06-07-17-45-45-316.png
[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16613: --- Attachment: image-2022-06-07-17-42-16-075.png
[jira] [Comment Edited] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550782#comment-17550782 ] caozhiqiang edited comment on HDFS-16613 at 6/7/22 3:46 AM: In my cluster tests, the following optimizations maximized the IO performance of the decommissioning DN, and the time spent decommissioning a DN dropped from 3 hours to half an hour.
# Add this patch
# Increase the value of dfs.namenode.replication.max-streams-hard-limit
# Decrease the value of dfs.namenode.reconstruction.pending.timeout-sec to shorten the interval for checking pendingReconstructions.
!image-2022-06-07-11-46-42-389.png|width=552,height=165!
[jira] [Commented] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550782#comment-17550782 ] caozhiqiang commented on HDFS-16613: In my cluster tests, the following optimizations maximized the IO performance of the decommissioning DN, and the time spent decommissioning a DN dropped from 3 hours to half an hour.
# Add this patch
# Increase the value of dfs.namenode.replication.max-streams-hard-limit
# Decrease the value of dfs.namenode.reconstruction.pending.timeout-sec to shorten the interval for checking pendingReconstructions.
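For readers who want to try the two configuration changes listed in the comment above, an hdfs-site.xml fragment on the NameNode might look like the sketch below. The property names come from the comment; the values are placeholders chosen only for illustration, not the values used in the reported test, and should be tuned per cluster.

```xml
<!-- Illustrative values only; not taken from the original test. -->
<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>8</value>
</property>
<property>
  <name>dfs.namenode.reconstruction.pending.timeout-sec</name>
  <value>120</value>
</property>
```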
[jira] [Comment Edited] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545904#comment-17545904 ] caozhiqiang edited comment on HDFS-16613 at 6/3/22 3:30 PM: [~tasanuma] [~hadachi], besides adding a new configuration to limit the decommissioning dn separately, we can also use dfs.namenode.replication.max-streams-hard-limit to achieve the same purpose. We only need to modify DatanodeManager::handleHeartbeat() and use dfs.namenode.replication.max-streams-hard-limit to compute numReplicationTasks for a decommissioning dn. I created a new pr [4398|https://github.com/apache/hadoop/pull/4398], please help to review it if you are free.
{code:java}
int maxTransfers;
if (nodeinfo.isDecommissionInProgress()) {
  // A decommissioning dn is bounded by the hard limit.
  maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress;
} else {
  // Every other dn keeps the regular max-streams limit.
  maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress;
}
{code}
[jira] [Comment Edited] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545904#comment-17545904 ] caozhiqiang edited comment on HDFS-16613 at 6/3/22 2:54 PM: [~tasanuma] [~hadachi], besides adding a new configuration to limit the decommissioning dn separately, we can also use dfs.namenode.replication.max-streams-hard-limit to achieve the same purpose. We only need to modify DatanodeManager::handleHeartbeat() and use dfs.namenode.replication.max-streams-hard-limit to compute numReplicationTasks for a decommissioning dn. I will create a new pr, please help to review it.
{code:java}
int maxTransfers;
if (nodeinfo.isDecommissionInProgress()) {
  // A decommissioning dn is bounded by the hard limit.
  maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress;
} else {
  // Every other dn keeps the regular max-streams limit.
  maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress;
}
{code}
[jira] [Commented] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545904#comment-17545904 ] caozhiqiang commented on HDFS-16613: [~tasanuma] [~hadachi], besides adding a new configuration to limit the decommissioning dn separately, we can also use dfs.namenode.replication.max-streams-hard-limit to achieve the same purpose. We only need to modify DatanodeManager::handleHeartbeat() and use dfs.namenode.replication.max-streams-hard-limit to compute numReplicationTasks for a decommissioning dn. I will create a new pr, please help to review.
{code:java}
int maxTransfers;
if (nodeinfo.isDecommissionInProgress()) {
  // A decommissioning dn is bounded by the hard limit.
  maxTransfers = blockManager.getReplicationStreamsHardLimit() - xmitsInProgress;
} else {
  // Every other dn keeps the regular max-streams limit.
  maxTransfers = blockManager.getMaxReplicationStreams() - xmitsInProgress;
}
{code}
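The limit selection proposed in the comment above can be tried in isolation with a small stand-alone sketch. The class `TransferBudget` and its method signature are hypothetical, written only to make the arithmetic runnable outside Hadoop; the limits 2 and 4 used in `main` are example inputs, not values taken from the thread.

```java
/** Hypothetical stand-alone model of the per-heartbeat transfer budget;
 *  not the actual DatanodeManager code. */
final class TransferBudget {
    private TransferBudget() {}

    /**
     * Returns how many new transfers a datanode may be given.
     * A decommissioning node is bounded by the hard limit, any other
     * node by the regular max-streams limit. The result is not clamped,
     * so a value <= 0 means there is no budget left in this model.
     */
    public static int maxTransfers(boolean decommissionInProgress,
                                   int maxReplicationStreams,
                                   int replicationStreamsHardLimit,
                                   int xmitsInProgress) {
        int limit = decommissionInProgress
                ? replicationStreamsHardLimit
                : maxReplicationStreams;
        return limit - xmitsInProgress;
    }

    public static void main(String[] args) {
        // Example inputs: max-streams = 2, hard limit = 4, one transfer in flight.
        System.out.println(maxTransfers(false, 2, 4, 1)); // prints 1
        System.out.println(maxTransfers(true, 2, 4, 1));  // prints 3
    }
}
```

With these example inputs the decommissioning node gets three new transfers per heartbeat instead of one, while the budget of every other node is unchanged.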
[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16613: --- Status: Patch Available (was: In Progress)
[jira] [Work started] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16613 started by caozhiqiang. -- > EC: Improve performance of decommissioning dn with many ec blocks > - > > Key: HDFS-16613 > URL: https://issues.apache.org/jira/browse/HDFS-16613 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, erasure-coding, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > > In a hdfs cluster with a lot of EC blocks, decommission a dn is very slow. > The reason is unlike replication blocks can be replicated from any dn which > has the same block replication, the ec block have to be replicated from the > decommissioning dn. > The configurations dfs.namenode.replication.max-streams and > dfs.namenode.replication.max-streams-hard-limit will limit the replication > speed, but increase these configurations will create risk to the whole > cluster's network. So it should add a new configuration to limit the > decommissioning dn, distinguished from the cluster wide max-streams limit. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16613: --- Description: In a hdfs cluster with a lot of EC blocks, decommission a dn is very slow. The reason is unlike replication blocks can be replicated from any dn which has the same block replication, the ec block have to be replicated from the decommissioning dn. The configurations dfs.namenode.replication.max-streams and dfs.namenode.replication.max-streams-hard-limit will limit the replication speed, but increase these configurations will create risk to the whole cluster's network. So it should add a new configuration to limit the decommissioning dn, distinguished from the cluster wide max-streams limit. was:In a hdfs cluster with a lot of EC blocks, decommission a dn is very slow. The reason is unlike replication blocks can be replicated from any dn which has the same block replication, the ec block have to be replicated from the decommissioning dn. The configurations dfs.namenode.replication.max-streams and dfs.namenode.replication.max-streams-hard-limit will limit the replication speed, but increase these configurations will create risk to the whole cluster's network. So it should add a new configuration to limit the decommissioning dn, distinguished from the cluster wide max-streams limit. > EC: Improve performance of decommissioning dn with many ec blocks > - > > Key: HDFS-16613 > URL: https://issues.apache.org/jira/browse/HDFS-16613 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ec, erasure-coding, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > > In a hdfs cluster with a lot of EC blocks, decommission a dn is very slow. > The reason is unlike replication blocks can be replicated from any dn which > has the same block replication, the ec block have to be replicated from the > decommissioning dn. 
> The configurations dfs.namenode.replication.max-streams and > dfs.namenode.replication.max-streams-hard-limit will limit the replication > speed, but increase these configurations will create risk to the whole > cluster's network. So it should add a new configuration to limit the > decommissioning dn, distinguished from the cluster wide max-streams limit. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks
caozhiqiang created HDFS-16613: -- Summary: EC: Improve performance of decommissioning dn with many ec blocks Key: HDFS-16613 URL: https://issues.apache.org/jira/browse/HDFS-16613 Project: Hadoop HDFS Issue Type: Improvement Components: ec, erasure-coding, namenode Affects Versions: 3.4.0 Reporter: caozhiqiang Assignee: caozhiqiang In a hdfs cluster with a lot of EC blocks, decommission a dn is very slow. The reason is unlike replication blocks can be replicated from any dn which has the same block replication, the ec block have to be replicated from the decommissioning dn. The configurations dfs.namenode.replication.max-streams and dfs.namenode.replication.max-streams-hard-limit will limit the replication speed, but increase these configurations will create risk to the whole cluster's network. So it should add a new configuration to limit the decommissioning dn, distinguished from the cluster wide max-streams limit. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515073#comment-17515073 ]

caozhiqiang commented on HDFS-16456:

[~tasanuma], I have created a PR at [https://github.com/apache/hadoop/pull/4126]. This is my first time using a GitHub PR, so please check whether I made any mistakes. Thank you very much!

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Critical
> Labels: pull-request-available
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, HDFS-16456.009.patch, HDFS-16456.010.patch
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In the scenario below, decommission fails with the reason TOO_MANY_NODES_ON_RACK:
> # Enable an EC policy, such as RS-6-3-1024k.
> # The number of racks in the cluster is equal to or less than the replication number (9).
> # A rack has only one DN, and this DN is decommissioned.
> The root cause is in BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack(), which returns the limit maxNodesPerRack used when choosing targets. In this scenario maxNodesPerRack is 1, which means only one datanode can be chosen from each rack.
> {code:java}
> protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>   ...
>   // If more replicas than racks, evenly spread the replicas.
>   // This calculation rounds up.
>   int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>   return new int[] {numOfReplicas, maxNodesPerRack};
> }
> {code}
> The line int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; is reached with totalNumOfReplicas=9 and numOfRacks=9.
> When we decommission a dn that is the only node in its rack, chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws NotEnoughReplicasException, but the exception is not caught, so there is no fallback to chooseEvenlyFromRemainingRacks().
> During decommission, after targets are chosen, verifyBlockPlacement() reports a total rack count that still includes the invalid rack, so BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, which also causes the decommission to fail.
> {code:java}
> public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>     int numberOfReplicas) {
>   if (locs == null)
>     locs = DatanodeDescriptor.EMPTY_ARRAY;
>   if (!clusterMap.hasClusterEverBeenMultiRack()) {
>     // only one rack
>     return new BlockPlacementStatusDefault(1, 1, 1);
>   }
>   // Count locations on different racks.
>   Set<String> racks = new HashSet<>();
>   for (DatanodeInfo dn : locs) {
>     racks.add(dn.getNetworkLocation());
>   }
>   return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>       clusterMap.getNumOfRacks());
> }
> {code}
> {code:java}
> public boolean isPlacementPolicySatisfied() {
>   return requiredRacks <= currentRacks || currentRacks >= totalRacks;
> }
> {code}
> According to the description above, the following changes are needed to fix it:
> # In startDecommission() or stopDecommission(), also update numOfRacks in class NetworkTopology. Otherwise choosing targets may fail because maxNodesPerRack is too small, and even if choosing targets succeeds, isPlacementPolicySatisfied() still returns false and the decommission fails.
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first chooseOnce() call should also be wrapped in try/catch; otherwise it does not fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
> # In verifyBlockPlacement(), remove invalid racks from the total numOfRacks; otherwise isPlacementPolicySatisfied() returns false and data reconstruction fails.
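The arithmetic in the description can be checked with a small worked example. RackMath is a hypothetical helper written only for this illustration; its two methods copy the quoted expressions from getMaxNodesPerRack() and isPlacementPolicySatisfied(), and the value of 8 usable racks is an assumption that follows from removing the single-node rack from a 9-rack cluster.

```java
// Hypothetical helper, not part of Hadoop: it isolates the two quoted expressions.
final class RackMath {
    private RackMath() {}

    // Same rounding-up integer division as in the quoted getMaxNodesPerRack().
    static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
        return (totalNumOfReplicas - 1) / numOfRacks + 1;
    }

    // Same predicate as in the quoted isPlacementPolicySatisfied().
    static boolean isPlacementPolicySatisfied(int currentRacks, int requiredRacks, int totalRacks) {
        return requiredRacks <= currentRacks || currentRacks >= totalRacks;
    }
}
```

With 9 replicas and 9 racks, maxNodesPerRack(9, 9) is 1, so once one rack becomes unusable only 8 of the 9 targets can be placed. Counting 8 racks instead gives maxNodesPerRack(9, 8) = 2, which leaves room for the ninth target. The same correction matters for verification: isPlacementPolicySatisfied(8, 9, 9) is false, while isPlacementPolicySatisfied(8, 9, 8) is true.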
[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514492#comment-17514492 ]

caozhiqiang commented on HDFS-16456:

[~tasanuma] I have modified this patch in [^HDFS-16456.010.patch], please review. For question 4, the scenario below produces a wrong result:
# Decommission a datanode that is the only node in its rack. numOfEmptyRacks is incremented by 1.
# Stop this datanode. numOfEmptyRacks is decremented by 1 because this rack is also removed from emptyRackMap.
# Start this datanode. Both this rack and this node are added to emptyRackMap, but decommissionNode() is not called again, so numOfEmptyRacks does not change.

This is an error: the node is still decommissioned, so its rack should be considered empty and numOfEmptyRacks should be incremented by 1. So I use decommissionNodes to check whether a newly added node is a decommissioned one.

> EC: Decommission a rack with only on dn will fail when the rack number is > equal with replication > > > Key: HDFS-16456 > URL: https://issues.apache.org/jira/browse/HDFS-16456 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Priority: Critical > Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, > HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, > HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, > HDFS-16456.009.patch, HDFS-16456.010.patch > > > In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason: > # Enable EC policy, such as RS-6-3-1024k. > # The rack number in this cluster is equal with or less than the replication > number(9) > # A rack only has one DN, and decommission this DN. > The root cause is in > BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will > give a limit parameter maxNodesPerRack for choose targets.
In this scenario, > the maxNodesPerRack is 1, which means each rack can only be chosen one > datanode. > {code:java} > protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) { >... > // If more replicas than racks, evenly spread the replicas. > // This calculation rounds up. > int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > return new int[] {numOfReplicas, maxNodesPerRack}; > } {code} > int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > here will be called, where totalNumOfReplicas=9 and numOfRacks=9 > When we decommission one dn which is only one node in its rack, the > chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() > will throw NotEnoughReplicasException, but the exception will not be caught > and fail to fallback to chooseEvenlyFromRemainingRacks() function. > When decommission, after choose targets, verifyBlockPlacement() function will > return the total rack number contains the invalid rack, and > BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false > and it will also cause decommission fail. > {code:java} > public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs, > int numberOfReplicas) { > if (locs == null) > locs = DatanodeDescriptor.EMPTY_ARRAY; > if (!clusterMap.hasClusterEverBeenMultiRack()) { > // only one rack > return new BlockPlacementStatusDefault(1, 1, 1); > } > // Count locations on different racks. 
> Set racks = new HashSet<>(); > for (DatanodeInfo dn : locs) { > racks.add(dn.getNetworkLocation()); > } > return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas, > clusterMap.getNumOfRacks()); > } {code} > {code:java} > public boolean isPlacementPolicySatisfied() { > return requiredRacks <= currentRacks || currentRacks >= totalRacks; > }{code} > According to the above description, we should make the below modify to fix it: > # In startDecommission() or stopDecommission(), we should also change the > numOfRacks in class NetworkTopology. Or choose targets may fail for the > maxNodesPerRack is too small. And even choose targets success, > isPlacementPolicySatisfied will also return false cause decommission fail. > # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first > chooseOnce() function should also be put in try..catch..., or it will not > fallback to call chooseEvenlyFromRemainingRacks() when throw exception. > # In verifyBlockPlacement, we need to remove invalid racks from total > numOfRacks, or isPlacementPolicySatisfied() will return false and cause fail > to reconstruct data. > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
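The decommission, stop, start sequence from the comment above can be replayed with a minimal sketch. EmptyRackTracker and its method names are hypothetical and are not taken from the attached patches; the sketch recomputes the empty-rack count from a remembered set of decommissioned nodes instead of maintaining a counter, which is an assumption made here to keep the example short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not the NetworkTopology code from the patch.
final class EmptyRackTracker {
    // rack -> nodes currently registered on that rack
    private final Map<String, Set<String>> nodesByRack = new HashMap<>();
    // decommissioned nodes, remembered even while a node is stopped
    private final Set<String> decommissionNodes = new HashSet<>();

    void addNode(String rack, String node) {
        nodesByRack.computeIfAbsent(rack, r -> new HashSet<>()).add(node);
    }

    void removeNode(String rack, String node) {
        Set<String> nodes = nodesByRack.get(rack);
        if (nodes == null) {
            return;
        }
        nodes.remove(node);
        if (nodes.isEmpty()) {
            nodesByRack.remove(rack); // a rack without registered nodes disappears
        }
    }

    void decommissionNode(String node) {
        decommissionNodes.add(node);
    }

    // A registered rack counts as empty when every node on it is decommissioned.
    int numOfEmptyRacks() {
        int empty = 0;
        for (Set<String> nodes : nodesByRack.values()) {
            if (decommissionNodes.containsAll(nodes)) {
                empty++;
            }
        }
        return empty;
    }
}
```

Replaying the three steps with one node per rack: after decommissionNode the count is 1, after removeNode it is 0, and after addNode for the same node it is 1 again, because the node is still in the decommissioned set when it re-registers.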
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Attachment: HDFS-16456.010.patch > EC: Decommission a rack with only on dn will fail when the rack number is > equal with replication > > > Key: HDFS-16456 > URL: https://issues.apache.org/jira/browse/HDFS-16456 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Priority: Critical > Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, > HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, > HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, > HDFS-16456.009.patch, HDFS-16456.010.patch > > > In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason: > # Enable EC policy, such as RS-6-3-1024k. > # The rack number in this cluster is equal with or less than the replication > number(9) > # A rack only has one DN, and decommission this DN. > The root cause is in > BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will > give a limit parameter maxNodesPerRack for choose targets. In this scenario, > the maxNodesPerRack is 1, which means each rack can only be chosen one > datanode. > {code:java} > protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) { >... > // If more replicas than racks, evenly spread the replicas. > // This calculation rounds up. 
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > return new int[] {numOfReplicas, maxNodesPerRack}; > } {code} > int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > here will be called, where totalNumOfReplicas=9 and numOfRacks=9 > When we decommission one dn which is only one node in its rack, the > chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() > will throw NotEnoughReplicasException, but the exception will not be caught > and fail to fallback to chooseEvenlyFromRemainingRacks() function. > When decommission, after choose targets, verifyBlockPlacement() function will > return the total rack number contains the invalid rack, and > BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false > and it will also cause decommission fail. > {code:java} > public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs, > int numberOfReplicas) { > if (locs == null) > locs = DatanodeDescriptor.EMPTY_ARRAY; > if (!clusterMap.hasClusterEverBeenMultiRack()) { > // only one rack > return new BlockPlacementStatusDefault(1, 1, 1); > } > // Count locations on different racks. > Set racks = new HashSet<>(); > for (DatanodeInfo dn : locs) { > racks.add(dn.getNetworkLocation()); > } > return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas, > clusterMap.getNumOfRacks()); > } {code} > {code:java} > public boolean isPlacementPolicySatisfied() { > return requiredRacks <= currentRacks || currentRacks >= totalRacks; > }{code} > According to the above description, we should make the below modify to fix it: > # In startDecommission() or stopDecommission(), we should also change the > numOfRacks in class NetworkTopology. Or choose targets may fail for the > maxNodesPerRack is too small. And even choose targets success, > isPlacementPolicySatisfied will also return false cause decommission fail. 
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first > chooseOnce() function should also be put in try..catch..., or it will not > fallback to call chooseEvenlyFromRemainingRacks() when throw exception. > # In verifyBlockPlacement, we need to remove invalid racks from total > numOfRacks, or isPlacementPolicySatisfied() will return false and cause fail > to reconstruct data. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
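The second fix in the list above, wrapping the first chooseOnce() call in try/catch so that a fallback can run, follows a general pattern that can be sketched without Hadoop classes. FallbackChooser, PlacementFailed, and the two Runnable parameters are hypothetical; PlacementFailed is an unchecked stand-in for NotEnoughReplicasException, chosen only to keep the sketch short.

```java
// Hypothetical illustration of the try/catch fallback pattern, not the patch itself.
final class FallbackChooser {
    private FallbackChooser() {}

    // Unchecked stand-in for a placement failure (assumption for this sketch).
    static final class PlacementFailed extends RuntimeException {
        PlacementFailed(String message) {
            super(message);
        }
    }

    // Runs the strict choice first. If it fails, the failure is caught here
    // so that the relaxed choice still gets a chance to run.
    static String chooseWithFallback(Runnable strictChoice, Runnable relaxedChoice) {
        try {
            strictChoice.run();
            return "strict";
        } catch (PlacementFailed e) {
            relaxedChoice.run();
            return "relaxed";
        }
    }
}
```

Without the catch block, a failure in the strict step would propagate to the caller and the relaxed step would never be attempted, which is the behavior the description reports for the first chooseOnce() call.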
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Status: Patch Available (was: Open) > EC: Decommission a rack with only on dn will fail when the rack number is > equal with replication > > > Key: HDFS-16456 > URL: https://issues.apache.org/jira/browse/HDFS-16456 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Priority: Critical > Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, > HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, > HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, > HDFS-16456.009.patch, HDFS-16456.010.patch > > > In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason: > # Enable EC policy, such as RS-6-3-1024k. > # The rack number in this cluster is equal with or less than the replication > number(9) > # A rack only has one DN, and decommission this DN. > The root cause is in > BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will > give a limit parameter maxNodesPerRack for choose targets. In this scenario, > the maxNodesPerRack is 1, which means each rack can only be chosen one > datanode. > {code:java} > protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) { >... > // If more replicas than racks, evenly spread the replicas. > // This calculation rounds up. 
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > return new int[] {numOfReplicas, maxNodesPerRack}; > } {code} > int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > here will be called, where totalNumOfReplicas=9 and numOfRacks=9 > When we decommission one dn which is only one node in its rack, the > chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() > will throw NotEnoughReplicasException, but the exception will not be caught > and fail to fallback to chooseEvenlyFromRemainingRacks() function. > When decommission, after choose targets, verifyBlockPlacement() function will > return the total rack number contains the invalid rack, and > BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false > and it will also cause decommission fail. > {code:java} > public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs, > int numberOfReplicas) { > if (locs == null) > locs = DatanodeDescriptor.EMPTY_ARRAY; > if (!clusterMap.hasClusterEverBeenMultiRack()) { > // only one rack > return new BlockPlacementStatusDefault(1, 1, 1); > } > // Count locations on different racks. > Set racks = new HashSet<>(); > for (DatanodeInfo dn : locs) { > racks.add(dn.getNetworkLocation()); > } > return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas, > clusterMap.getNumOfRacks()); > } {code} > {code:java} > public boolean isPlacementPolicySatisfied() { > return requiredRacks <= currentRacks || currentRacks >= totalRacks; > }{code} > According to the above description, we should make the below modify to fix it: > # In startDecommission() or stopDecommission(), we should also change the > numOfRacks in class NetworkTopology. Or choose targets may fail for the > maxNodesPerRack is too small. And even choose targets success, > isPlacementPolicySatisfied will also return false cause decommission fail. 
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first > chooseOnce() function should also be put in try..catch..., or it will not > fallback to call chooseEvenlyFromRemainingRacks() when throw exception. > # In verifyBlockPlacement, we need to remove invalid racks from total > numOfRacks, or isPlacementPolicySatisfied() will return false and cause fail > to reconstruct data. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Status: Open (was: Patch Available) > EC: Decommission a rack with only on dn will fail when the rack number is > equal with replication > > > Key: HDFS-16456 > URL: https://issues.apache.org/jira/browse/HDFS-16456 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Priority: Critical > Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, > HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, > HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, > HDFS-16456.009.patch, HDFS-16456.010.patch > > > In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason: > # Enable EC policy, such as RS-6-3-1024k. > # The rack number in this cluster is equal with or less than the replication > number(9) > # A rack only has one DN, and decommission this DN. > The root cause is in > BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will > give a limit parameter maxNodesPerRack for choose targets. In this scenario, > the maxNodesPerRack is 1, which means each rack can only be chosen one > datanode. > {code:java} > protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) { >... > // If more replicas than racks, evenly spread the replicas. > // This calculation rounds up. 
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > return new int[] {numOfReplicas, maxNodesPerRack}; > } {code} > int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > here will be called, where totalNumOfReplicas=9 and numOfRacks=9 > When we decommission one dn which is only one node in its rack, the > chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() > will throw NotEnoughReplicasException, but the exception will not be caught > and fail to fallback to chooseEvenlyFromRemainingRacks() function. > When decommission, after choose targets, verifyBlockPlacement() function will > return the total rack number contains the invalid rack, and > BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false > and it will also cause decommission fail. > {code:java} > public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs, > int numberOfReplicas) { > if (locs == null) > locs = DatanodeDescriptor.EMPTY_ARRAY; > if (!clusterMap.hasClusterEverBeenMultiRack()) { > // only one rack > return new BlockPlacementStatusDefault(1, 1, 1); > } > // Count locations on different racks. > Set racks = new HashSet<>(); > for (DatanodeInfo dn : locs) { > racks.add(dn.getNetworkLocation()); > } > return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas, > clusterMap.getNumOfRacks()); > } {code} > {code:java} > public boolean isPlacementPolicySatisfied() { > return requiredRacks <= currentRacks || currentRacks >= totalRacks; > }{code} > According to the above description, we should make the below modify to fix it: > # In startDecommission() or stopDecommission(), we should also change the > numOfRacks in class NetworkTopology. Or choose targets may fail for the > maxNodesPerRack is too small. And even choose targets success, > isPlacementPolicySatisfied will also return false cause decommission fail. 
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first > chooseOnce() function should also be put in try..catch..., or it will not > fallback to call chooseEvenlyFromRemainingRacks() when throw exception. > # In verifyBlockPlacement, we need to remove invalid racks from total > numOfRacks, or isPlacementPolicySatisfied() will return false and cause fail > to reconstruct data. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512981#comment-17512981 ]

caozhiqiang commented on HDFS-16456:

[~tasanuma] Thank you for your review. I have modified this patch according to your advice. In addition, I optimized the logic of interAddNodeWithEmptyRack() and interRemoveNodeWithEmptyRack() to handle special scenarios such as repeatedly decommissioning, stopping, and starting the same node. I implemented this in two ways, in [^HDFS-16456.008.patch] and [^HDFS-16456.009.patch]. Both work fine; I prefer [^HDFS-16456.009.patch] because its logic is simpler and easier to understand. Please share your advice.

> EC: Decommission a rack with only on dn will fail when the rack number is > equal with replication > > > Key: HDFS-16456 > URL: https://issues.apache.org/jira/browse/HDFS-16456 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Priority: Critical > Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, > HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, > HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, > HDFS-16456.009.patch > > > In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason: > # Enable EC policy, such as RS-6-3-1024k. > # The rack number in this cluster is equal with or less than the replication > number(9) > # A rack only has one DN, and decommission this DN. > The root cause is in > BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will > give a limit parameter maxNodesPerRack for choose targets. In this scenario, > the maxNodesPerRack is 1, which means each rack can only be chosen one > datanode. > {code:java} > protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) { >... > // If more replicas than racks, evenly spread the replicas. > // This calculation rounds up.
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > return new int[] {numOfReplicas, maxNodesPerRack}; > } {code} > int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > here will be called, where totalNumOfReplicas=9 and numOfRacks=9 > When we decommission one dn which is only one node in its rack, the > chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() > will throw NotEnoughReplicasException, but the exception will not be caught > and fail to fallback to chooseEvenlyFromRemainingRacks() function. > When decommission, after choose targets, verifyBlockPlacement() function will > return the total rack number contains the invalid rack, and > BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false > and it will also cause decommission fail. > {code:java} > public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs, > int numberOfReplicas) { > if (locs == null) > locs = DatanodeDescriptor.EMPTY_ARRAY; > if (!clusterMap.hasClusterEverBeenMultiRack()) { > // only one rack > return new BlockPlacementStatusDefault(1, 1, 1); > } > // Count locations on different racks. > Set racks = new HashSet<>(); > for (DatanodeInfo dn : locs) { > racks.add(dn.getNetworkLocation()); > } > return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas, > clusterMap.getNumOfRacks()); > } {code} > {code:java} > public boolean isPlacementPolicySatisfied() { > return requiredRacks <= currentRacks || currentRacks >= totalRacks; > }{code} > According to the above description, we should make the below modify to fix it: > # In startDecommission() or stopDecommission(), we should also change the > numOfRacks in class NetworkTopology. Or choose targets may fail for the > maxNodesPerRack is too small. And even choose targets success, > isPlacementPolicySatisfied will also return false cause decommission fail. 
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first > chooseOnce() function should also be put in try..catch..., or it will not > fallback to call chooseEvenlyFromRemainingRacks() when throw exception. > # In verifyBlockPlacement, we need to remove invalid racks from total > numOfRacks, or isPlacementPolicySatisfied() will return false and cause fail > to reconstruct data. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Attachment: HDFS-16456.009.patch
> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, HDFS-16456.009.patch
>
> In the scenario below, decommission fails with the reason TOO_MANY_NODES_ON_RACK:
> # Enable an EC policy, such as RS-6-3-1024k.
> # The number of racks in the cluster is equal to or less than the replication number (9).
> # A rack has only one DN, and that DN is decommissioned.
> The root cause is in BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack(), which computes the limit maxNodesPerRack used when choosing targets. In this scenario maxNodesPerRack is 1, which means only one datanode can be chosen from each rack.
> {code:java}
> protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>   ...
>   // If more replicas than racks, evenly spread the replicas.
>   // This calculation rounds up.
>   int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>   return new int[] {numOfReplicas, maxNodesPerRack};
> } {code}
> The line int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; is executed with totalNumOfReplicas=9 and numOfRacks=9.
> When we decommission a DN that is the only node in its rack, chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws NotEnoughReplicasException, but the exception is not caught, so the code never falls back to chooseEvenlyFromRemainingRacks().
> During decommission, after targets are chosen, verifyBlockPlacement() reports a total rack count that still includes the invalid rack, so BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, which also causes the decommission to fail.
> {code:java}
> public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>     int numberOfReplicas) {
>   if (locs == null)
>     locs = DatanodeDescriptor.EMPTY_ARRAY;
>   if (!clusterMap.hasClusterEverBeenMultiRack()) {
>     // only one rack
>     return new BlockPlacementStatusDefault(1, 1, 1);
>   }
>   // Count locations on different racks.
>   Set<String> racks = new HashSet<>();
>   for (DatanodeInfo dn : locs) {
>     racks.add(dn.getNetworkLocation());
>   }
>   return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>       clusterMap.getNumOfRacks());
> } {code}
> {code:java}
> public boolean isPlacementPolicySatisfied() {
>   return requiredRacks <= currentRacks || currentRacks >= totalRacks;
> }{code}
> Based on the description above, we should make the following modifications to fix it:
> # In startDecommission() or stopDecommission(), we should also change numOfRacks in class NetworkTopology. Otherwise, choosing targets may fail because maxNodesPerRack is too small, and even if choosing targets succeeds, isPlacementPolicySatisfied() will still return false and cause the decommission to fail.
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first chooseOnce() call should also be wrapped in try/catch; otherwise it will not fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
> # In verifyBlockPlacement(), we need to exclude invalid racks from the total numOfRacks; otherwise isPlacementPolicySatisfied() will return false and data reconstruction will fail.
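To make the arithmetic in the description above concrete, here is a minimal standalone sketch that reuses the two expressions quoted from getMaxNodesPerRack() and isPlacementPolicySatisfied(). The class name `RackPlacementSketch` and its static helpers are hypothetical, and the input values (9 replicas, 9 racks, and 8 racks holding live replicas once the single-DN rack is being decommissioned) are assumptions taken from the scenario in the issue, not measured data.

```java
// Hypothetical standalone sketch; the class and method names are illustrative only.
final class RackPlacementSketch {

  // Same rounding-up division as in the quoted getMaxNodesPerRack() excerpt.
  static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
    return (totalNumOfReplicas - 1) / numOfRacks + 1;
  }

  // Same predicate as in the quoted isPlacementPolicySatisfied() excerpt.
  static boolean isPlacementPolicySatisfied(int currentRacks, int requiredRacks,
      int totalRacks) {
    return requiredRacks <= currentRacks || currentRacks >= totalRacks;
  }

  public static void main(String[] args) {
    // 9 replicas over 9 racks: the per-rack limit is 1.
    System.out.println(maxNodesPerRack(9, 9));
    // If the decommissioning rack were not counted (8 racks), the limit would be 2.
    System.out.println(maxNodesPerRack(9, 8));
    // Replicas on 8 racks, 9 required, 9 racks counted: not satisfied.
    System.out.println(isPlacementPolicySatisfied(8, 9, 9));
    // Same placement with the invalid rack excluded from the total: satisfied.
    System.out.println(isPlacementPolicySatisfied(8, 9, 8));
  }
}
```

With the rack total left at 9, the limit of one node per rack leaves no valid target once the single-DN rack is excluded, and the placement check cannot pass; with the total reduced to 8, both conditions relax, which matches the first and third fixes proposed in the issue.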
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Status: Open (was: Patch Available)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Attachment: HDFS-16456.008.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Attachment: (was: HDFS-16456.008.patch)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Attachment: HDFS-16456.008.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated HDFS-16456: --- Status: Open (was: Patch Available)
[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506737#comment-17506737 ]

caozhiqiang commented on HDFS-16456:

[~tasanuma], please help review this patch if you have time. Thanks. :)

> EC: Decommission a rack with only on dn will fail when the rack number is
> equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch, HDFS-16456.007.patch
>
> In the scenario below, decommission fails with reason TOO_MANY_NODES_ON_RACK:
> # Enable an EC policy, such as RS-6-3-1024k.
> # The number of racks in the cluster is less than or equal to the replication
> number (9).
> # A rack has only one DN, and that DN is decommissioned.
> The root cause is in
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack(), which returns the
> limit maxNodesPerRack used when choosing targets. In this scenario
> maxNodesPerRack is 1, which means only one datanode can be chosen per rack.
> {code:java}
> protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>   ...
>   // If more replicas than racks, evenly spread the replicas.
>   // This calculation rounds up.
>   int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>   return new int[] {numOfReplicas, maxNodesPerRack};
> } {code}
> The line
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> is executed with totalNumOfReplicas=9 and numOfRacks=9.
> When we decommission a DN that is the only node in its rack, chooseOnce() in
> BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws
> NotEnoughReplicasException, but the exception is not caught, so there is no
> fallback to chooseEvenlyFromRemainingRacks().
> During decommission, after targets are chosen, verifyBlockPlacement() reports
> a total rack count that still includes the invalid (decommissioning) rack, so
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, which
> also causes the decommission to fail.
> {code:java}
> public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>     int numberOfReplicas) {
>   if (locs == null)
>     locs = DatanodeDescriptor.EMPTY_ARRAY;
>   if (!clusterMap.hasClusterEverBeenMultiRack()) {
>     // only one rack
>     return new BlockPlacementStatusDefault(1, 1, 1);
>   }
>   // Count locations on different racks.
>   Set<String> racks = new HashSet<>();
>   for (DatanodeInfo dn : locs) {
>     racks.add(dn.getNetworkLocation());
>   }
>   return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>       clusterMap.getNumOfRacks());
> } {code}
> {code:java}
> public boolean isPlacementPolicySatisfied() {
>   return requiredRacks <= currentRacks || currentRacks >= totalRacks;
> }{code}
> Based on the above, the following changes are needed to fix it:
> # In startDecommission() or stopDecommission(), also update numOfRacks in
> class NetworkTopology. Otherwise choosing targets may fail because
> maxNodesPerRack is too small, and even if choosing targets succeeds,
> isPlacementPolicySatisfied() still returns false and the decommission fails.
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first
> chooseOnce() call should also be wrapped in try/catch; otherwise it does not
> fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
> # In verifyBlockPlacement(), remove invalid racks from the total numOfRacks;
> otherwise isPlacementPolicySatisfied() returns false and data reconstruction
> fails.
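The rounding-up arithmetic from the quoted getMaxNodesPerRack() can be tried in isolation. The following is only a sketch: the class name MaxNodesPerRackSketch is hypothetical, only the one-line formula comes from the issue, and the second call assumes that numOfRacks drops to 8 once the decommissioning rack is excluded, as the first proposed fix suggests.

```java
// Illustrative sketch only; the class name is hypothetical and the rest of
// getMaxNodesPerRack() (elided as "..." in the issue) is not reproduced.
final class MaxNodesPerRackSketch {

    // Integer ceiling of totalNumOfReplicas / numOfRacks, written exactly
    // as in the quoted formula.
    static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
        return (totalNumOfReplicas - 1) / numOfRacks + 1;
    }

    public static void main(String[] args) {
        // Scenario from the issue: 9 replicas (RS-6-3) on 9 racks.
        System.out.println(maxNodesPerRack(9, 9));

        // Assumed: the decommissioning rack is no longer counted.
        System.out.println(maxNodesPerRack(9, 8));
    }
}
```

With 9 racks the limit is 1 node per rack, so the block held by the decommissioning rack has nowhere to go; with 8 racks the limit becomes 2, which leaves room for a second node on one of the remaining racks.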
[jira] [Updated] (HDFS-16497) EC: Add param comment for liveBusyBlockIndices with HDFS-14768
[ https://issues.apache.org/jira/browse/HDFS-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16497:
---
    Status: Open  (was: Patch Available)

> EC: Add param comment for liveBusyBlockIndices with HDFS-14768
>
> Key: HDFS-16497
> URL: https://issues.apache.org/jira/browse/HDFS-16497
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, namanode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Minor
> Attachments: HDFS-16497.001.patch
>
> BlockManager::getDatanodeDescriptorFromStorage(), as changed in HDFS-14768,
> is missing a param comment for liveBusyBlockIndices; one should be added.
[jira] [Updated] (HDFS-16497) EC: Add param comment for liveBusyBlockIndices with HDFS-14768
[ https://issues.apache.org/jira/browse/HDFS-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16497:
---
    Status: Patch Available  (was: Open)

> EC: Add param comment for liveBusyBlockIndices with HDFS-14768
>
> Key: HDFS-16497
> URL: https://issues.apache.org/jira/browse/HDFS-16497
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, namanode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Minor
> Attachments: HDFS-16497.001.patch
>
> BlockManager::getDatanodeDescriptorFromStorage(), as changed in HDFS-14768,
> is missing a param comment for liveBusyBlockIndices; one should be added.
[jira] [Updated] (HDFS-16497) EC: Add param comment for liveBusyBlockIndices with HDFS-14768
[ https://issues.apache.org/jira/browse/HDFS-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16497:
---
    Attachment: HDFS-16497.001.patch

> EC: Add param comment for liveBusyBlockIndices with HDFS-14768
>
> Key: HDFS-16497
> URL: https://issues.apache.org/jira/browse/HDFS-16497
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, namanode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Minor
> Attachments: HDFS-16497.001.patch
>
> BlockManager::getDatanodeDescriptorFromStorage(), as changed in HDFS-14768,
> is missing a param comment for liveBusyBlockIndices; one should be added.
[jira] [Updated] (HDFS-16497) EC: Add param comment for liveBusyBlockIndices with HDFS-14768
[ https://issues.apache.org/jira/browse/HDFS-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16497:
---
    External issue ID:   (was: HDFS-14768)
    External issue URL:   (was: https://issues.apache.org/jira/browse/HDFS-14768)

> EC: Add param comment for liveBusyBlockIndices with HDFS-14768
>
> Key: HDFS-16497
> URL: https://issues.apache.org/jira/browse/HDFS-16497
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, namanode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Minor
>
> BlockManager::getDatanodeDescriptorFromStorage(), as changed in HDFS-14768,
> is missing a param comment for liveBusyBlockIndices; one should be added.
[jira] [Created] (HDFS-16497) EC: Add param comment for liveBusyBlockIndices with HDFS-14768
caozhiqiang created HDFS-16497:
--

Summary: EC: Add param comment for liveBusyBlockIndices with HDFS-14768
Key: HDFS-16497
URL: https://issues.apache.org/jira/browse/HDFS-16497
Project: Hadoop HDFS
Issue Type: Improvement
Components: erasure-coding, namanode
Affects Versions: 3.4.0
Reporter: caozhiqiang

BlockManager::getDatanodeDescriptorFromStorage(), as changed in HDFS-14768,
is missing a param comment for liveBusyBlockIndices; one should be added.
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Status: Patch Available  (was: Open)

> EC: Decommission a rack with only on dn will fail when the rack number is
> equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch, HDFS-16456.007.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Attachment: HDFS-16456.007.patch

> EC: Decommission a rack with only on dn will fail when the rack number is
> equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch, HDFS-16456.007.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Status: Open  (was: Patch Available)

> EC: Decommission a rack with only on dn will fail when the rack number is
> equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch, HDFS-16456.007.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Attachment: (was: HDFS-16456.006.patch)

> EC: Decommission a rack with only on dn will fail when the rack number is
> equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Attachment: HDFS-16456.006.patch

> EC: Decommission a rack with only on dn will fail when the rack number is
> equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498527#comment-17498527 ]

caozhiqiang commented on HDFS-16456:

[~tasanuma], some HDFS unit tests are still failing. They do not seem to be related to my changes. Could you give me some suggestions?

> EC: Decommission a rack with only on dn will fail when the rack number is
> equal with replication
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Attachment: HDFS-16456.006.patch

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
>
> In the scenario below, decommission fails with reason TOO_MANY_NODES_ON_RACK:
> # Enable an EC policy, such as RS-6-3-1024k.
> # The number of racks in the cluster is equal to or less than the
> replication number (9).
> # A rack has only one DN, and that DN is decommissioned.
> The root cause is in
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack(), which computes
> the limit maxNodesPerRack used when choosing targets. In this scenario
> maxNodesPerRack is 1, which means only one datanode can be chosen from each
> rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>    ...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> The line
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> is reached with totalNumOfReplicas=9 and numOfRacks=9.
> When we decommission a DN that is the only node in its rack, chooseOnce() in
> BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws
> NotEnoughReplicasException, but the exception is not caught, so there is no
> fallback to chooseEvenlyFromRemainingRacks().
> During decommission, after targets are chosen, verifyBlockPlacement() returns
> a total rack count that still includes the invalid rack, so
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, which
> also causes the decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above, the following changes are needed to fix it:
> # In startDecommission() or stopDecommission(), the numOfRacks in class
> NetworkTopology should also be updated. Otherwise choosing targets may fail
> because maxNodesPerRack is too small, and even if choosing targets succeeds,
> isPlacementPolicySatisfied() still returns false and the decommission fails.
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first
> chooseOnce() call should also be wrapped in try/catch; otherwise it does not
> fall back to chooseEvenlyFromRemainingRacks() when the exception is thrown.
> # In verifyBlockPlacement(), invalid racks need to be removed from the total
> numOfRacks; otherwise isPlacementPolicySatisfied() returns false and data
> reconstruction fails.
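To make the numbers in the description concrete, the sketch below replays the two quoted expressions as plain arithmetic. It is only an illustration: RackLimitSketch and its static helpers are hypothetical names, not Hadoop classes, and the figure of 8 usable racks (and 8 racks actually holding blocks) is an assumption about what remains once the single-DN rack is excluded.

```java
public class RackLimitSketch {
  // Same rounding-up formula as quoted from getMaxNodesPerRack().
  static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
    return (totalNumOfReplicas - 1) / numOfRacks + 1;
  }

  // Same predicate as quoted from isPlacementPolicySatisfied(); the argument
  // order follows the BlockPlacementStatusDefault constructor call above.
  static boolean isPlacementPolicySatisfied(
      int currentRacks, int requiredRacks, int totalRacks) {
    return requiredRacks <= currentRacks || currentRacks >= totalRacks;
  }

  public static void main(String[] args) {
    int replicas = 9;       // RS-6-3: 6 data + 3 parity blocks
    int racksReported = 9;  // still counts the rack being decommissioned
    int racksUsable = 8;    // assumption: the single-DN rack is excluded

    // With the stale rack count, each rack may hold only one block.
    System.out.println(maxNodesPerRack(replicas, racksReported));
    // With the adjusted rack count, two blocks per rack are allowed.
    System.out.println(maxNodesPerRack(replicas, racksUsable));

    // Blocks spread over 8 racks, 9 required, 9 racks reported: not satisfied.
    System.out.println(isPlacementPolicySatisfied(8, replicas, racksReported));
    // Same placement with the invalid rack removed from the total: satisfied.
    System.out.println(isPlacementPolicySatisfied(8, replicas, racksUsable));
  }
}
```

Under these assumptions, both failures trace back to the same stale rack count: it keeps maxNodesPerRack at 1 during target selection and keeps totalRacks at 9 during verification.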
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Status: Open (was: Patch Available)

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Status: Patch Available (was: Open)

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Attachment: (was: HDFS-16456.006.patch)

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Status: Patch Available (was: Open)

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Status: Open (was: Patch Available)

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Attachment: HDFS-16456.006.patch

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch,
> HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Attachment: (was: HDFS-16456.006.patch)

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch,
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
-------------------------------
    Status: Open (was: Patch Available)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
-------------------------------
    Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
-------------------------------
    Attachment: HDFS-16456.006.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
-------------------------------
    Attachment: HDFS-16456.005.patch
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
-------------------------------
    Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
-------------------------------
    Attachment: (was: HDFS-16456.005.patch)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
-------------------------------
    Status: Open (was: Patch Available)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
-------------------------------
    Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caozhiqiang updated HDFS-16456:
---
    Attachment:     (was: HDFS-16456.005.patch)

> EC: Decommission a rack with only on dn will fail when the rack number is equal with replication
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Priority: Critical
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch
>
>
> In the scenario below, decommission fails with the reason TOO_MANY_NODES_ON_RACK:
> # Enable an EC policy, such as RS-6-3-1024k.
> # The number of racks in the cluster is equal to or less than the replication number (9).
> # A rack has only one DN, and that DN is decommissioned.
> The root cause is in BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack(), which computes the limit maxNodesPerRack used when choosing targets. In this scenario maxNodesPerRack is 1, which means only one datanode can be chosen from each rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>    ...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> The line
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> is executed with totalNumOfReplicas=9 and numOfRacks=9, which gives (9 - 1) / 9 + 1 = 1 in integer arithmetic.
> When we decommission a DN that is the only node in its rack, chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws NotEnoughReplicasException, but the exception is not caught, so the code never falls back to chooseEvenlyFromRemainingRacks().
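The per-rack limit described above can be checked in isolation. The sketch below is not the Hadoop implementation: `MaxNodesPerRackSketch` and its method are hypothetical names, and only the rounding-up formula is taken from the quoted snippet. The second call assumes, for illustration only, that the rack being drained is no longer counted.

```java
// Hypothetical stand-alone sketch; only the formula comes from the quoted code.
class MaxNodesPerRackSketch {
  // Rounds up totalNumOfReplicas / numOfRacks using integer arithmetic.
  static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
    return (totalNumOfReplicas - 1) / numOfRacks + 1;
  }

  public static void main(String[] args) {
    // Scenario from the report: 9 replicas and 9 racks, so one node per rack.
    System.out.println(maxNodesPerRack(9, 9)); // prints 1
    // Assumption: if the drained rack were excluded (8 racks), the limit
    // would rise to 2, leaving room for a replacement target.
    System.out.println(maxNodesPerRack(9, 8)); // prints 2
  }
}
```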
> During decommission, after targets are chosen, verifyBlockPlacement() reports a total rack count that still includes the invalid rack, so BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, which also causes decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> Based on the above, the following changes are needed to fix it:
> # In startDecommission() or stopDecommission(), also update numOfRacks in class NetworkTopology. Otherwise choosing targets may fail because maxNodesPerRack is too small, and even if choosing targets succeeds, isPlacementPolicySatisfied() still returns false and decommission fails.
> # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first chooseOnce() call should also be wrapped in try/catch; otherwise it does not fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
> # In verifyBlockPlacement(), remove invalid racks from the total numOfRacks; otherwise isPlacementPolicySatisfied() returns false and data reconstruction fails.
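To see why the placement check fails and how the third fix item changes the outcome, the predicate can be evaluated on its own. This is a sketch under stated assumptions rather than the actual patch: `PlacementCheckSketch` is a hypothetical class, the predicate is copied from the quoted isPlacementPolicySatisfied(), and the input values (8 racks holding replicas, 9 required) are assumed from the scenario in the report.

```java
// Hypothetical stand-alone sketch; parameter names mirror the quoted snippet.
class PlacementCheckSketch {
  // Same predicate as the quoted isPlacementPolicySatisfied().
  static boolean isSatisfied(int currentRacks, int requiredRacks, int totalRacks) {
    return requiredRacks <= currentRacks || currentRacks >= totalRacks;
  }

  public static void main(String[] args) {
    // Assumed scenario: the topology reports 9 racks, but one rack holds only
    // the decommissioning DN, so replicas can occupy at most 8 racks.
    System.out.println(isSatisfied(8, 9, 9)); // prints false
    // Assumption: the invalid rack is removed from the total, as fix item 3 proposes.
    System.out.println(isSatisfied(8, 9, 8)); // prints true
  }
}
```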