[jira] [Commented] (HDFS-16101) Remove unused variable and IOException in ProvidedStorageMap
[ https://issues.apache.org/jira/browse/HDFS-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374521#comment-17374521 ] Hudson commented on HDFS-16101: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 37s{color} | | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} yetus {color} | {color:red} 0m 7s{color} | | {color:red} Unprocessed flag(s): --mvn-custom-repos-dir {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/665/artifact/out/Dockerfile | | JIRA Issue | HDFS-16101 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13027419/HDFS-16101.001.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/665/console | | versions | git=2.25.1 | | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > Remove unuse variable and IoException in ProvidedStorageMap > --- > > Key: HDFS-16101 > URL: https://issues.apache.org/jira/browse/HDFS-16101 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16101.001.patch > > > Remove unuse variable and IoException in ProvidedStorageMap -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16108) Incorrect log placeholders used in JournalNodeSyncer
[ https://issues.apache.org/jira/browse/HDFS-16108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated HDFS-16108: - Fix Version/s: 3.3.2 3.2.3 Backported to branch-3.3 and branch-3.2 > Incorrect log placeholders used in JournalNodeSyncer > > > Key: HDFS-16108 > URL: https://issues.apache.org/jira/browse/HDFS-16108 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.2.3, 3.3.2 > > Time Spent: 1.5h > Remaining Estimate: 0h > > When Journal sync thread is using incorrect log placeholders at 2 places: > # When it fails to create dir for downloading log segments > # When it fails to move tmp editFile to current dir > Since these failure logs are important to debug JN sync issues, we should fix > these incorrect placeholders. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
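The placeholder mistake described in HDFS-16108 is a common SLF4J pitfall: printf-style "%s" tokens are not substituted by "{}"-style loggers, so the values silently disappear from the failure message. A minimal, self-contained sketch of the failure mode follows; the tiny formatter and the message text are illustrative stand-ins, not the actual JournalNodeSyncer code:

```java
// Hypothetical sketch: SLF4J-style loggers substitute "{}" placeholders;
// printf-style "%s" tokens are left untouched, so the argument never
// reaches the log line. Names and messages here are illustrative only.
public class PlaceholderDemo {

    // Minimal stand-in for SLF4J's "{}" substitution.
    static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIdx = 0;
        int i = 0;
        while (i < pattern.length()) {
            if (argIdx < args.length && pattern.startsWith("{}", i)) {
                sb.append(args[argIdx++]);
                i += 2;
            } else {
                sb.append(pattern.charAt(i++));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dir = "/data/jn/edits.tmp";
        // Incorrect: printf-style placeholder, the path is silently lost.
        System.out.println(format("Unable to create dir %s", dir));
        // Correct: SLF4J-style placeholder, the path is interpolated.
        System.out.println(format("Unable to create dir {}", dir));
    }
}
```

With an SLF4J-style logger only the "{}" form interpolates; the "%s" form logs the literal token, which is exactly why such failure logs become useless for debugging JN sync issues.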
[jira] [Commented] (HDFS-16101) Remove unused variable and IOException in ProvidedStorageMap
[ https://issues.apache.org/jira/browse/HDFS-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374513#comment-17374513 ] Hudson commented on HDFS-16101: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 38s{color} | | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} yetus {color} | {color:red} 0m 7s{color} | | {color:red} Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --html-report-file --mvn-custom-repos --shelldocs --mvn-javadoc-goals --mvn-custom-repos-dir {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/664/artifact/out/Dockerfile | | JIRA Issue | HDFS-16101 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13027419/HDFS-16101.001.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/664/console | | versions | git=2.25.1 | | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > Remove unuse variable and IoException in ProvidedStorageMap > --- > > Key: HDFS-16101 > URL: https://issues.apache.org/jira/browse/HDFS-16101 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16101.001.patch > > > Remove unuse variable and IoException in ProvidedStorageMap -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16101) Remove unused variable and IOException in ProvidedStorageMap
[ https://issues.apache.org/jira/browse/HDFS-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374511#comment-17374511 ] Hudson commented on HDFS-16101: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 40s{color} | | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} yetus {color} | {color:red} 0m 7s{color} | | {color:red} Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --html-report-file --mvn-custom-repos --shelldocs --mvn-javadoc-goals --mvn-custom-repos-dir {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/662/artifact/out/Dockerfile | | JIRA Issue | HDFS-16101 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13027419/HDFS-16101.001.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/662/console | | versions | git=2.25.1 | | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > Remove unuse variable and IoException in ProvidedStorageMap > --- > > Key: HDFS-16101 > URL: https://issues.apache.org/jira/browse/HDFS-16101 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16101.001.patch > > > Remove unuse variable and IoException in ProvidedStorageMap -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-16110?focusedWorklogId=618499=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618499 ] ASF GitHub Bot logged work on HDFS-16110: - Author: ASF GitHub Bot Created on: 05/Jul/21 04:06 Start Date: 05/Jul/21 04:06 Worklog Time Spent: 10m Work Description: tomscut commented on pull request #3174: URL: https://github.com/apache/hadoop/pull/3174#issuecomment-873767890 Hi @aajisaka @tasanuma @jojochuang @ferhui , could you please review the code? Thanks a lot. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 618499) Time Spent: 40m (was: 0.5h) > Remove unused method reportChecksumFailure in DFSClient > --- > > Key: HDFS-16110 > URL: https://issues.apache.org/jira/browse/HDFS-16110 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Remove unused method reportChecksumFailure and fix some code styles by the > way in DFSClient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16113) Improve CallQueueManager#swapQueue() execution performance
JiangHua Zhu created HDFS-16113: --- Summary: Improve CallQueueManager#swapQueue() execution performance Key: HDFS-16113 URL: https://issues.apache.org/jira/browse/HDFS-16113 Project: Hadoop HDFS Issue Type: Improvement Reporter: JiangHua Zhu In CallQueueManager#swapQueue(), there are some codes: CallQueueManager#swapQueue() { .. while (!queueIsReallyEmpty(oldQ)) {} .. } In queueIsReallyEmpty(): .. for (int i = 0; i ... We found that this implementation has certain performance hindrances in real clusters.
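The loop quoted above can be sketched as follows. The retry count, sleep interval, and method bodies are assumptions for illustration, not the actual CallQueueManager source; the point is the shape of the busy-wait, where each failed pass restarts the full polling loop:

```java
import java.util.concurrent.BlockingQueue;

// Illustrative sketch of the pattern described in HDFS-16113 (not the real
// CallQueueManager code): after a new queue is swapped in, the old queue is
// drained by repeated polling, which can spin under sustained load.
public class SwapQueueSketch {
    static final int CHECK_ATTEMPTS = 5;      // assumed retry count
    static final long CHECK_INTERVAL_MS = 5;  // assumed poll interval

    // Poll the old queue a fixed number of times, sleeping between checks.
    static boolean queueIsReallyEmpty(BlockingQueue<?> q) throws InterruptedException {
        for (int i = 0; i < CHECK_ATTEMPTS; i++) {
            if (!q.isEmpty()) {
                return false;
            }
            Thread.sleep(CHECK_INTERVAL_MS);
        }
        return true;
    }

    // The swap then spins until one full pass sees the old queue empty.
    static void awaitDrained(BlockingQueue<?> oldQ) throws InterruptedException {
        while (!queueIsReallyEmpty(oldQ)) {
            // busy-wait: each failed pass restarts the entire polling loop
        }
    }
}
```

Under load, every element observed in the old queue restarts the whole sleep-and-check pass, which is the performance hindrance the ticket describes.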
[jira] [Assigned] (HDFS-16113) Improve CallQueueManager#swapQueue() execution performance
[ https://issues.apache.org/jira/browse/HDFS-16113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16113: --- Assignee: JiangHua Zhu > Improve CallQueueManager#swapQueue() execution performance > -- > > Key: HDFS-16113 > URL: https://issues.apache.org/jira/browse/HDFS-16113 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > > In CallQueueManager#swapQueue(), there are some codes: > CallQueueManager#swapQueue() { > .. > while (!queueIsReallyEmpty(oldQ)) {} > .. > } > In queueIsReallyEmpty(): > .. > for (int i = 0; i ... > We found that this implementation has certain performance hindrances in real > clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load
[ https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=618493=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618493 ] ASF GitHub Bot logged work on HDFS-16088: - Author: ASF GitHub Bot Created on: 05/Jul/21 03:36 Start Date: 05/Jul/21 03:36 Worklog Time Spent: 10m Work Description: ferhui commented on pull request #3140: URL: https://github.com/apache/hadoop/pull/3140#issuecomment-873759215 LGTM. @Hexiaoqiao Could you please take another look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 618493) Time Spent: 2h 20m (was: 2h 10m) > Standby NameNode process getLiveDatanodeStorageReport request to reduce > Active load > --- > > Key: HDFS-16088 > URL: https://issues.apache.org/jira/browse/HDFS-16088 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Attachments: standyby-ipcserver.jpg > > Time Spent: 2h 20m > Remaining Estimate: 0h > > As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also > request to SNN to reduce the ANN load. > There are two points that need to be mentioned: > 1. FSNamesystem#getDatanodeStorageReport() is OperationCategory.UNCHECKED, > so we can access SNN directly. > 2. We can share the same UT(testBalancerRequestSBNWithHA) with > NameNodeConnector#getBlocks(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
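The routing idea behind HDFS-16088 — operations whose category is UNCHECKED may be served by the Standby NameNode, relieving the Active — can be sketched as below. The enum values mirror the ticket's terminology, but the routing method itself is a hypothetical simplification, not the NameNodeConnector API:

```java
// Hedged sketch of the HDFS-16088 idea: UNCHECKED reads such as
// getDatanodeStorageReport() need no state check, so a standby can serve
// them. This routing helper is illustrative, not actual Hadoop code.
public class ReadRouting {
    enum OperationCategory { READ, WRITE, UNCHECKED }
    enum Target { ACTIVE, STANDBY }

    // Route UNCHECKED operations to the standby when one is reachable;
    // everything else still goes to the active NameNode.
    static Target routeFor(OperationCategory op, boolean standbyAvailable) {
        if (op == OperationCategory.UNCHECKED && standbyAvailable) {
            return Target.STANDBY;
        }
        return Target.ACTIVE;
    }
}
```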
[jira] [Work started] (HDFS-16023) Improve blockReportLeaseId acquisition to avoid repeated FBR
[ https://issues.apache.org/jira/browse/HDFS-16023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16023 started by JiangHua Zhu. --- > Improve blockReportLeaseId acquisition to avoid repeated FBR > > > Key: HDFS-16023 > URL: https://issues.apache.org/jira/browse/HDFS-16023 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When the NameNode receives the data (FBR) from the DataNode, it will put the > data in the queue (BlockReportProcessingThread#queue), and there will be > threads processing them thereafter. > When the DataNode wants to send data (here, FBR) to the NameNode, it will > first obtain a blockReportLeaseId from the NameNode. If the DataNode data > already exists in the queue, there is no need to assign a blockReportLeaseId > to the DataNode again. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
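The check HDFS-16023 proposes can be sketched as follows; class, method, and field names are hypothetical, chosen only to mirror the description above (a lease is withheld while a full block report from the same DataNode is still queued):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the HDFS-16023 proposal (names are hypothetical):
// before issuing a new blockReportLeaseId, skip DataNodes whose FBR is
// already sitting in the processing queue, so they do not re-send it.
public class LeaseSketch {
    // DataNodes whose full block report is already queued for processing.
    private final Set<String> queuedReports = new HashSet<>();
    private long nextLeaseId = 1;

    long requestLease(String datanodeId) {
        if (queuedReports.contains(datanodeId)) {
            return 0; // 0 = no lease: a report from this DN is still pending
        }
        return nextLeaseId++;
    }

    void onReportQueued(String datanodeId) { queuedReports.add(datanodeId); }
    void onReportProcessed(String datanodeId) { queuedReports.remove(datanodeId); }
}
```

Withholding the lease while a report is pending is what prevents the repeated FBR: the DataNode simply retries later instead of re-sending data the NameNode already holds.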
[jira] [Work logged] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-16110?focusedWorklogId=618486=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618486 ] ASF GitHub Bot logged work on HDFS-16110: - Author: ASF GitHub Bot Created on: 05/Jul/21 02:35 Start Date: 05/Jul/21 02:35 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #3174: URL: https://github.com/apache/hadoop/pull/3174#issuecomment-873738971 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 34s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 30m 50s | | trunk passed | | +1 :green_heart: | compile | 1m 0s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | compile | 0m 55s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | checkstyle | 0m 30s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 59s | | trunk passed | | +1 :green_heart: | javadoc | 0m 42s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 0m 38s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 2m 25s | | trunk passed | | +1 :green_heart: | shadedclient | 15m 33s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 48s | | the patch passed | | +1 :green_heart: | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javac | 0m 53s | | the patch passed | | +1 :green_heart: | compile | 0m 44s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | javac | 0m 44s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 19s | | hadoop-hdfs-project/hadoop-hdfs-client: The patch generated 0 new + 41 unchanged - 5 fixed = 41 total (was 46) | | +1 :green_heart: | mvnsite | 0m 47s | | the patch passed | | +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 0m 29s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 2m 29s | | the patch passed | | +1 :green_heart: | shadedclient | 15m 31s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 2m 20s | | hadoop-hdfs-client in the patch passed. | | +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. 
| | | | 78m 56s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3174/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/3174 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell | | uname | Linux b1eab1986fb6 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 74f772a4c74c6a880ebfb71feda4cae935175983 | | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3174/2/testReport/ | | Max. process+thread count | 545 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-client U: hadoop-hdfs-project/hadoop-hdfs-client | |
[jira] [Work started] (HDFS-16107) Split RPC configuration to isolate RPC
[ https://issues.apache.org/jira/browse/HDFS-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16107 started by JiangHua Zhu. --- > Split RPC configuration to isolate RPC > -- > > Key: HDFS-16107 > URL: https://issues.apache.org/jira/browse/HDFS-16107 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > For RPC of different ports, there are some common configurations, such as: > ipc.server.read.threadpool.size > ipc.server.read.connection-queue.size > ipc.server.handler.queue.size > Once we configure these values, it will affect all requests (including client > and requests within the cluster). > It is necessary for us to split these configurations to adapt to different > ports, such as: > ipc.8020.server.read.threadpool.size > ipc.8021.server.read.threadpool.size > ipc.8020.server.read.connection-queue.size > ipc.8021.server.read.connection-queue.size > The advantage of this is to isolate the RPC to deal with the pressure of > requests from all sides. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
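The key-resolution scheme HDFS-16107 proposes can be sketched as a port-qualified lookup with a fallback to the shared key. Only the key names come from the ticket; the resolver class below is a hypothetical illustration, not Hadoop's Configuration API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the lookup HDFS-16107 implies: prefer a port-qualified key such
// as "ipc.8020.server.handler.queue.size", falling back to the shared
// "ipc.server.handler.queue.size". The resolver itself is hypothetical.
public class PerPortConf {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) { props.put(key, value); }

    // Resolve a value for one RPC port, isolating it from other ports.
    String getForPort(int port, String suffix, String defaultValue) {
        String portKey = "ipc." + port + "." + suffix; // e.g. ipc.8020.server.handler.queue.size
        String sharedKey = "ipc." + suffix;            // e.g. ipc.server.handler.queue.size
        if (props.containsKey(portKey)) {
            return props.get(portKey);
        }
        return props.getOrDefault(sharedKey, defaultValue);
    }
}
```

This way a heavily loaded client port (say 8020) can get its own thread-pool and queue sizes while other ports keep the cluster-wide defaults, which is the isolation the ticket is after.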
[jira] [Commented] (HDFS-16107) Split RPC configuration to isolate RPC
[ https://issues.apache.org/jira/browse/HDFS-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374447#comment-17374447 ] JiangHua Zhu commented on HDFS-16107: - [~weichiu] [~sodonnell] [~hexiaoqiao], do you have any new suggestions? Also, if possible, please help review the code I submitted. thank you very much. > Split RPC configuration to isolate RPC > -- > > Key: HDFS-16107 > URL: https://issues.apache.org/jira/browse/HDFS-16107 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > For RPC of different ports, there are some common configurations, such as: > ipc.server.read.threadpool.size > ipc.server.read.connection-queue.size > ipc.server.handler.queue.size > Once we configure these values, it will affect all requests (including client > and requests within the cluster). > It is necessary for us to split these configurations to adapt to different > ports, such as: > ipc.8020.server.read.threadpool.size > ipc.8021.server.read.threadpool.size > ipc.8020.server.read.connection-queue.size > ipc.8021.server.read.connection-queue.size > The advantage of this is to isolate the RPC to deal with the pressure of > requests from all sides. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load
[ https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=618470=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618470 ] ASF GitHub Bot logged work on HDFS-16088: - Author: ASF GitHub Bot Created on: 05/Jul/21 01:54 Start Date: 05/Jul/21 01:54 Worklog Time Spent: 10m Work Description: tomscut commented on pull request #3140: URL: https://github.com/apache/hadoop/pull/3140#issuecomment-873724526 Hi @ferhui @Hexiaoqiao , I extracted a new method and added an seperate UT. Could you please review again? Thanks a lot. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 618470) Time Spent: 2h 10m (was: 2h) > Standby NameNode process getLiveDatanodeStorageReport request to reduce > Active load > --- > > Key: HDFS-16088 > URL: https://issues.apache.org/jira/browse/HDFS-16088 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Attachments: standyby-ipcserver.jpg > > Time Spent: 2h 10m > Remaining Estimate: 0h > > As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also > request to SNN to reduce the ANN load. > There are two points that need to be mentioned: > 1. FSNamesystem#getDatanodeStorageReport() is OperationCategory.UNCHECKED, > so we can access SNN directly. > 2. We can share the same UT(testBalancerRequestSBNWithHA) with > NameNodeConnector#getBlocks(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load
[ https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=618469=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618469 ] ASF GitHub Bot logged work on HDFS-16088: - Author: ASF GitHub Bot Created on: 05/Jul/21 01:51 Start Date: 05/Jul/21 01:51 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #3140: URL: https://github.com/apache/hadoop/pull/3140#issuecomment-873723601 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 32s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 30m 48s | | trunk passed | | +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | compile | 1m 17s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | checkstyle | 1m 3s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 21s | | trunk passed | | +1 :green_heart: | javadoc | 0m 57s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 1m 26s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 3m 5s | | trunk passed | | +1 :green_heart: | shadedclient | 16m 14s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 11s | | the patch passed | | +1 :green_heart: | compile | 1m 14s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javac | 1m 14s | | the patch passed | | +1 :green_heart: | compile | 1m 6s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | javac | 1m 6s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 54s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 13s | | the patch passed | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 1m 21s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 3m 9s | | the patch passed | | +1 :green_heart: | shadedclient | 15m 49s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 230m 32s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 45s | | The patch does not generate ASF License warnings. 
| | | | 314m 1s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3140/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/3140 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell | | uname | Linux 80e59a19ec5e 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / aa944c03b6a5817905ef89506820dd512b47a1bf | | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3140/5/testReport/ | | Max. process+thread count | 3110 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3140/5/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org | This
[jira] [Comment Edited] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
[ https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374443#comment-17374443 ] tomscut edited comment on HDFS-16112 at 7/5/21, 1:50 AM: - Hi [~sodonnell], these unit test TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and TestDecommissioningStatus#testDecommissionStatus recently seems a little flaky, could you please take a look when you have time. Thanks a lot. was (Author: tomscut): Hi [~sodonnell], the unit test TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and TestDecommissioningStatus#testDecommissionStatus recently seems a little flaky, could you please take a look when you have time. Thanks a lot. > Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor > > > Key: HDFS-16112 > URL: https://issues.apache.org/jira/browse/HDFS-16112 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Priority: Minor > > These unit tests > TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and > TestDecommissioningStatus#testDecommissionStatus recently seems a little > flaky, we should fix them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
[ https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16112: --- Description: These unit tests TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and TestDecommissioningStatus#testDecommissionStatus recently seems a little flaky, we should fix them. (was: The unit test TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and TestDecommissioningStatus#testDecommissionStatus recently seems a little flaky, we should fix them.) > Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor > > > Key: HDFS-16112 > URL: https://issues.apache.org/jira/browse/HDFS-16112 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Priority: Minor > > These unit tests > TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and > TestDecommissioningStatus#testDecommissionStatus recently seems a little > flaky, we should fix them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
[ https://issues.apache.org/jira/browse/HDFS-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374443#comment-17374443 ] tomscut commented on HDFS-16112: Hi [~sodonnell], the unit test TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and TestDecommissioningStatus#testDecommissionStatus recently seems a little flaky, could you please take a look when you have time. Thanks a lot. > Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor > > > Key: HDFS-16112 > URL: https://issues.apache.org/jira/browse/HDFS-16112 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Priority: Minor > > The unit test > TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and > TestDecommissioningStatus#testDecommissionStatus recently seems a little > flaky, we should fix them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
tomscut created HDFS-16112: -- Summary: Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor Key: HDFS-16112 URL: https://issues.apache.org/jira/browse/HDFS-16112 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut The unit tests TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and TestDecommissioningStatus#testDecommissionStatus have recently seemed a little flaky; we should fix them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-16108) Incorrect log placeholders used in JournalNodeSyncer
[ https://issues.apache.org/jira/browse/HDFS-16108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui Fei resolved HDFS-16108. Fix Version/s: 3.4.0 Resolution: Fixed > Incorrect log placeholders used in JournalNodeSyncer > > > Key: HDFS-16108 > URL: https://issues.apache.org/jira/browse/HDFS-16108 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > The Journal sync thread uses incorrect log placeholders in 2 places: > # When it fails to create dir for downloading log segments > # When it fails to move tmp editFile to current dir > Since these failure logs are important to debug JN sync issues, we should fix > these incorrect placeholders. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16108) Incorrect log placeholders used in JournalNodeSyncer
[ https://issues.apache.org/jira/browse/HDFS-16108?focusedWorklogId=618467=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618467 ] ASF GitHub Bot logged work on HDFS-16108: - Author: ASF GitHub Bot Created on: 05/Jul/21 01:23 Start Date: 05/Jul/21 01:23 Worklog Time Spent: 10m Work Description: ferhui commented on pull request #3169: URL: https://github.com/apache/hadoop/pull/3169#issuecomment-873715651 @virajjasani Thanks for contribution. @aajisaka @tomscut Thanks for review. Merged to trunk. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 618467) Time Spent: 1h 20m (was: 1h 10m) > Incorrect log placeholders used in JournalNodeSyncer > > > Key: HDFS-16108 > URL: https://issues.apache.org/jira/browse/HDFS-16108 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > When Journal sync thread is using incorrect log placeholders at 2 places: > # When it fails to create dir for downloading log segments > # When it fails to move tmp editFile to current dir > Since these failure logs are important to debug JN sync issues, we should fix > these incorrect placeholders. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16108) Incorrect log placeholders used in JournalNodeSyncer
[ https://issues.apache.org/jira/browse/HDFS-16108?focusedWorklogId=618468=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618468 ] ASF GitHub Bot logged work on HDFS-16108: - Author: ASF GitHub Bot Created on: 05/Jul/21 01:23 Start Date: 05/Jul/21 01:23 Worklog Time Spent: 10m Work Description: ferhui merged pull request #3169: URL: https://github.com/apache/hadoop/pull/3169 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 618468) Time Spent: 1.5h (was: 1h 20m) > Incorrect log placeholders used in JournalNodeSyncer > > > Key: HDFS-16108 > URL: https://issues.apache.org/jira/browse/HDFS-16108 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > When Journal sync thread is using incorrect log placeholders at 2 places: > # When it fails to create dir for downloading log segments > # When it fails to move tmp editFile to current dir > Since these failure logs are important to debug JN sync issues, we should fix > these incorrect placeholders. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
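The notifications above describe mismatched log placeholders in JournalNodeSyncer but do not show the patch itself. As a minimal sketch of this bug class, the following stand-in mimics SLF4J's `{}` substitution and shows how a placeholder/argument mismatch silently drops the very values (paths, exceptions) needed to debug JN sync failures. The `format` helper and the messages are illustrative assumptions, not Hadoop's actual code.

```java
// Minimal stand-in for SLF4J's "{}" placeholder substitution, to illustrate
// why a mismatched placeholder count produces unhelpful log lines.
public class PlaceholderDemo {
    static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIdx = 0, i = 0;
        while (i < pattern.length()) {
            if (i + 1 < pattern.length()
                    && pattern.charAt(i) == '{' && pattern.charAt(i + 1) == '}') {
                // Each "{}" consumes one argument; surplus arguments are dropped,
                // and surplus placeholders are left as literal "{}".
                sb.append(argIdx < args.length ? String.valueOf(args[argIdx++]) : "{}");
                i += 2;
            } else {
                sb.append(pattern.charAt(i++));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Wrong: one placeholder, two arguments -> the destination dir never appears.
        System.out.println(format("Unable to move {} to current dir",
                "/jn/tmp/edits_0001", "/jn/current"));
        // Right: placeholder count matches the argument count.
        System.out.println(format("Unable to move {} to {}",
                "/jn/tmp/edits_0001", "/jn/current"));
    }
}
```

The real fix in HDFS-16108 is simply aligning the placeholder count with the arguments at the two failure-log sites.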
[jira] [Commented] (HDFS-16109) Fix some flaky unit tests since they often time out
[ https://issues.apache.org/jira/browse/HDFS-16109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374440#comment-17374440 ] tomscut commented on HDFS-16109: Thanks [~aajisaka] for the merge. > Fix flaky some unit tests since they offen timeout > -- > > Key: HDFS-16109 > URL: https://issues.apache.org/jira/browse/HDFS-16109 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.3.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Increase timeout for TestBootstrapStandby, TestFsVolumeList and > TestDecommissionWithBackoffMonitor since they offen timeout. > > TestBootstrapStandby: > {code:java} > [ERROR] Tests run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 159.474 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] Tests > run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 159.474 s <<< > FAILURE! 
- in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] > testRateThrottling(org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby) > Time elapsed: 31.262 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 3 milliseconds at java.io.RandomAccessFile.writeBytes(Native Method) at > java.io.RandomAccessFile.write(RandomAccessFile.java:512) at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:947) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:910) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:699) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:642) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:387) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:243) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1224) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:795) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:673) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:760) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:1014) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:989) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1763) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2261) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2231) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby.testRateThrottling(TestBootstrapStandby.java:297) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.lang.Thread.run(Thread.java:748) > {code} > TestFsVolumeList: > {code:java} > [ERROR] Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 190.294 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] > Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 190.294 s > <<< FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] > testAddRplicaProcessorForAddingReplicaInMap(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList) > Time elapsed: 60.028 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 6 milliseconds at sun.misc.Unsafe.park(Native Method) at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at > java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429) at >
[jira] [Updated] (HDFS-16109) Fix some flaky unit tests since they often time out
[ https://issues.apache.org/jira/browse/HDFS-16109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated HDFS-16109: - Issue Type: Bug (was: Wish) > Fix flaky some unit tests since they offen timeout > -- > > Key: HDFS-16109 > URL: https://issues.apache.org/jira/browse/HDFS-16109 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.3.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Increase timeout for TestBootstrapStandby, TestFsVolumeList and > TestDecommissionWithBackoffMonitor since they offen timeout. > > TestBootstrapStandby: > {code:java} > [ERROR] Tests run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 159.474 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] Tests > run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 159.474 s <<< > FAILURE! - in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] > testRateThrottling(org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby) > Time elapsed: 31.262 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 3 milliseconds at java.io.RandomAccessFile.writeBytes(Native Method) at > java.io.RandomAccessFile.write(RandomAccessFile.java:512) at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:947) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:910) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:699) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:642) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:387) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:243) > at > 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1224) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:795) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:673) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:760) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:1014) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:989) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1763) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2261) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2231) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby.testRateThrottling(TestBootstrapStandby.java:297) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.lang.Thread.run(Thread.java:748) > {code} > TestFsVolumeList: > {code:java} > [ERROR] Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 190.294 s <<< 
FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] > Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 190.294 s > <<< FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] > testAddRplicaProcessorForAddingReplicaInMap(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList) > Time elapsed: 60.028 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 6 milliseconds at sun.misc.Unsafe.park(Native Method) at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at > java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429) at > java.util.concurrent.FutureTask.get(FutureTask.java:191) at >
[jira] [Updated] (HDFS-16109) Fix some flaky unit tests since they often time out
[ https://issues.apache.org/jira/browse/HDFS-16109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated HDFS-16109: - Component/s: test > Fix flaky some unit tests since they offen timeout > -- > > Key: HDFS-16109 > URL: https://issues.apache.org/jira/browse/HDFS-16109 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.3.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Increase timeout for TestBootstrapStandby, TestFsVolumeList and > TestDecommissionWithBackoffMonitor since they offen timeout. > > TestBootstrapStandby: > {code:java} > [ERROR] Tests run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 159.474 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] Tests > run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 159.474 s <<< > FAILURE! - in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] > testRateThrottling(org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby) > Time elapsed: 31.262 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 3 milliseconds at java.io.RandomAccessFile.writeBytes(Native Method) at > java.io.RandomAccessFile.write(RandomAccessFile.java:512) at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:947) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:910) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:699) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:642) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:387) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:243) > at > 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1224) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:795) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:673) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:760) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:1014) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:989) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1763) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2261) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2231) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby.testRateThrottling(TestBootstrapStandby.java:297) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.lang.Thread.run(Thread.java:748) > {code} > TestFsVolumeList: > {code:java} > [ERROR] Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 190.294 s <<< 
FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] > Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 190.294 s > <<< FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] > testAddRplicaProcessorForAddingReplicaInMap(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList) > Time elapsed: 60.028 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 6 milliseconds at sun.misc.Unsafe.park(Native Method) at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at > java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429) at >
[jira] [Resolved] (HDFS-16109) Fix some flaky unit tests since they often time out
[ https://issues.apache.org/jira/browse/HDFS-16109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka resolved HDFS-16109. -- Fix Version/s: 3.3.2 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Committed to trunk and branch-3.3. Thank you [~tomscut] for your contribution. > Fix flaky some unit tests since they offen timeout > -- > > Key: HDFS-16109 > URL: https://issues.apache.org/jira/browse/HDFS-16109 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.3.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Increase timeout for TestBootstrapStandby, TestFsVolumeList and > TestDecommissionWithBackoffMonitor since they offen timeout. > > TestBootstrapStandby: > {code:java} > [ERROR] Tests run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 159.474 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] Tests > run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 159.474 s <<< > FAILURE! 
- in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] > testRateThrottling(org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby) > Time elapsed: 31.262 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 3 milliseconds at java.io.RandomAccessFile.writeBytes(Native Method) at > java.io.RandomAccessFile.write(RandomAccessFile.java:512) at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:947) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:910) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:699) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:642) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:387) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:243) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1224) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:795) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:673) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:760) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:1014) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:989) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1763) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2261) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2231) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby.testRateThrottling(TestBootstrapStandby.java:297) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.lang.Thread.run(Thread.java:748) > {code} > TestFsVolumeList: > {code:java} > [ERROR] Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 190.294 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] > Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 190.294 s > <<< FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] > testAddRplicaProcessorForAddingReplicaInMap(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList) > Time elapsed: 60.028 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 6 milliseconds at sun.misc.Unsafe.park(Native Method) at >
[jira] [Work logged] (HDFS-16109) Fix some flaky unit tests since they often time out
[ https://issues.apache.org/jira/browse/HDFS-16109?focusedWorklogId=618461=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618461 ] ASF GitHub Bot logged work on HDFS-16109: - Author: ASF GitHub Bot Created on: 04/Jul/21 23:14 Start Date: 04/Jul/21 23:14 Worklog Time Spent: 10m Work Description: aajisaka merged pull request #3172: URL: https://github.com/apache/hadoop/pull/3172 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 618461) Time Spent: 40m (was: 0.5h) > Fix flaky some unit tests since they offen timeout > -- > > Key: HDFS-16109 > URL: https://issues.apache.org/jira/browse/HDFS-16109 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Increase timeout for TestBootstrapStandby, TestFsVolumeList and > TestDecommissionWithBackoffMonitor since they offen timeout. > > TestBootstrapStandby: > {code:java} > [ERROR] Tests run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 159.474 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] Tests > run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 159.474 s <<< > FAILURE! 
- in > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby[ERROR] > testRateThrottling(org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby) > Time elapsed: 31.262 s <<< > ERROR!org.junit.runners.model.TestTimedOutException: test timed out after > 3 milliseconds at java.io.RandomAccessFile.writeBytes(Native Method) at > java.io.RandomAccessFile.write(RandomAccessFile.java:512) at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:947) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:910) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:699) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:642) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:387) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:243) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1224) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:795) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:673) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:760) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:1014) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:989) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1763) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2261) > at > org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2231) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby.testRateThrottling(TestBootstrapStandby.java:297) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.lang.Thread.run(Thread.java:748) > {code} > TestFsVolumeList: > {code:java} > [ERROR] Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: > 190.294 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] >
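The FailOnTimeout/FutureTask frames in the stack traces quoted above show how JUnit 4 enforces `@Test(timeout = ...)`: the test body runs on a worker thread while the runner waits with a deadline. The plain-Java sketch below is an illustrative stand-in for that mechanism, not JUnit's actual implementation; `runWithTimeout` and its message text are assumptions for demonstration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Rough sketch of JUnit 4's timeout enforcement: the body runs on a worker
// thread and the caller waits with a deadline, which is why slow I/O
// (e.g. Storage.tryLock in the traces above) surfaces as a timeout error.
public class TimeoutSketch {
    static <T> T runWithTimeout(Callable<T> body, long millis) throws Exception {
        FutureTask<T> task = new FutureTask<>(body);
        Thread worker = new Thread(task, "test-worker");
        worker.setDaemon(true);
        worker.start();
        try {
            return task.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            worker.interrupt();
            throw new Exception("test timed out after " + millis + " milliseconds");
        }
    }

    public static void main(String[] args) throws Exception {
        // A body that finishes inside the budget returns normally.
        System.out.println(runWithTimeout(() -> "ok", 5000));
    }
}
```

Raising the annotation's timeout value, as the HDFS-16109 patch does, simply widens the deadline passed to this wait; it does not change what the test does.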
[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.
[ https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-16111: -- Labels: pull-request-available (was: ) > Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes > at datanodes. > --- > > Key: HDFS-16111 > URL: https://issues.apache.org/jira/browse/HDFS-16111 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Zhihai Xu >Assignee: Zhihai Xu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got > failed volumes on a lot of datanodes, which caused some missing blocks at that > time. Although later on we recovered all the missing blocks by symlinking the > path (dfs/dn/current) on the failed volume to a new directory and copying all > the data to the new directory, we missed our SLA and it delayed our upgrade > process on our production cluster for several hours. > When this issue happened, we saw a lot of these exceptions before the > volumes failed on the datanode: > [DataXceiver for client at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] > [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] > datanode.DataNode (BlockReceiver.java:(289)) - IOException in > BlockReceiver constructor: Possible disk error: Failed to create > /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. 
Cause > is > java.io.IOException: No space left on device > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1012) > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291) > at java.lang.Thread.run(Thread.java:748) > > We found this issue happened due to the following two reasons: > First the upgrade process added some extra disk storage on the each disk > volume of the data node: > BlockPoolSliceStorage.doUpgrade > (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445) > is the main upgrade function in the datanode, it will add some extra > storage. 
The extra storage added is all new directories created in > /current//current, although all block data file and block meta data > file are hard-linked with /current//previous after upgrade. Since there > will be a lot of new directories created, this will use some disk space on > each disk volume. > > Second there is a potential bug when picking a disk volume to write a new > block file(replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy, > The code to select a disk will check whether the available space on the > selected disk is more than the size bytes of block file to store > (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86) > But when creating a new block, there will be two files created: one is the > block file blk_, the other is block metadata file blk__.meta, > this is the code when finalizing a block, both block file size and meta data > file size will be updated: > https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391 >
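The headroom check argued for above can be sketched as follows. This is a hedged illustration of the proposed policy, not Hadoop's actual RoundRobinVolumeChoosingPolicy: the class name, the `availableBytesPerVolume` list, and the `additionalAvailableSpace` field stand in for the real volume API and the proposed configuration knob.

```java
import java.util.List;

// Sketch of the proposed headroom check: require the chosen volume to hold
// the block PLUS a configurable reserve (covering the unaccounted .meta file
// and concurrent writers), not just the block itself. All names here are
// illustrative, not Hadoop's actual identifiers.
public class RoundRobinWithReserve {
    private int curVolume = 0;
    private final long additionalAvailableSpace; // proposed extra reserve, in bytes

    public RoundRobinWithReserve(long additionalAvailableSpace) {
        this.additionalAvailableSpace = additionalAvailableSpace;
    }

    /** Returns the index of the chosen volume, or -1 if none has room. */
    public int chooseVolume(List<Long> availableBytesPerVolume, long blockSize) {
        int n = availableBytesPerVolume.size();
        int start = curVolume % n;
        int i = start;
        do {
            long available = availableBytesPerVolume.get(i);
            int candidate = i;
            i = (i + 1) % n;
            // Original policy only checks "available >= blockSize"; the extra
            // reserve guards against the meta file and races with other
            // in-flight writers on the same volume.
            if (available >= blockSize + additionalAvailableSpace) {
                curVolume = i;
                return candidate;
            }
        } while (i != start);
        return -1; // every volume is too full
    }
}
```

With a reserve of zero this degrades to the current behavior, which is why a configuration option (rather than a hard-coded margin) is the shape of the proposed fix.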
[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.
[ https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=618459=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618459 ] ASF GitHub Bot logged work on HDFS-16111: - Author: ASF GitHub Bot Created on: 04/Jul/21 22:36 Start Date: 04/Jul/21 22:36 Worklog Time Spent: 10m Work Description: zhihaixu2012 opened a new pull request #3175: URL: https://github.com/apache/hadoop/pull/3175 …avoid failed volumes at datanodes. Change-Id: Iead25812d4073e3980893e3e76f7d2b03b57442a JIRA: https://issues.apache.org/jira/browse/HDFS-16111 There is a potential bug when picking a disk volume to write a new block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy; the code that selects a disk checks whether the available space on the selected disk is larger than the size in bytes of the block file to store (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86). But when creating a new block, two files are created: the block file blk_ and the block metadata file blk__.meta. This is the code that finalizes a block, where both the block file size and the metadata file size are updated: https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391 The current code only considers the size of the block file, not the size of the block metadata file, when choosing a disk in RoundRobinVolumeChoosingPolicy. Many blocks can be received at the same time (the default maximum number of DataXceiver threads is 4096), so this underestimates the total space needed to write a block, which can cause a disk-full error (No space left on device) when writing a replica. 
Since the size of the block metadata file is not fixed, I suggest adding a configuration (dfs.datanode.round-robin-volume-choosing-policy.additional-available-space) to safeguard disk space when choosing a volume to write new block data in RoundRobinVolumeChoosingPolicy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 618459) Remaining Estimate: 0h Time Spent: 10m > Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes > at datanodes. > --- > > Key: HDFS-16111 > URL: https://issues.apache.org/jira/browse/HDFS-16111 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Zhihai Xu >Assignee: Zhihai Xu >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got > failed volumes on a lot of datanodes, which caused some missing blocks at that > time. Although later on we recovered all the missing blocks by symlinking the > path (dfs/dn/current) on the failed volume to a new directory and copying all > the data to the new directory, we missed our SLA and it delayed our upgrading > process on our production cluster for several hours. > When this issue happened, we saw a lot of these exceptions before the > volume failed on the datanode: > [DataXceiver for client at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] > [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] > datanode.DataNode (BlockReceiver.java:(289)) - IOException in > BlockReceiver constructor: Possible disk error: Failed to create > /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. 
Cause > is > java.io.IOException: No space left on device > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1012) > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532) > at >
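The volume-selection gap described in the pull request above can be illustrated with a minimal sketch. This is not Hadoop's actual RoundRobinVolumeChoosingPolicy code: the class name, the `List<Long>` stand-in for per-volume free space, and the `additionalAvailableSpace` margin (playing the role of the proposed dfs.datanode.round-robin-volume-choosing-policy.additional-available-space setting) are all illustrative.

```java
import java.util.List;

// Illustrative sketch only: a round-robin volume chooser that reserves extra
// headroom beyond the block size, so the blk_*.meta file and concurrent
// writers do not push the volume into a "No space left on device" failure.
public class RoundRobinSketch {
    private int curVolume = 0;
    private final long additionalAvailableSpace; // proposed safety margin, bytes

    public RoundRobinSketch(long additionalAvailableSpace) {
        this.additionalAvailableSpace = additionalAvailableSpace;
    }

    /**
     * Returns the index of the first volume, in round-robin order, whose free
     * space covers the block size plus the configured margin, or -1 if none
     * does (the real policy would throw DiskOutOfSpaceException instead).
     */
    public int chooseVolume(List<Long> availablePerVolume, long blockSize) {
        if (availablePerVolume.isEmpty()) {
            return -1;
        }
        int startVolume = curVolume % availablePerVolume.size();
        int i = startVolume;
        do {
            long available = availablePerVolume.get(i);
            // Trunk's check compares against blockSize alone; adding the
            // margin accounts for the metadata file and in-flight writes.
            if (available >= blockSize + additionalAvailableSpace) {
                curVolume = (i + 1) % availablePerVolume.size();
                return i;
            }
            i = (i + 1) % availablePerVolume.size();
        } while (i != startVolume);
        return -1;
    }
}
```

With a margin of 0 this degrades to (roughly) the current trunk behavior; a positive margin is what the proposed configuration would add.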
[jira] [Work logged] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-16110?focusedWorklogId=618458=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618458 ] ASF GitHub Bot logged work on HDFS-16110: - Author: ASF GitHub Bot Created on: 04/Jul/21 21:56 Start Date: 04/Jul/21 21:56 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #3174: URL: https://github.com/apache/hadoop/pull/3174#issuecomment-873669979 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 34s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 31m 40s | | trunk passed | | +1 :green_heart: | compile | 1m 1s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | compile | 0m 52s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | checkstyle | 0m 30s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 58s | | trunk passed | | +1 :green_heart: | javadoc | 0m 43s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 0m 38s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 2m 29s | | trunk passed | | +1 :green_heart: | shadedclient | 15m 25s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 47s | | the patch passed | | +1 :green_heart: | compile | 0m 52s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javac | 0m 52s | | the patch passed | | +1 :green_heart: | compile | 0m 45s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | javac | 0m 45s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 19s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3174/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-client.txt) | hadoop-hdfs-project/hadoop-hdfs-client: The patch generated 1 new + 41 unchanged - 5 fixed = 42 total (was 46) | | +1 :green_heart: | mvnsite | 0m 47s | | the patch passed | | +1 :green_heart: | javadoc | 0m 33s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 | | +1 :green_heart: | javadoc | 0m 30s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | +1 :green_heart: | spotbugs | 2m 27s | | the patch passed | | +1 :green_heart: | shadedclient | 15m 26s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 2m 19s | | hadoop-hdfs-client in the patch passed. | | +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. 
| | | | 79m 45s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3174/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/3174 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell | | uname | Linux 04166e58be95 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 4fc88b18a6d1c1f3c19abd858954068099394452 | | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 | | Test Results |
[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.
[ https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihai Xu updated HDFS-16111: - Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes. (was: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanode.) > Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes > at datanodes. > --- > > Key: HDFS-16111 > URL: https://issues.apache.org/jira/browse/HDFS-16111 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Zhihai Xu >Assignee: Zhihai Xu >Priority: Major >
[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanode.
[ https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihai Xu updated HDFS-16111: - Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanode. (was: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes.) > Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes > at datanode. > -- > > Key: HDFS-16111 > URL: https://issues.apache.org/jira/browse/HDFS-16111 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Zhihai Xu >Assignee: Zhihai Xu >Priority: Major >
[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes.
[ https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihai Xu updated HDFS-16111: - Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes. (was: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid picking an almost full volume to place a replica. ) > Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes. > -- > > Key: HDFS-16111 > URL: https://issues.apache.org/jira/browse/HDFS-16111 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Zhihai Xu >Assignee: Zhihai Xu >Priority: Major >
[jira] [Created] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid picking an almost full volume to place a replica.
Zhihai Xu created HDFS-16111: Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid picking an almost full volume to place a replica. Key: HDFS-16111 URL: https://issues.apache.org/jira/browse/HDFS-16111 Project: Hadoop HDFS Issue Type: Bug Components: datanode Reporter: Zhihai Xu Assignee: Zhihai Xu When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got failed volumes on a lot of datanodes, which caused some missing blocks at that time. Although later on we recovered all the missing blocks by symlinking the path (dfs/dn/current) on the failed volume to a new directory and copying all the data to the new directory, we missed our SLA and it delayed our upgrading process on our production cluster for several hours. When this issue happened, we saw a lot of these exceptions before the volume failed on the datanode: [DataXceiver for client at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] datanode.DataNode (BlockReceiver.java:(289)) - IOException in BlockReceiver constructor: Possible disk error: Failed to create /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. 
Cause is java.io.IOException: No space left on device at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.createNewFile(File.java:1012) at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302) at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291) at java.lang.Thread.run(Thread.java:748) We found this issue happened due to the following two reasons: First the upgrade process added some extra disk storage on the each disk volume of the data node: BlockPoolSliceStorage.doUpgrade (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445) is the main upgrade function in the datanode, it will add some extra storage. The extra storage added is all new directories created in /current//current, although all block data file and block meta data file are hard-linked with /current//previous after upgrade. 
Since there will be a lot of new directories created, this uses some disk space on each disk volume. Second, there is a potential bug when picking a disk volume to write a new block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy; the code that selects a disk checks whether the available space on the selected disk is larger than the size in bytes of the block file to store (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86). But when creating a new block, two files are created: the block file blk_ and the block metadata file blk__.meta. This is the code that finalizes a block, where both the block file size and the metadata file size are updated: https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391 The current code only considers the size of the block file, not the size of the block metadata file, when choosing a disk in RoundRobinVolumeChoosingPolicy. Many blocks can be received at the same time (the default maximum number of DataXceiver threads is 4096), so this underestimates the total space needed to write a block, which can cause the above disk-full error (No space left on device). Since the size of the block metadata file is not
[jira] [Commented] (HDFS-16100) HA: Improve performance of Standby node transition to Active
[ https://issues.apache.org/jira/browse/HDFS-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374323#comment-17374323 ] Xiaoqiao He commented on HDFS-16100: Thanks [~ayushtkn] for your comments. IMO, it is safe to queue when `storedBlock.getGenerationStamp() <= iblk.getGenerationStamp()` rather than `storedBlock.getGenerationStamp() == iblk.getGenerationStamp()` here. {code:java} + if (!(reportedState == ReplicaState.RBW && + storedBlock.getGenerationStamp() != iblk.getGenerationStamp())) { +.. + } {code} Others look good to me. I will give my +1 once that is fixed. Thanks. > HA: Improve performance of Standby node transition to Active > - > > Key: HDFS-16100 > URL: https://issues.apache.org/jira/browse/HDFS-16100 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.1 >Reporter: wudeyu >Assignee: wudeyu >Priority: Major > Attachments: HDFS-16100.patch > > > pendingDNMessages in Standby is used to process postponed block > reports. Block reports in pendingDNMessages are processed as follows: > # If the GS of a replica is in the future, the Standby Node will process it when the > corresponding edit log (e.g. add_block) is loaded. > # If a replica is corrupted, the Standby Node will process it while it transitions to > Active. > # If a DataNode is removed, the corresponding block reports are removed from > pendingDNMessages. > Obviously, as the number of corrupted replicas grows, the transition takes more time. > In our situation, there were 60 million block reports in pendingDNMessages before the > transition. Processing the block reports cost almost 7 minutes, and the process was > killed by ZKFC. The replica state of most of these block reports is RBW with a wrong > GS (less than that of the stored block in the Standby Node). > In my opinion, the Standby Node could ignore block reports whose replica state is RBW > with a wrong GS, because the Active node/DataNode will remove them later. 
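The condition under review can be sketched in isolation. This is a hypothetical illustration, not the actual BlockManager code: the class, method, and enum here are invented for the example, and the `<=`-vs-`==` choice in the comment above corresponds to which stale reports get dropped.

```java
// Hypothetical sketch of the proposed filtering: an RBW replica whose
// reported generation stamp is behind the stored block's GS is dropped
// instead of being queued in pendingDNMessages, since the Active
// NameNode / DataNode will remove it later anyway.
public class StaleRbwFilter {
    public enum ReplicaState { FINALIZED, RBW, RWR }

    /** Queue the postponed report unless it is an RBW replica with a stale GS. */
    public static boolean shouldQueue(ReplicaState reportedState,
                                      long storedGS, long reportedGS) {
        // "Safe to queue when storedGS <= reportedGS": skip only the
        // RBW case where the reported GS is strictly older.
        boolean staleRbw = reportedState == ReplicaState.RBW
            && storedGS > reportedGS;
        return !staleRbw;
    }
}
```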
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load
[ https://issues.apache.org/jira/browse/HDFS-16088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16088: --- Description: As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also request to SNN to reduce the ANN load. There are two points that need to be mentioned: 1. FSNamesystem#getDatanodeStorageReport() is OperationCategory.UNCHECKED, so we can access SNN directly. 2. We can share the same UT(testBalancerRequestSBNWithHA) with NameNodeConnector#getBlocks(). was: As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also request to SNN to reduce the ANN load. There are two points that need to be mentioned: 1. FSNamesystem#getLiveDatanodeStorageReport() is OperationCategory.UNCHECKED, so we can access SNN directly. 2. We can share the same UT(testBalancerRequestSBNWithHA) with NameNodeConnector#getBlocks(). > Standby NameNode process getLiveDatanodeStorageReport request to reduce > Active load > --- > > Key: HDFS-16088 > URL: https://issues.apache.org/jira/browse/HDFS-16088 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Attachments: standyby-ipcserver.jpg > > Time Spent: 1h 50m > Remaining Estimate: 0h > > As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also > request to SNN to reduce the ANN load. > There are two points that need to be mentioned: > 1. FSNamesystem#getDatanodeStorageReport() is OperationCategory.UNCHECKED, > so we can access SNN directly. > 2. We can share the same UT(testBalancerRequestSBNWithHA) with > NameNodeConnector#getBlocks(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
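Why the UNCHECKED category matters for the change above can be shown with a simplified model. This is not Hadoop's HA state machine (the real check runs through NameNode#checkOperation and the HA state classes, and observer reads complicate the READ case, which this ignores); the names below are invented for illustration.

```java
// Simplified model, for illustration only: a Standby NameNode rejects
// READ/WRITE operations (StandbyException in real Hadoop), but operations
// declared as UNCHECKED, such as getDatanodeStorageReport(), are served.
// That is what lets the balancer's NameNodeConnector query the SNN.
public class HaOpCheckSketch {
    public enum OperationCategory { READ, WRITE, CHECKPOINT, JOURNAL, UNCHECKED }
    public enum HAState { ACTIVE, STANDBY }

    public static boolean isOperationAllowed(HAState state, OperationCategory op) {
        if (state == HAState.ACTIVE) {
            return true; // Active serves every category
        }
        // Standby: only operations that opted out of category checking
        return op == OperationCategory.UNCHECKED;
    }
}
```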
[jira] [Updated] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-16110: -- Labels: pull-request-available (was: ) > Remove unused method reportChecksumFailure in DFSClient > --- > > Key: HDFS-16110 > URL: https://issues.apache.org/jira/browse/HDFS-16110 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Remove unused method reportChecksumFailure and fix some code styles by the > way in DFSClient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-16110?focusedWorklogId=618433=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618433 ] ASF GitHub Bot logged work on HDFS-16110: - Author: ASF GitHub Bot Created on: 04/Jul/21 13:54 Start Date: 04/Jul/21 13:54 Worklog Time Spent: 10m Work Description: tomscut opened a new pull request #3174: URL: https://github.com/apache/hadoop/pull/3174 JIRA: [HDFS-16110](https://issues.apache.org/jira/browse/HDFS-16110) Remove unused method reportChecksumFailure and fix some code styles by the way in DFSClient. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 618433) Remaining Estimate: 0h Time Spent: 10m > Remove unused method reportChecksumFailure in DFSClient > --- > > Key: HDFS-16110 > URL: https://issues.apache.org/jira/browse/HDFS-16110 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Remove unused method reportChecksumFailure and fix some code styles by the > way in DFSClient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient
tomscut created HDFS-16110: -- Summary: Remove unused method reportChecksumFailure in DFSClient Key: HDFS-16110 URL: https://issues.apache.org/jira/browse/HDFS-16110 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Remove unused method reportChecksumFailure and fix some code styles along the way in DFSClient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16109) Fix some flaky unit tests since they often time out
[ https://issues.apache.org/jira/browse/HDFS-16109?focusedWorklogId=618418&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618418 ]

ASF GitHub Bot logged work on HDFS-16109:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 04/Jul/21 09:55
Start Date: 04/Jul/21 09:55
Worklog Time Spent: 10m
Work Description: tomscut commented on pull request #3172:
URL: https://github.com/apache/hadoop/pull/3172#issuecomment-873557085

Thanks @aajisaka and @ayushtkn for your review.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 618418)
Time Spent: 0.5h (was: 20m)

> Fix some flaky unit tests since they often time out
> ---------------------------------------------------
>
> Key: HDFS-16109
> Project: Hadoop HDFS
> Issue Type: Wish
> Reporter: tomscut
> Assignee: tomscut
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Increase the timeouts for TestBootstrapStandby, TestFsVolumeList and
> TestDecommissionWithBackoffMonitor since they often time out.
>
> TestBootstrapStandby:
> {code:java}
> [ERROR] Tests run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 159.474 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
> [ERROR] Tests run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 159.474 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
> [ERROR] testRateThrottling(org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby)  Time elapsed: 31.262 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 milliseconds
>   at java.io.RandomAccessFile.writeBytes(Native Method)
>   at java.io.RandomAccessFile.write(RandomAccessFile.java:512)
>   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:947)
>   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:910)
>   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:699)
>   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:642)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:387)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:243)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1224)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:795)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:673)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:760)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1014)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:989)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1763)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2261)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2231)
>   at org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby.testRateThrottling(TestBootstrapStandby.java:297)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>   at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> TestFsVolumeList:
> {code:java}
> [ERROR] Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 190.294 s <<< FAILURE! - in
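The HDFS-16109 change above fixes the flaky tests by enlarging the per-test wall-clock budget that JUnit's FailOnTimeout statement (visible near the bottom of the stack trace) enforces. As a rough, stdlib-only sketch of that mechanism, the following illustrates how a time budget around a task decides pass versus timeout; the `TimeoutDemo` class and `runWithTimeout` method are hypothetical names for illustration, not Hadoop or JUnit code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    // Run a task under a wall-clock budget, returning false if the budget
    // is exceeded. This mirrors, in simplified form, what a test runner's
    // timeout enforcement does around a test body.
    static boolean runWithTimeout(Runnable task, long budgetMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> result = pool.submit(task);
            result.get(budgetMillis, TimeUnit.MILLISECONDS);
            return true;                   // finished within the budget
        } catch (TimeoutException e) {
            return false;                  // budget exceeded: the "flaky" failure
        } catch (Exception e) {
            throw new RuntimeException(e); // the task itself threw
        } finally {
            pool.shutdownNow();            // interrupt a still-running task
        }
    }

    public static void main(String[] args) {
        Runnable fiftyMsTask = () -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
        };
        // The same task passes under a generous budget and fails under a
        // tight one; enlarging the budget is the analogous step the patch
        // takes for tests whose legitimate runtime can exceed the old limit.
        System.out.println(runWithTimeout(fiftyMsTask, 2000)); // true
        System.out.println(runWithTimeout(fiftyMsTask, 5));    // false
    }
}
```

The trade-off is the usual one for timeout bumps: a larger budget removes spurious failures on slow CI hosts at the cost of a slower signal when a test genuinely hangs.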