[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=628867&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-628867
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 28/Jul/21 04:59
Start Date: 28/Jul/21 04:59
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#issuecomment-888010597


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m 25s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 33s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 53s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 47s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m 20s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 48s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 20s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 52s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 58s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 18s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 38s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 45s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 45s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 34s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 34s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 40s |  |  the patch passed  |
   | +1 :green_heart: |  xml  |   0m  2s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  javadoc  |   1m  5s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 45s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   4m 11s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 58s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 418m  8s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 529m 45s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   |   | 
hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
   |   | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
   |   | hadoop.hdfs.web.TestWebHdfsFileSystemContract |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3175 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell xml |
   | uname | Linux 6fa4feae10d8 4.15.0-142-generic #146-Ubuntu SMP Tue Apr 13 
01:11:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 24a19b0d09670448c2e1326ee4d743cb8d84cba1 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=628832&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-628832
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 28/Jul/21 02:19
Start Date: 28/Jul/21 02:19
Worklog Time Spent: 10m 
  Work Description: jojochuang merged pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 628832)
Time Spent: 2h 40m  (was: 2.5h)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although we later recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in 
> BlockReceiver constructor: Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First, the upgrade process consumed some extra disk space on each disk 
> volume of the datanode:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in the datanode; it will add some extra 
> storage. The extra storage added is all new directories created in 
> /current//current, although all block data files and block metadata 
> files are hard-linked with /current//previous after the upgrade. Since there 
> will be a lot of new directories created, this will use some disk space on 
> each disk volume.
>  
> Second, there is a potential bug when picking a disk volume to write a new 
> block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy; 
> the code to select a disk will check whether the 
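(The quoted description is truncated here.) For readers following the discussion, below is a minimal, self-contained sketch of the kind of round-robin selection with an extra free-space margin this issue proposes. It is not the actual Hadoop RoundRobinVolumeChoosingPolicy; the Volume interface and all names in it are illustrative assumptions.

```java
import java.io.IOException;
import java.util.List;

// Simplified sketch only: a round-robin chooser that requires each candidate
// volume to have room for the block PLUS a configurable extra margin, so a
// nearly-full disk is skipped instead of being chosen and later failing with
// "No space left on device".
public class RoundRobinWithMarginSketch {

  /** Illustrative stand-in for a datanode volume. */
  public interface Volume {
    long getAvailable() throws IOException; // free bytes on this volume
  }

  private int curVolume = 0;                    // round-robin cursor
  private final long additionalAvailableSpace;  // extra margin in bytes

  public RoundRobinWithMarginSketch(long additionalAvailableSpace) {
    this.additionalAvailableSpace = additionalAvailableSpace;
  }

  public synchronized Volume chooseVolume(List<? extends Volume> volumes,
                                          long blockSize) throws IOException {
    if (volumes.isEmpty()) {
      throw new IOException("No volumes available");
    }
    curVolume = curVolume % volumes.size();
    final int startVolume = curVolume;
    while (true) {
      Volume volume = volumes.get(curVolume);
      curVolume = (curVolume + 1) % volumes.size();
      // Check block size plus the configured safety margin, not the block
      // size alone.
      if (volume.getAvailable() >= blockSize + additionalAvailableSpace) {
        return volume;
      }
      if (curVolume == startVolume) {
        throw new IOException("Out of space: no volume has at least "
            + (blockSize + additionalAvailableSpace) + " free bytes");
      }
    }
  }
}
```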

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=628374&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-628374
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 27/Jul/21 11:26
Start Date: 27/Jul/21 11:26
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#issuecomment-887432654


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 47s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  30m 40s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 23s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 18s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  4s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 25s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 12s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m 21s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 14s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | -1 :x: |  javac  |   1m 14s | 
[/results-compile-javac-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/3/artifact/out/results-compile-javac-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt)
 |  hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 
with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 generated 1 new + 467 unchanged 
- 1 fixed = 468 total (was 468)  |
   | +1 :green_heart: |  compile  |   1m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | -1 :x: |  javac  |   1m  9s | 
[/results-compile-javac-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/3/artifact/out/results-compile-javac-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt)
 |  
hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
 with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 generated 1 new 
+ 451 unchanged - 1 fixed = 452 total (was 452)  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 54s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 3 new + 338 unchanged 
- 0 fixed = 341 total (was 338)  |
   | +1 :green_heart: |  mvnsite  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  xml  |   0m  1s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  javadoc  |   0m 49s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 11s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  16m 13s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 237m  2s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 45s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 321m 31s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=628175&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-628175
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 27/Jul/21 06:07
Start Date: 27/Jul/21 06:07
Worklog Time Spent: 10m 
  Work Description: zhihaixu2012 commented on pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#issuecomment-887237138


   Updated the patch to fix the above test failures.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 628175)
Time Spent: 2h 20m  (was: 2h 10m)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volume on a lot of datanodes, which cause some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrading 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of this exceptions happened before the 
> volumed failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in the datanode, it will add some extra 
> storage. The extra storage added is all new directories created in 
> /current//current, although all block data file and block meta data 
> file are hard-linked with /current//previous after upgrade. Since there 
> will be a lot of new directories created, this will use some disk space on 
> each disk volume.
>  
> Second there is a potential bug when picking a disk volume to write a new 
> block file(replica). By default, Hadoop uses 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=627591&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627591
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 26/Jul/21 08:53
Start Date: 26/Jul/21 08:53
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on a change in pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r676415640



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   ok 1GB sounds fine to me.
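As a usage note, whatever default lands, the margin remains tunable per cluster. Here is a hedged sketch of overriding it programmatically; the property name comes from the patch hunk above, while the 2 GB value is purely an example, not a recommendation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class SetAdditionalSpaceExample {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    // Example override of the property added by this patch; the same key/value
    // pair could equally be placed in hdfs-site.xml on the datanodes.
    conf.setLong(
        "dfs.datanode.round-robin-volume-choosing-policy.additional-available-space",
        2L * 1024 * 1024 * 1024); // 2 GB, illustrative only
    System.out.println(conf.get(
        "dfs.datanode.round-robin-volume-choosing-policy.additional-available-space"));
  }
}
```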




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 627591)
Time Spent: 2h 10m  (was: 2h)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volume on a lot of datanodes, which cause some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrading 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of this exceptions happened before the 
> volumed failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in the datanode, it will add some extra 
> storage. The extra storage added is all new directories created in 
> /current//current, although all block data file and block meta data 
> file are hard-linked with /current//previous after upgrade. Since there 
> 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=627571&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627571
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 26/Jul/21 07:58
Start Date: 26/Jul/21 07:58
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#issuecomment-886467804


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 44s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 10s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 43s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m 14s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 53s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  9s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 45s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   4m  3s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  18m  9s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m  9s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 51s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 199 unchanged 
- 0 fixed = 201 total (was 199)  |
   | +1 :green_heart: |  mvnsite  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  xml  |   0m  1s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  javadoc  |   0m 47s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m  9s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  15m 52s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 243m  7s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 46s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 338m 42s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand |
   |   | hadoop.hdfs.server.balancer.TestBalancer |
   |   | hadoop.hdfs.TestMiniDFSCluster |
   |   | hadoop.hdfs.server.namenode.TestStripedINodeFile |
   |   | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistFiles |
   |   | hadoop.hdfs.server.mover.TestMover |
   |   | 
hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaPlacement |
   |   | 
hadoop.hdfs.server.namenode.sps.TestStoragePolicySatisfierWithStripedFile |
   |   | hadoop.hdfs.server.balancer.TestBalancerLongRunningTasks |
   |   | hadoop.hdfs.server.diskbalancer.TestDiskBalancer |
   |   | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyWriter |
   |   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
   
   
   | Subsystem | Report/Notes |
   

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=627498&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627498
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 26/Jul/21 02:33
Start Date: 26/Jul/21 02:33
Worklog Time Spent: 10m 
  Work Description: zhihaixu2012 commented on a change in pull request 
#3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r676253601



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   I just updated the patch with the default value 1GB.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 627498)
Time Spent: 1h 50m  (was: 1h 40m)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volume on a lot of datanodes, which cause some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrading 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of this exceptions happened before the 
> volumed failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in the datanode, it will add some extra 
> storage. The extra storage added is all new directories created in 
> /current//current, although all block data file and block meta data 
> file are hard-linked with /current//previous 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=627344&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627344
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 23/Jul/21 23:38
Start Date: 23/Jul/21 23:38
Worklog Time Spent: 10m 
  Work Description: zhihaixu2012 commented on a change in pull request 
#3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r675911485



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   @ywskycn thanks for the review! Yes, good suggestion; I think 1 GB is a 
good default value in this case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 627344)
Time Spent: 1h 40m  (was: 1.5h)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volume on a lot of datanodes, which cause some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrading 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of this exceptions happened before the 
> volumed failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in the datanode, it will add some extra 
> storage. The extra storage added is all new directories created in 
> /current//current, although all block data file and block meta 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=627328&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627328
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 23/Jul/21 22:28
Start Date: 23/Jul/21 22:28
Worklog Time Spent: 10m 
  Work Description: ywskycn commented on a change in pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r675894217



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   4 MB looks too small here, given that the DataXceiver thread count is 
configurable. How about 1 GB? @zhihaixu2012, @jojochuang 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 627328)
Time Spent: 1.5h  (was: 1h 20m)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volume on a lot of datanodes, which cause some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrading 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of this exceptions happened before the 
> volumed failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in the datanode, it will add some extra 
> storage. The extra storage added is all new directories created in 
> /current//current, although all block data file and 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=627317&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627317
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 23/Jul/21 21:26
Start Date: 23/Jul/21 21:26
Worklog Time Spent: 10m 
  Work Description: zhihaixu2012 commented on a change in pull request 
#3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r675873861



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   No, it won't cause an issue; 10 GB is fine for practical usage. But I am 
afraid it may cause some problems if people test Hadoop in local mode with less 
than 10 GB of available space on the local disks without changing this 
configuration value. Either 10 GB or 4 MB works for me. Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 627317)
Time Spent: 1h 20m  (was: 1h 10m)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volume on a lot of datanodes, which cause some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrading 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of this exceptions happened before the 
> volumed failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=627298&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627298
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 23/Jul/21 19:41
Start Date: 23/Jul/21 19:41
Worklog Time Spent: 10m 
  Work Description: ywskycn commented on a change in pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r675827307



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   So if we pick 10 GB or another larger value, it may cause some issues 
for some overloaded DNs when applying this patch, right?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 627298)
Time Spent: 1h 10m  (was: 1h)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volume on a lot of datanodes, which cause some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrading 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of this exceptions happened before the 
> volumed failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in the datanode, it will add some extra 
> storage. The extra storage added is all new directories created in 
> /current//current, although all block data file 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=627293&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-627293
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 23/Jul/21 19:18
Start Date: 23/Jul/21 19:18
Worklog Time Spent: 10m 
  Work Description: zhihaixu2012 commented on a change in pull request 
#3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r670804904



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   Yes, good suggestion. I think a reasonable default could be 10 GB, 
which is the same as the default value of 
dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold, 
since AvailableSpaceVolumeChoosingPolicy uses that value as the maximum 
allowed difference between balanced disks. On the other hand, for better backward 
compatibility, I think the default value needs to be more conservative. The default 
maximum number of DataXceiver threads is 4096 and the average size of a 
metadata file should be less than 1 KB; 4096 multiplied by 1 KB is 4 MB. I 
think the better default may be 4 MB for backward compatibility. What is 
your suggestion? Thanks
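A tiny sketch of the back-of-the-envelope estimate above; the 1 KB average metadata-file size is the assumption stated in the comment, and 4096 is the default maximum DataXceiver thread count it refers to.

```java
public class MarginEstimateSketch {
  public static void main(String[] args) {
    long maxXceiverThreads = 4096;   // default max DataXceiver threads
    long avgMetaFileBytes = 1024;    // assumed ~1 KB per metadata file
    long conservativeMargin = maxXceiverThreads * avgMetaFileBytes;
    // 4096 * 1024 bytes = 4 MiB of head-room for in-flight metadata files
    System.out.println(conservativeMargin / (1024 * 1024) + " MB");
  }
}
```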




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 627293)
Time Spent: 1h  (was: 50m)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volume on a lot of datanodes, which cause some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrading 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of this exceptions happened before the 
> volumed failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=623307&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-623307
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 15/Jul/21 21:12
Start Date: 15/Jul/21 21:12
Worklog Time Spent: 10m 
  Work Description: zhihaixu2012 commented on a change in pull request 
#3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r670804904



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   Yes, good suggestion. The default I want to set is 10 GB, which is the same 
as the default value of 
dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold, 
since AvailableSpaceVolumeChoosingPolicy uses that default as the maximum 
allowed difference between balanced disks. I think this value can safeguard the 
disk space without wasting much storage. What is your suggestion? Thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 623307)
Time Spent: 50m  (was: 40m)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=623304&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-623304
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 15/Jul/21 21:05
Start Date: 15/Jul/21 21:05
Worklog Time Spent: 10m 
  Work Description: zhihaixu2012 commented on a change in pull request 
#3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r670804904



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   Yes, good suggestion. The default I want to set is 10 GB, which is the same 
as the default value of 
dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold, 
since AvailableSpaceVolumeChoosingPolicy uses that default as the maximum 
allowed difference between balanced disks. I think this value can safeguard the 
disk space without wasting much storage.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 623304)
Time Spent: 40m  (was: 0.5h)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=622967&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-622967
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 15/Jul/21 10:41
Start Date: 15/Jul/21 10:41
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on a change in pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#discussion_r670347032



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -2657,6 +2657,17 @@
   
 
 
+<property>
+  <name>dfs.datanode.round-robin-volume-choosing-policy.additional-available-space</name>
+  <value>0</value>

Review comment:
   Do you have a good default number in mind? IMO, if this configuration 
potentially prevents a big problem (a full disk), we should consider picking a 
reasonably good default, as long as it isn't too big.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 622967)
Time Spent: 0.5h  (was: 20m)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First the upgrade process added some extra disk storage on the each disk 
> volume of the data node:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
>  is the main upgrade function in the datanode, it will add some extra 
> storage. The extra storage added is all new 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=618544&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618544
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 05/Jul/21 06:31
Start Date: 05/Jul/21 06:31
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#issuecomment-873841548


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 51s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  30m 57s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 22s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 17s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  6s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 24s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m  8s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m  2s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 12s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  8s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m  8s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 57s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 6 new + 461 unchanged 
- 0 fixed = 467 total (was 461)  |
   | +1 :green_heart: |  mvnsite  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  xml  |   0m  2s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  javadoc  |   0m 46s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m  8s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  16m  5s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 388m 19s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 46s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 472m 49s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS |
   |   | hadoop.hdfs.web.TestWebHdfsFileSystemContract |
   |   | 
hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
   |   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
   |   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   |   | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
   |   | 
hadoop.fs.viewfs.TestViewFileSystemOverloadSchemeHdfsFileSystemContract |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3175 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 

[jira] [Work logged] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=618459&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618459
 ]

ASF GitHub Bot logged work on HDFS-16111:
-

Author: ASF GitHub Bot
Created on: 04/Jul/21 22:36
Start Date: 04/Jul/21 22:36
Worklog Time Spent: 10m 
  Work Description: zhihaixu2012 opened a new pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175


   …avoid failed volumes at datanodes.
   
   Change-Id: Iead25812d4073e3980893e3e76f7d2b03b57442a
   
   JIRA: https://issues.apache.org/jira/browse/HDFS-16111
   
   There is a potential bug when picking a disk volume to write a new block 
file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy, and the 
code that selects a disk checks whether the available space on the selected 
disk is larger than the size in bytes of the block file to store 
(https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86).
 But when creating a new block, two files are created: one is the block file 
blk_, the other is the block metadata file blk__.meta. This is the code that 
runs when finalizing a block, where both the block file size and the metadata 
file size are updated: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391
 The current code only considers the size of the block file and ignores the 
size of the block metadata file when choosing a disk in 
RoundRobinVolumeChoosingPolicy. There can be a lot of on-going blocks being 
received at the same time; the default maximum number of DataXceiver threads is 
4096. This underestimates the total space needed to write a block, which can 
cause a disk full error (No space left on device) when writing a replica.
   
   Since the size of the block metadata file is not fixed, I suggest adding a 
configuration 
(dfs.datanode.round-robin-volume-choosing-policy.additional-available-space) to 
safeguard the disk space when choosing a volume to write new block data in 
RoundRobinVolumeChoosingPolicy.
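
   As a rough, hypothetical sketch of this idea (not the actual patch, and much 
simpler than the real RoundRobinVolumeChoosingPolicy, which operates on 
FsVolumeSpi), the availability check could reserve a configurable margin on top 
of the requested block size:

```java
import java.io.IOException;
import java.util.List;

// Hypothetical, simplified round-robin chooser illustrating the proposed
// "additional available space" safeguard; all names here are illustrative.
class RoundRobinWithMarginSketch<V extends RoundRobinWithMarginSketch.Volume> {
  interface Volume {
    long getAvailable() throws IOException;
  }

  // Extra bytes reserved on each selection, e.g. to cover the block metadata
  // file written alongside the block file; this is the value the proposed
  // configuration key would supply.
  private final long additionalAvailableSpace;
  private int curVolume = 0;

  RoundRobinWithMarginSketch(long additionalAvailableSpace) {
    this.additionalAvailableSpace = additionalAvailableSpace;
  }

  V chooseVolume(List<V> volumes, long blockSize) throws IOException {
    if (volumes.isEmpty()) {
      throw new IOException("No volumes available");
    }
    int startVolume = curVolume % volumes.size();
    int i = startVolume;
    do {
      V volume = volumes.get(i);
      i = (i + 1) % volumes.size();
      // Require room for the block itself plus the configured safety margin,
      // instead of checking the block size alone.
      if (volume.getAvailable() >= blockSize + additionalAvailableSpace) {
        curVolume = i;
        return volume;
      }
    } while (i != startVolume);
    throw new IOException("Out of space: every volume has less than "
        + (blockSize + additionalAvailableSpace) + " bytes available");
  }
}
```

   With additionalAvailableSpace left at 0, this degenerates to the existing 
size-only check, which is why 0 preserves backward compatibility as a default.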
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 618459)
Remaining Estimate: 0h
Time Spent: 10m

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although later on we recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on the failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client  at /[XX.XX.XX.XX:XXX|http://10.104.103.159:33986/] 
> [Receiving block BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] 
> datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
>