[ https://issues.apache.org/jira/browse/HDFS-16111?focusedWorklogId=618544&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618544 ]
ASF GitHub Bot logged work on HDFS-16111:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 05/Jul/21 06:31
Start Date: 05/Jul/21 06:31
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #3175:
URL: https://github.com/apache/hadoop/pull/3175#issuecomment-873841548
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 51s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 30m 57s | | trunk passed |
| +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 :green_heart: | compile | 1m 17s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 :green_heart: | checkstyle | 1m 6s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 24s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 57s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 :green_heart: | javadoc | 1m 28s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 :green_heart: | spotbugs | 3m 8s | | trunk passed |
| +1 :green_heart: | shadedclient | 16m 2s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 14s | | the patch passed |
| +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 :green_heart: | javac | 1m 12s | | the patch passed |
| +1 :green_heart: | compile | 1m 8s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 :green_heart: | javac | 1m 8s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 57s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 6 new + 461 unchanged - 0 fixed = 467 total (was 461) |
| +1 :green_heart: | mvnsite | 1m 12s | | the patch passed |
| +1 :green_heart: | xml | 0m 2s | | The patch has no ill-formed XML file. |
| +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 :green_heart: | javadoc | 1m 23s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 :green_heart: | spotbugs | 3m 8s | | the patch passed |
| +1 :green_heart: | shadedclient | 16m 5s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 388m 19s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. |
| | | 472m 49s | | |
| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS |
| | hadoop.hdfs.web.TestWebHdfsFileSystemContract |
| | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
| | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
| | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
| | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
| | hadoop.fs.viewfs.TestViewFileSystemOverloadSchemeHdfsFileSystemContract |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/3175 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell xml |
| uname | Linux 8b4e5220db5c 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / c00533cd419136ff13cf5d039306ffaccd33fe4a |
| Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/1/testReport/ |
| Max. process+thread count | 2852 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3175/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
This message was automatically generated.
Issue Time Tracking
-------------------
Worklog Id: (was: 618544)
Time Spent: 20m (was: 10m)
> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes
> at datanodes.
> -------------------------------------------------------------------------------------------
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Zhihai Xu
> Assignee: Zhihai Xu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got
> failed volumes on a lot of datanodes, which caused some missing blocks at that
> time. Although we later recovered all the missing blocks by symlinking the
> path (dfs/dn/current) on the failed volume to a new directory and copying all
> the data to the new directory, we missed our SLA and it delayed our upgrade
> process on our production cluster by several hours.
> When this issue happened, we saw a lot of these exceptions before the
> volume failed on the datanode:
> [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block
> BP-XXXXXX-XX.XX.XX.XX-XXXXXX:blk_XXXXX_XXXXXXX]] datanode.DataNode
> (BlockReceiver.java:<init>(289)) - IOException in BlockReceiver constructor:
> Possible disk error: Failed to create
> /XXXXXXX/dfs/dn/current/BP-XXXXXX-XX.XX.XX.XX-XXXXXXXXX/tmp/blk_XXXXXX. Cause is
> java.io.IOException: No space left on device
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1012)
> at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
> at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
> at java.lang.Thread.run(Thread.java:748)
>
> We found this issue happened for the following two reasons:
> First, the upgrade process added some extra disk usage on each disk
> volume of the datanode. BlockPoolSliceStorage.doUpgrade
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
> is the main upgrade function in the datanode, and it adds some extra
> storage. The extra storage consists of new directories created under
> /current/<bpid>/current, although all block data files and block metadata
> files are hard-linked from /current/<bpid>/previous after the upgrade. Since
> a lot of new directories are created, this consumes some disk space on
> each disk volume.
>
> Second, there is a potential bug when picking a disk volume to write a new
> block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy,
> and the code that selects a disk checks whether the available space on the
> selected disk is more than the size in bytes of the block file to store
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86).
> But when creating a new block, two files are created: one is the block file
> blk_XXXX, the other is the block metadata file blk_XXXX_XXXX.meta. This is
> the code that finalizes a block, where both the block file size and the
> metadata file size are updated:
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391
> The current code only considers the size of the block file and doesn't
> consider the size of the block metadata file when choosing a disk in
> RoundRobinVolumeChoosingPolicy. There can be a lot of on-going blocks being
> received at the same time; the default maximum number of DataXceiver threads
> is 4096. This underestimates the total space needed to write a block, which
> can cause the above disk-full error (No space left on device), as sketched
> below.
>
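> A minimal sketch of the selection logic described above, assuming a
> simplified, hypothetical volume type (SimpleVolume) and chooser class
> (RoundRobinChooserSketch) rather than the actual FsVolumeSpi and
> RoundRobinVolumeChoosingPolicy code; the additionalAvailableSpace field
> models the proposed safeguard and does not exist in current Hadoop:
> ```java
> import java.io.IOException;
> import java.util.List;
>
> // Hypothetical stand-in for a datanode volume; only the fields needed for
> // the space check are modelled here.
> class SimpleVolume {
>   final String path;
>   final long availableBytes;
>   SimpleVolume(String path, long availableBytes) {
>     this.path = path;
>     this.availableBytes = availableBytes;
>   }
> }
>
> class RoundRobinChooserSketch {
>   private int curVolume = 0;
>   // Proposed safeguard; 0 keeps today's behaviour of checking only blockSize.
>   private long additionalAvailableSpace = 0L;
>
>   SimpleVolume chooseVolume(List<SimpleVolume> volumes, long blockSize)
>       throws IOException {
>     int startVolume = curVolume;
>     while (true) {
>       SimpleVolume volume = volumes.get(curVolume);
>       curVolume = (curVolume + 1) % volumes.size();
>       // Today only blockSize is checked; the extra margin also leaves room
>       // for the blk_XXXX_XXXX.meta file and for other blocks being written
>       // concurrently by up to 4096 DataXceiver threads.
>       if (volume.availableBytes >= blockSize + additionalAvailableSpace) {
>         return volume;
>       }
>       if (curVolume == startVolume) {
>         throw new IOException("Out of space: no volume has "
>             + (blockSize + additionalAvailableSpace) + " bytes available");
>       }
>     }
>   }
> }
> ```
>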
> Since the size of the block metadata file is not fixed, I suggest adding a
> configuration
> (dfs.datanode.round-robin-volume-choosing-policy.additional-available-space)
> to safeguard the disk space when choosing a volume to write new block
> data in RoundRobinVolumeChoosingPolicy.
> The default value can be 0 for backward compatibility.
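>
> If the proposal is adopted, the key could be set like any other datanode
> setting. A small, hedged example using Hadoop's Configuration API follows;
> the property name is the one proposed above (it does not exist in released
> Hadoop yet) and the 1 GB value is purely illustrative:
> ```java
> import org.apache.hadoop.conf.Configuration;
>
> public class AdditionalAvailableSpaceExample {
>   // Proposed key from this issue; not present in released Hadoop.
>   static final String KEY =
>       "dfs.datanode.round-robin-volume-choosing-policy.additional-available-space";
>
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // Reserve an extra 1 GB per volume when choosing where to write a block;
>     // the default of 0 preserves the current behaviour.
>     conf.setLong(KEY, 1024L * 1024 * 1024);
>     System.out.println(KEY + " = " + conf.getLong(KEY, 0L));
>   }
> }
> ```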