[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027993#comment-15027993 ]
Chris Trezzo commented on HDFS-8791:
------------------------------------
Thanks all for the comments.
[~cmccabe]
bq. how long did these upgrades take in the 0.5 million, 1.2 million, and 2.7
million block cases?
I have attached a [new
version|https://issues.apache.org/jira/secure/attachment/12774454/32x32DatanodeLayoutTesting-v2.pdf]
of the testing document with more details around the upgrade testing. The
upgrade for the above case ran on a setup with very low block density, and the
data node upgraded with a startup time of 1 minute. I also ran an upgrade test
on a data node that had around 2 million blocks in total. In that case the hard
linking alone took around 9 minutes.
[~andrew.wang] Agreed. I think it would be awesome if we could get this into
branch-2 and I am currently in the process of adding unit tests.
[~kihwal] I took a look at a node during upgrade, and it seemed like the
{{previous.tmp}} directory did indeed have the old layout like it should. Maybe
I am misunderstanding which directory you are looking at, so I will continue to
investigate.
As a side note: I am still early in my search, but I can't seem to find where
in a unit test we actually verify that the {{finalized}} directory has the
correct layout after an upgrade. The same goes for verifying that the
{{previous.tmp}} directory actually has the old format during an upgrade that
isn't finalized yet. I see
{{TestDatanodeLayoutUpgrade#testUpgradeToIdBasedLayout}}, but a {{null}}
verifier is passed in to the {{upgradeAndVerify}} method. Additionally, all of
the other tests (i.e. {{TestDFSFinalize}}, {{TestRollingUpgradeRollback}},
{{TestRollingUpgrade}}) seem to either be layout agnostic or simply check the
{{VERSION}} file. I will continue to investigate.
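A layout check of the kind described above could look roughly like the following. This is only a sketch under stated assumptions: {{LayoutCheckSketch}} and its methods are hypothetical names, not existing Hadoop test code; block files are assumed to be named {{blk_<id>}}; and the ID-to-directory mapping (two bytes of the block ID, masked) is an assumption that should be checked against the datanode's own mapping code before relying on it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: checks block placement under a
// "finalized" (or "previous.tmp/finalized") directory.
class LayoutCheckSketch {
    // Assumed block file name "blk_<id>"; meta files do not match and are skipped.
    private static final Pattern BLOCK_FILE = Pattern.compile("blk_(-?\\d+)");

    // Assumed mapping: two bytes of the block ID select the two subdir levels.
    // mask = 0xFF models a 256x256 layout, mask = 0x1F a 32x32 layout.
    static Path expectedDir(long blockId, int mask) {
        int d1 = (int) ((blockId >> 16) & mask);
        int d2 = (int) ((blockId >> 8) & mask);
        return Paths.get("subdir" + d1, "subdir" + d2);
    }

    // True if a block file path (relative to the finalized dir) sits in the
    // directory its ID maps to. Non-block files are ignored (treated as fine).
    static boolean isCorrectlyPlaced(Path relativeFile, int mask) {
        Matcher m = BLOCK_FILE.matcher(relativeFile.getFileName().toString());
        if (!m.matches()) {
            return true;
        }
        long blockId = Long.parseLong(m.group(1));
        return expectedDir(blockId, mask).equals(relativeFile.getParent());
    }

    // Walks the directory tree and returns every block file that is misplaced
    // for the layout described by the mask.
    static List<Path> misplacedBlocks(Path finalizedDir, int mask) {
        try (Stream<Path> files = Files.walk(finalizedDir)) {
            return files.filter(Files::isRegularFile)
                        .map(finalizedDir::relativize)
                        .filter(p -> !isCorrectlyPlaced(p, mask))
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test could then assert that {{misplacedBlocks(finalized, 0x1F)}} is empty after the upgrade, and that {{misplacedBlocks(previousTmpFinalized, 0xFF)}} is empty while the upgrade is not yet finalized. Both directory arguments are placeholders for whatever paths the test harness exposes.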
> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.6.0, 2.8.0, 2.7.1
> Reporter: Nathan Roberts
> Assignee: Chris Trezzo
> Priority: Critical
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf,
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch
>
>
> We are seeing cases where the new directory layout causes the datanode to
> keep the disks seeking for tens of minutes. This can happen when the
> datanode is running du, and also when it is performing a checkDirs(). Both
> of these operations currently scan all directories in the block pool, and
> that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks.
> inodes and dentries will be cached by linux and one can configure how likely
> the system is to prune those entries (vfs_cache_pressure). However, ext4
> relies on the buffer cache to cache the directory blocks and I'm not aware of
> any way to tell linux to favor buffer cache pages (even if it did I'm not
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume.
> This basically means the 64K directory blocks are probably randomly spread
> across the entire disk. A du type scan will look at directories one at a
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the
> seeks will be random and far.
> In a system I was using to diagnose this, I had 60K blocks. A DU when things
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer
> cache out, causing the next DU to hit this situation. We are seeing cases
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout would
> be a few hundred directory blocks. Even when completely cold, these would
> only take a few hundred seeks, which would mean single digit seconds.
> - With only a few hundred directories, the odds of the directory blocks
> getting modified are quite high; this keeps those blocks hot and much less
> likely to be evicted.
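
The directory-count reasoning in the description above can be sketched as a back-of-envelope calculation. The per-seek cost is an assumption (not a measured value from this issue), {{SeekEstimateSketch}} is a hypothetical name, and the model charges exactly one random seek per directory block, which ignores inode reads and any caching.

```java
// Back-of-envelope sketch; the seek time passed in is an assumption.
class SeekEstimateSketch {
    // Number of leaf directories in a two-level layout with the given fanout,
    // e.g. 256 subdirs each holding 256 subdirs.
    static long leafDirs(int fanout) {
        return (long) fanout * fanout;
    }

    // Estimated seconds for a fully cold scan, charging one random seek
    // per directory block.
    static double coldScanSeconds(long dirBlocks, double seekMillis) {
        return dirBlocks * seekMillis / 1000.0;
    }
}
```

Assuming 10 ms per random seek, {{coldScanSeconds(leafDirs(256), 10.0)}} comes to about 655 seconds (roughly 11 minutes), the same order of magnitude as the 20 minutes reported for a cold du. A few hundred directory blocks at the same assumed cost come to a few seconds, in line with the "single digit seconds" estimate for the previous layout. A 32x32 layout, as suggested by the attachment names, would have {{leafDirs(32)}} = 1024 leaf directories.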
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)