[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027993#comment-15027993 ]

Chris Trezzo commented on HDFS-8791:
------------------------------------

Thanks all for the comments.

[~cmccabe]
bq. how long did these upgrades take in the 0.5 million, 1.2 million, and 2.7 
million block cases?
I have attached a [new 
version|https://issues.apache.org/jira/secure/attachment/12774454/32x32DatanodeLayoutTesting-v2.pdf]
of the testing document with more details around the upgrade testing. The
upgrade in the case above was on a setup with very low block density, and the
datanode upgraded with a startup time of about 1 minute. I also did an upgrade
test with a datanode that had around 2 million blocks in total; in that case
the hard linking alone took around 9 minutes.
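For a rough sense of the rate (assuming each block contributes a block file
plus a .meta file, so roughly 4 million hard links): 9 minutes is about 540
seconds, which works out to on the order of 7,000-8,000 links per second, i.e.
the time is dominated by per-file metadata operations rather than data
movement.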

[~andrew.wang] Agreed. I think it would be awesome if we could get this into 
branch-2 and I am currently in the process of adding unit tests.

[~kihwal] I took a look at a node during upgrade, and it seemed like the 
{{previous.tmp}} directory did indeed have the old layout like it should. Maybe 
I am misunderstanding which directory you are looking at, so I will continue to 
investigate.

As a side note: I am still early in my search, but I can't seem to find where 
in a unit test we actually verify that the {{finalized}} directory does indeed 
have the correct layout after an upgrade. The same goes for verifying that the
{{previous.tmp}} directory actually has the old format during an upgrade that
has not been finalized yet. I see
{{TestDatanodeLayoutUpgrade#testUpgradeToIdBasedLayout}}, but a {{null}}
verifier is passed to the {{upgradeAndVerify}} method. Additionally, all of
the other tests (e.g. TestDFSFinalize, TestRollingUpgradeRollback,
TestRollingUpgrade) seem to either be layout-agnostic or simply check the
{{VERSION}} file. I will continue to investigate.
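
To make that concrete, here is a rough sketch of the kind of check a
non-{{null}} verifier passed to {{upgradeAndVerify}} could perform. The class
and method names are just placeholders (not existing test code), and it
assumes I am reading the current 256x256 layout rule correctly (first level
from bits 16-23 of the block ID, second level from bits 8-15):

{code:java}
import java.io.File;
import java.io.IOException;

/**
 * Hypothetical sketch of a layout check for a block pool's "finalized"
 * directory after upgrade: every block file must live in the leaf directory
 * implied by its block ID.
 */
public class BlockLayoutVerifier {

  // Assumes the 256x256 layout: first level from bits 16-23 of the block ID,
  // second level from bits 8-15. A 32x32 layout would mask with 0x1F instead
  // of 0xFF.
  static File expectedDir(File finalizedRoot, long blockId) {
    int d1 = (int) ((blockId >> 16) & 0xFF);
    int d2 = (int) ((blockId >> 8) & 0xFF);
    return new File(new File(finalizedRoot, "subdir" + d1), "subdir" + d2);
  }

  /** Throws if a block file is found outside its expected leaf directory. */
  static void verify(File finalizedRoot) throws IOException {
    walk(finalizedRoot, finalizedRoot);
  }

  private static void walk(File finalizedRoot, File dir) throws IOException {
    File[] children = dir.listFiles();
    if (children == null) {
      return;
    }
    for (File f : children) {
      if (f.isDirectory()) {
        walk(finalizedRoot, f);
      } else if (f.getName().startsWith("blk_")
          && !f.getName().endsWith(".meta")) {
        long blockId = Long.parseLong(f.getName().substring("blk_".length()));
        File expected = expectedDir(finalizedRoot, blockId);
        if (!f.getParentFile().getCanonicalFile()
            .equals(expected.getCanonicalFile())) {
          throw new IOException("Block " + blockId + " found in "
              + f.getParentFile() + " but expected in " + expected);
        }
      }
    }
  }
}
{code}

A similar walk over {{previous.tmp}} could assert the pre-upgrade layout
instead, which would also cover the directory [~kihwal] was looking at.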

> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8791
>                 URL: https://issues.apache.org/jira/browse/HDFS-8791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0, 2.8.0, 2.7.1
>            Reporter: Nathan Roberts
>            Assignee: Chris Trezzo
>            Priority: Critical
>         Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> keep the disks seeking for tens of minutes. This can happen when the datanode 
> is running du, and also when it is performing a checkDirs(). Both of these 
> operations currently scan all directories in the block pool, and that is very 
> expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> Inodes and dentries will be cached by Linux, and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks, and I'm not aware of 
> any way to tell Linux to favor buffer cache pages (and even if there were, I'm 
> not sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> which basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du-type scan looks at directories one at a 
> time, so the I/O scheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far apart. 
> In a system I was using to diagnose this, I had 60K blocks. A du when things 
> are hot takes less than 1 second; when things are cold, about 20 minutes.
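> (As a rough sanity check: if we assume something like 10-20 ms per random 
> seek, 64K cold directory blocks works out to roughly 11-22 minutes of pure 
> seek time, which is in the same ballpark as what I observed.)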
> How do things get cold?
> - A large set of tasks runs on the node. This pushes almost all of the buffer 
> cache out, causing the next du to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified are quite high, which keeps those blocks hot and much less 
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
