[
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15031954#comment-15031954
]
Kihwal Lee commented on HDFS-8791:
----------------------------------
This is what I saw on the upgraded node before it got finalized. Before the
upgrade, {{current/finalized}} contained many subdirectories.
{noformat}
-bash-4.1$ ls -l /xxx/data/current/BP-xxxxx/previous/finalized
total 4
drwxr-xr-x 115 hdfs users 4096 Nov 24 23:01 subdir0
{noformat}
This is what I saw in the log.
{noformat}
2015-11-24 23:06:09,980 INFO common.Storage: Upgrading block pool storage directory /xxx/data/current/BP-xxxxx.
   old LV = -56; old CTime = 0.
   new LV = -57; new CTime = 0
2015-11-24 23:06:11,625 INFO common.Storage: HardLinkStats: 116 Directories, including 3 Empty Directories, 57282 single Link operations, 0 multi-Link operations, linking 0 files, total 57282 linkable files. Also physically copied 0 other files.
2015-11-24 23:06:11,671 INFO common.Storage: Upgrade of block pool BP-xxxxx at /xxx/data/current/BP-xxxxx is complete
{noformat}
I just noticed the time stamp of {{subdir0}} is old, so were the empty
directories removed? I will test again to see whether that is the case. But I
thought {{current}} eventually becomes {{previous}} after the hard links are
created, so even the empty dirs should have been left intact.
> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.6.0, 2.8.0, 2.7.1
> Reporter: Nathan Roberts
> Assignee: Chris Trezzo
> Priority: Critical
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf,
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch
>
>
> We are seeing cases where the new directory layout causes the datanode to
> keep the disks seeking for tens of minutes at a time. This can happen when
> the datanode is running du, and also when it is performing a checkDirs().
> Both of these operations currently scan all directories in the block pool,
> and that is very expensive in the new layout.
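>
> To make the cost concrete, here is a hypothetical sketch (plain Java, not
> the actual du/checkDirs code) of the kind of full-tree walk both operations
> imply; every one of the ~64K leaf directories has to be opened and read,
> one at a time:
> {code:java}
> import java.io.File;
>
> class BlockPoolScan {
>   // Recursively count files under dir, touching every directory block once.
>   static long walk(File dir) {
>     File[] children = dir.listFiles(); // one (possibly cold) directory read per dir
>     if (children == null) return 0;
>     long files = 0;
>     for (File c : children) {
>       files += c.isDirectory() ? walk(c) : 1;
>     }
>     return files;
>   }
> }
> {code}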
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K
> leaf directories where block files are placed.
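>
> As an illustration (a sketch modeled on {{DatanodeUtil.idToBlockDir}}, with
> the 0xFF masks assumed from the 256x256 layout described here, not the exact
> Hadoop code), this is roughly how a block ID maps into that two-level tree:
> {code:java}
> import java.io.File;
>
> class BlockDirLayout {
>   // Sketch: pick the leaf directory for a block in the 256x256 layout.
>   static File idToBlockDir(File root, long blockId) {
>     int d1 = (int) ((blockId >> 16) & 0xFF); // one of 256 first-level dirs
>     int d2 = (int) ((blockId >> 8) & 0xFF);  // one of 256 second-level dirs
>     return new File(root, "subdir" + d1 + File.separator + "subdir" + d2);
>   }
> }
> {code}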
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks.
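> To put a rough number on that (assuming ext4's default 4 KB block size and
> at least one block per directory):
> {noformat}
> 256 * 256     = 65,536 second-level directories
> 65,536 * 4 KB ≈ 256 MB of directory blocks that must stay cached
> {noformat}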
> inodes and dentries will be cached by Linux, and one can configure how likely
> the system is to prune those entries ({{vfs_cache_pressure}}; the default is
> 100, and lower values make the kernel retain them longer). However, ext4
> relies on the buffer cache to cache the directory blocks, and I'm not aware
> of any way to tell Linux to favor buffer cache pages (even if there were, I'm
> not sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume,
> which basically means the 64K directory blocks are probably randomly spread
> across the entire disk. A du-type scan looks at directories one at a time, so
> the I/O scheduler can't optimize the corresponding seeks, meaning the seeks
> will be random and far apart.
> On a system I was using to diagnose this, I had 60K blocks. A du when things
> are hot takes less than 1 second; when things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer
> cache out, causing the next DU to hit this situation. We are seeing cases
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout had
> only a few hundred directory blocks. Even when completely cold, these would
> only take a few hundred seeks, which would mean single-digit seconds.
> - With only a few hundred directories, the odds of a given directory block
> getting modified are quite high, which keeps those blocks hot and much less
> likely to be evicted.
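>
> Back-of-the-envelope numbers (assuming roughly 10 ms per random seek on a
> spinning disk) line up with both observations:
> {noformat}
> previous layout: a few hundred directory blocks * 10 ms ≈ 2-3 seconds
> new layout:      ~65,536 cold directory reads   * 10 ms ≈ 655 s ≈ 11 minutes
> {noformat}
> which is the same order of magnitude as the ~20 minute cold du above.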
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)