[
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15031954#comment-15031954
]
Kihwal Lee commented on HDFS-8791:
----------------------------------
This is what I saw on the upgraded node before it got finalized. Before the
upgrade, {{current/finalized}} contained many subdirectories.
{noformat}
-bash-4.1$ ls -l /xxx/data/current/BP-xxxxx/previous/finalized
total 4
drwxr-xr-x 115 hdfs users 4096 Nov 24 23:01 subdir0
{noformat}
This is what I saw in the log.
{noformat}
2015-11-24 23:06:09,980 INFO common.Storage: Upgrading block pool storage directory /xxx/data/current/BP-xxxxx.
   old LV = -56; old CTime = 0.
   new LV = -57; new CTime = 0
2015-11-24 23:06:11,625 INFO common.Storage: HardLinkStats: 116 Directories, including 3 Empty Directories, 57282 single Link operations, 0 multi-Link operations, linking 0 files, total 57282 linkable files. Also physically copied 0 other files.
2015-11-24 23:06:11,671 INFO common.Storage: Upgrade of block pool BP-xxxxx at /xxx/data/current/BP-xxxxx is complete
{noformat}
I just noticed the time stamp of {{subdir0}} is old, so were the empty
directories removed? I will test again to see whether that is the case. But I
thought {{current}} eventually becomes {{previous}} after the hard links are
created, so even the empty dirs should have been left intact.
> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.6.0, 2.8.0, 2.7.1
> Reporter: Nathan Roberts
> Assignee: Chris Trezzo
> Priority: Critical
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf,
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch
>
>
> We are seeing cases where the new directory layout causes the datanode to
> keep the disks seeking for tens of minutes at a time. This can happen when
> the datanode is running du, and also when it is performing a checkDirs().
> Both of these operations currently scan all directories in the block pool,
> and that is very expensive in the new layout.
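>
> To make the cost concrete, here is a hypothetical sketch (plain Java, not
> the actual du/checkDirs code) of the kind of full-tree walk both operations
> imply; every one of the ~64K leaf directories has to be opened and read,
> one at a time:
> {code:java}
> import java.io.File;
>
> class BlockPoolScan {
>   // Recursively count files under dir, touching every directory block once.
>   static long walk(File dir) {
>     File[] children = dir.listFiles(); // one (possibly cold) directory read per dir
>     if (children == null) return 0;
>     long files = 0;
>     for (File c : children) {
>       files += c.isDirectory() ? walk(c) : 1;
>     }
>     return files;
>   }
> }
> {code}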
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K
> leaf directories where block files are placed.
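>
> As an illustration (a sketch modeled on {{DatanodeUtil.idToBlockDir}}, with
> the 0xFF masks assumed from the 256x256 layout described here, not the exact
> Hadoop code), this is roughly how a block ID maps into that two-level tree:
> {code:java}
> import java.io.File;
>
> class BlockDirLayout {
>   // Sketch: pick the leaf directory for a block in the 256x256 layout.
>   static File idToBlockDir(File root, long blockId) {
>     int d1 = (int) ((blockId >> 16) & 0xFF); // one of 256 first-level dirs
>     int d2 = (int) ((blockId >> 8) & 0xFF);  // one of 256 second-level dirs
>     return new File(root, "subdir" + d1 + File.separator + "subdir" + d2);
>   }
> }
> {code}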
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks.
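> To put a rough number on that (assuming ext4's default 4 KB block size and
> at least one block per directory):
> {noformat}
> 256 * 256     = 65,536 second-level directories
> 65,536 * 4 KB ≈ 256 MB of directory blocks that must stay cached
> {noformat}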
> inodes and dentries will be cached by Linux, and one can configure how likely
> the system is to prune those entries ({{vfs_cache_pressure}}; the default is
> 100, and lower values make the kernel retain them longer). However, ext4
> relies on the buffer cache to cache the directory blocks, and I'm not aware
> of any way to tell Linux to favor buffer cache pages (even if there were, I'm
> not sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume,
> which basically means the 64K directory blocks are probably randomly spread
> across the entire disk. A du-type scan looks at directories one at a time, so
> the I/O scheduler can't optimize the corresponding seeks, meaning the seeks
> will be random and far apart.
> On a system I was using to diagnose this, I had 60K blocks. A du when things
> are hot takes less than 1 second; when things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer
> cache out, causing the next DU to hit this situation. We are seeing cases
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout had
> only a few hundred directory blocks. Even when completely cold, these would
> only take a few hundred seeks, which would mean single-digit seconds.
> - With only a few hundred directories, the odds of a given directory block
> getting modified are quite high, which keeps those blocks hot and much less
> likely to be evicted.
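>
> Back-of-the-envelope numbers (assuming roughly 10 ms per random seek on a
> spinning disk) line up with both observations:
> {noformat}
> previous layout: a few hundred directory blocks * 10 ms ≈ 2-3 seconds
> new layout:      ~65,536 cold directory reads   * 10 ms ≈ 655 s ≈ 11 minutes
> {noformat}
> which is the same order of magnitude as the ~20 minute cold du above.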
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)