[
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694504#comment-14694504
]
Joep Rottinghuis commented on HDFS-8791:
----------------------------------------
Seems related to (or perhaps dup of) HADOOP-10434.
> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.6.0
> Reporter: Nathan Roberts
> Priority: Critical
>
> We are seeing cases where the new directory layout causes the datanode to
> basically cause the disks to seek for 10s of minutes. This can be when the
> datanode is running du, and it can also be when it is performing a
> checkDirs(). Both of these operations currently scan all directories in the
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks.
> inodes and dentries will be cached by linux and one can configure how likely
> the system is to prune those entries (vfs_cache_pressure). However, ext4
> relies on the buffer cache to cache the directory blocks and I'm not aware of
> any way to tell linux to favor buffer cache pages (even if it did I'm not
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume,
> this basically means the 64K directory blocks are probably randomly spread
> across the entire disk. A du type scan will look at directories one at a
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the
> seeks will be random and far.
> In a system I was using to diagnose this, I had 60K blocks. A DU when things
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer
> cache out, causing the next DU to hit this situation. We are seeing cases
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would
> be a few hundred directory blocks. Even when completely cold, these would
> only take a few a hundred seeks which would mean single digit seconds.
> - With only a few hundred directories, the odds of the directory blocks
> getting modified is quite high, this keeps those blocks hot and much less
> likely to be evicted.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)