[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654234#comment-14654234
 ] 

Colin Patrick McCabe commented on HDFS-8791:
--------------------------------------------

The motivation behind the new layout was to eventually free the DataNode of the 
need to keep all block metadata in memory at all times.  We are entering a world 
where hard drive capacities double every year while CPU and network bandwidth 
grow at a relatively slower pace, so keeping information about every replica 
permanently paged into memory looks antiquated.  The new layout avoids this by 
letting us find any block based solely on its ID.  It is essentially the 
equivalent of paged metadata, but for the DN.
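
Just to illustrate the "find any block based on its ID" point, here is a rough 
sketch of the kind of mapping the layout allows (the exact bytes of the ID used 
for the two directory levels, and the example path, are assumptions here, not 
the actual DataNode code):

{code:java}
import java.io.File;

public class BlockIdLayoutSketch {
  // Illustrative only: with 256 x 256 subdirectories, two bytes of the block ID
  // are enough to pick the leaf directory, so no in-memory map from block ID to
  // path is needed.
  static File idToBlockDir(File finalizedDir, long blockId) {
    int d1 = (int) ((blockId >> 16) & 0xFF);  // first-level subdir, 0..255
    int d2 = (int) ((blockId >> 8) & 0xFF);   // second-level subdir, 0..255
    return new File(finalizedDir, "subdir" + d1 + File.separator + "subdir" + d2);
  }

  public static void main(String[] args) {
    // Prints the leaf directory for block 1073741825 under an example
    // finalized directory.
    System.out.println(idToBlockDir(
        new File("/data/1/dfs/dn/current/finalized"), 1073741825L));
  }
}
{code}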

We didn't think about the "du problem" when discussing the new layout.  It 
looks like HDFS ends up running a du over all of the replica files quite a lot: 
we do it after every I/O error, and we also do it on startup.  I think it's 
pretty silly that we run du after every I/O error-- we could certainly change 
that-- and the fact that it's not rate-limited is even worse.  We don't even 
confine the "du" to the drive where the I/O error occurred; we do it on every 
drive.  I don't think anyone can give a good reason for that, and it should 
certainly be changed as well.
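
A hypothetical sketch of the change suggested above-- only re-scan the volume 
that saw the error, and no more often than some minimum interval (the class and 
method names, and the interval, are made up for illustration, not existing 
DataNode APIs):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class RateLimitedDuSketch {
  // Assumed minimum interval between scans of the same volume: 10 minutes.
  private static final long MIN_INTERVAL_MS = 10 * 60 * 1000L;
  private final ConcurrentMap<String, Long> lastScanMs = new ConcurrentHashMap<>();

  /** Called when an I/O error is observed on a particular volume. */
  boolean shouldRescan(String volumePath, long nowMs) {
    Long last = lastScanMs.get(volumePath);
    if (last != null && nowMs - last < MIN_INTERVAL_MS) {
      return false;   // rate-limited: skip the du this time
    }
    lastScanMs.put(volumePath, nowMs);
    return true;      // scan only this volume, not every drive on the node
  }
}
{code}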

The startup issue is more difficult to avoid.  If we have to do a "du" on all 
files during startup, it could cause very long startup times whenever that 
involves a lot of seeks.  Both the old and the new layout would have major 
problems with this scenario if you look out a year or two and multiply the 
current number of replicas by 8 or 16.
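
Back-of-the-envelope (my numbers, not from the report below): with 256 * 256 = 
65,536 leaf directory blocks scattered across a drive, reading them one at a 
time at a random-seek cost of roughly 10-20 ms each already works out to 
somewhere around 10-20 minutes per volume, before counting the block and 
{{meta}} files themselves.  Multiply the replica count by 8 or 16 and the 
per-file portion of the scan starts to dominate on top of that.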

If we are going to bump the layout version again, we might want to consider 
something like keeping the replica metadata in leveldb.  This would avoid the 
need to do a "du" on startup and allow us to control our own caching.  It could 
also cut the number of ext4 files in half, since we wouldn't need {{meta}} any 
more.
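
To make the leveldb idea concrete, here is a minimal sketch of what a 
leveldb-backed replica map might look like (the key/value encoding and class 
name are assumptions for illustration, not a worked-out design):

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

class ReplicaMetaDbSketch implements AutoCloseable {
  private final DB db;

  ReplicaMetaDbSketch(File dbDir) throws IOException {
    // One leveldb instance per volume; lookups and startup enumeration become
    // leveldb reads instead of a full directory walk.
    this.db = JniDBFactory.factory.open(dbDir, new Options().createIfMissing(true));
  }

  /** Record a finalized replica: key = block ID, value = genstamp + length. */
  void putReplica(long blockId, long genStamp, long numBytes) {
    byte[] key = ByteBuffer.allocate(8).putLong(blockId).array();
    byte[] val = ByteBuffer.allocate(16).putLong(genStamp).putLong(numBytes).array();
    db.put(key, val);
  }

  /** Look up a replica's {genstamp, length} without touching the block file. */
  long[] getReplica(long blockId) {
    byte[] val = db.get(ByteBuffer.allocate(8).putLong(blockId).array());
    if (val == null) {
      return null;
    }
    ByteBuffer buf = ByteBuffer.wrap(val);
    return new long[] { buf.getLong(), buf.getLong() };
  }

  @Override
  public void close() throws IOException {
    db.close();
  }
}
{code}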

> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8791
>                 URL: https://issues.apache.org/jira/browse/HDFS-8791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Nathan Roberts
>            Priority: Critical
>
> We are seeing cases where the new directory layout causes the datanode to 
> keep the disks seeking for tens of minutes. This can happen when the 
> datanode is running du, and also when it is performing a checkDirs(). Both 
> of these operations currently scan all directories in the block pool, and 
> that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> Inodes and dentries will be cached by Linux, and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks, and I'm not aware of 
> any way to tell Linux to favor buffer cache pages (even if there were, I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> which means the 64K directory blocks are probably scattered randomly across 
> the entire disk. A du-type scan looks at directories one at a time, so the 
> I/O scheduler can't optimize the corresponding seeks, meaning the seeks will 
> be random and far apart.
> On a system I was using to diagnose this, I had 60K blocks. A du when things 
> are hot takes less than 1 second; when things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks runs on the node. This pushes almost all of the buffer 
> cache out, causing the next du to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout 
> involved only a few hundred directory blocks. Even when completely cold, these 
> would take only a few hundred seeks, which means single-digit seconds.
> - With only a few hundred directories, the odds of a directory block getting 
> modified are quite high, which keeps those blocks hot and much less likely to 
> be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
