[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635615#comment-14635615
 ] 

Colin Patrick McCabe commented on HDFS-8791:
--------------------------------------------

bq. I think ext2 and ext3 will see a similar problem. Are you seeing something 
different? I'll admit that my understanding of the differences isn't 
exhaustive, but it sure seems like all of them rely on the buffer cache to 
maintain directory blocks and all of them try to spread directories across the 
disk, so they'd all be subject to the same sort of thing.

ext2 is more or less extinct in production, at least for us.  ext3 is still in 
use on some older clusters, but it has known performance issues compared with 
ext4, so we're trying to phase it out as well.  We haven't seen the very long 
startup times you're describing, although the back-of-the-envelope math related 
to disk seeks during startup is concerning.
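
To spell that math out (the numbers here are ballpark assumptions, not measurements from your cluster): the new layout has 256 * 256 = 65,536 leaf directory blocks scattered across the platter, so a fully cold scan at roughly 10 ms per random seek is on the order of 65,536 * 10 ms, or about 655 seconds -- roughly 11 minutes -- before even counting the inode reads and the block files themselves.  That's the same order of magnitude as the ~20 minute cold du reported in the description.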

bq. I forgot to mention that I'm pretty confident it's not the inodes, but 
rather the directory blocks. inodes have their own cache that I can control 
with vfs_cache_pressure. directory blocks however are just cached via the 
buffer cache (afaik), and the buffer cache is much more difficult to have any 
control over.

I'm having trouble understanding these kernel settings. 
http://www.gluster.org/community/documentation/index.php/Linux_Kernel_Tuning 
says that "When vfs_cache_pressure=0, the kernel will never reclaim dentries 
and inodes due to memory pressure and this can easily lead to out-of-memory 
conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to 
prefer to reclaim dentries and inodes."  So that would seem to indicate that 
vfs_cache_pressure does have control over dentries (i.e. the "directory blocks" 
which contain the list of child inodes).  What settings have you used for 
{{vfs_cache_pressure}} so far?
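
For reference, here is a minimal sketch of checking what a DataNode host is actually running with (this assumes a Linux host with /proc mounted, and is not part of the DataNode code; it's equivalent to running {{sysctl vm.vfs_cache_pressure}} on the command line):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class VfsCachePressure {
  public static void main(String[] args) throws IOException {
    // Read the current vm.vfs_cache_pressure sysctl value; the kernel default is 100.
    String value = new String(
        Files.readAllBytes(Paths.get("/proc/sys/vm/vfs_cache_pressure"))).trim();
    System.out.println("vm.vfs_cache_pressure = " + value);
  }
}
{code}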

bq. I'm wondering if we shouldn't move to a hashing scheme that is more dynamic 
and grows/shrinks based on the number of blocks in the volume...

The problem that we have is that we have a tension between two things:
* If directories get too big, the readdir() needed to find the genstamp file of 
each block file gets very expensive (see the sketch just after this list).
* If directories get too small, they tend to drop out of the cache since they 
are rarely accessed.
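
To make the first point concrete, here's a minimal sketch of how finding a block's meta file works under the current naming scheme (illustrative code only, not the actual DataNode implementation): every lookup pays for a listing of the whole leaf directory, because the genstamp in the file name isn't known up front.

{code:java}
import java.io.File;
import java.io.FilenameFilter;

public class FindMetaFile {
  /**
   * Find the meta file for a block under the current naming scheme, where the
   * genstamp is encoded in the file name (e.g. blk_1073741915_1091.meta).
   * Since we don't know the genstamp, we have to list and filter the directory.
   */
  static File findMetaFile(File dir, long blockId) {
    final String prefix = "blk_" + blockId + "_";
    File[] matches = dir.listFiles(new FilenameFilter() {
      @Override
      public boolean accept(File d, String name) {
        return name.startsWith(prefix) && name.endsWith(".meta");
      }
    });
    // Cost is proportional to the number of entries in the directory, and the
    // listing itself is what drags the directory blocks into the buffer cache.
    return (matches != null && matches.length > 0) ? matches[0] : null;
  }

  public static void main(String[] args) {
    File dir = new File(args[0]);           // a leaf directory, e.g. .../subdir12/subdir34
    long blockId = Long.parseLong(args[1]); // e.g. 1073741915
    System.out.println(findMetaFile(dir, blockId));
  }
}
{code}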

I think if we're going to change the on-disk layout format again, we should 
change the way we name meta files.  Currently, we encode the genstamp in the 
file name, like {{blk_1073741915_1091.meta}}.  This means that to look up the 
meta file for block {{1073741915}}, we have to iterate through every file in 
the subdirectory until we find it.  Instead, we could simply name the meta file 
as {{blk_1073741915.meta}} and put the genstamp number in the meta file header.  
This would allow us to move to a scheme with a very large number of blocks in 
each directory (perhaps a simple 1-level hashing scheme), and the dentries would 
always be "hot".  ext4 and other modern Linux filesystems deal very effectively 
with large directories; it's only ext2, and ext3 without certain options 
enabled, that had problems.
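
A rough sketch of what the read path could look like under that proposal (purely illustrative; storing the genstamp as the first 8 bytes of the header is a hypothetical layout used only for this sketch, not an existing format):

{code:java}
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadGenstampFromHeader {
  /**
   * Under the proposed naming, the meta file name is derived purely from the
   * block ID (blk_<id>.meta), so we can open it directly without listing the
   * directory; the genstamp is read from the file header instead.
   */
  static long readGenstamp(File dir, long blockId) throws IOException {
    File meta = new File(dir, "blk_" + blockId + ".meta");
    try (DataInputStream in = new DataInputStream(new FileInputStream(meta))) {
      return in.readLong();  // hypothetical: genstamp stored at offset 0 of the header
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(readGenstamp(new File(args[0]), Long.parseLong(args[1])));
  }
}
{code}

The win is that the path is computable from the block ID alone, so no readdir() is needed on the read path at all.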

Since a layout version change is such a heavy hammer, though, I wonder if 
there's some simple tweak we can make that will avoid this issue.  Have you 
tried using xfs instead of ext4?  Perhaps it handles caching differently.  I 
think at some point we should pull out systemtap or LTTng and really find out 
what specifically is falling out of the cache and why.

> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8791
>                 URL: https://issues.apache.org/jira/browse/HDFS-8791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Nathan Roberts
>            Priority: Critical
>
> We are seeing cases where the new directory layout causes the datanode to 
> keep the disks seeking for tens of minutes. This can happen when the 
> datanode is running du, and also when it is performing a checkDirs(). Both 
> of these operations currently scan all directories in the block pool, and 
> that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by Linux, and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks, and I'm not aware 
> of any way to tell Linux to favor buffer cache pages (even if there were, I'm 
> not sure I would want to use it in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> which basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du-type scan looks at directories one at a time, so 
> the I/O scheduler can't optimize the corresponding seeks, meaning the seeks 
> will be random and far apart. 
> In a system I was using to diagnose this, I had 60K blocks. A du when things 
> are hot takes less than 1 second; when things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks runs on the node. This pushes almost all of the buffer 
> cache out, causing the next du to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout had 
> only a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified are quite high, which keeps those blocks hot and much less 
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
