[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225682#comment-15225682 ]

Andrew Wang commented on HDFS-8791:
-----------------------------------

bq. This is really unfortunate. Can you give a reference to the NameNode 
LayoutVersion change?

[~vinodkv] we made a few changes between 2.6 and 2.7; you can look at 
NameNodeLayoutVersion.java for the short summary. Most of them add new 
edit log ops, but truncate is a bigger one.

bq. Did we ever establish clear rules about downgrades? We need to lay out 
our story around supporting downgrades continuously and codify it.

We've never addressed downgrade in our compatibility policy, so officially we 
don't have anything.

bq. I'd vote for keeping strict rules for downgrades too; otherwise users are 
left to fend for themselves in deciding the risk associated with every version 
upgrade - are we in a place where we can support this?

We aren't yet, unless we want to exclude most larger features from branch-2. 
Long ago HDFS-5223 was filed to add feature flags to HDFS, which would enable 
downgrade in more scenarios. We never reached consensus though, and it stalled 
out since people seemed to like rolling upgrade (HDFS-5535) more. It can be 
revived, but it does mean additional testing complexity, and new features 
would need to be developed with feature flags in mind.

I like this sentiment in general though, and would be in favor of requiring 
upgrade and downgrade within a major version, once we have HDFS-5223 in. It can 
be added compatibly to branch-2 since Colin did the groundwork in HDFS-5784.

bq. To conclude, is the consensus to document all these downgrade related 
breakages but keep them in 2.7.x and 2.8?

Unless we have new energy to pursue HDFS-5223, I think we'll keep doing LV 
changes. We've been doing them in almost every 2.x release, so user 
expectations should already be in line with that. AFAIK we've never announced a change in our 
support for downgrade in the 2.x line.

It's also worth noting that HDFS-5223 was about NN LV changes; it predates the 
NN and DN LV split. Thus I'm not sure the feature flags work in HDFS-5784 
would apply to this particular JIRA.

> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8791
>                 URL: https://issues.apache.org/jira/browse/HDFS-8791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0, 2.8.0, 2.7.1
>            Reporter: Nathan Roberts
>            Assignee: Chris Trezzo
>            Priority: Blocker
>             Fix For: 2.7.3
>
>         Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v3-bin.patch, 
> hadoop-56-layout-datanode-dir.tgz, test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> keep the disks seeking for tens of minutes. This can happen when the 
> datanode is running du, and also when it is performing a checkDirs(). Both 
> of these operations currently scan all directories in the block pool, and 
> that is very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs: essentially 64K 
> leaf directories where block files are placed.
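> For reference, here is a minimal sketch of how a block-ID-based layout of 
> this shape maps a block ID to its leaf directory. The masks and names below 
> are illustrative only, not a copy of the actual DatanodeUtil code:
> {code:java}
> // Illustrative sketch: map a block ID to a two-level 256x256 subdir tree,
> // mirroring the layout described above (not the exact HDFS implementation).
> import java.io.File;
>
> public class BlockDirSketch {
>   static File idToBlockDir(File finalizedDir, long blockId) {
>     int d1 = (int) ((blockId >> 16) & 0xFF);  // first level, 0..255
>     int d2 = (int) ((blockId >> 8) & 0xFF);   // second level, 0..255
>     return new File(finalizedDir,
>         "subdir" + d1 + File.separator + "subdir" + d2);
>   }
>
>   public static void main(String[] args) {
>     // e.g. block 1073741825 lands in .../subdir0/subdir0 with these masks
>     System.out.println(idToBlockDir(
>         new File("/data/dfs/current/finalized"), 1073741825L));
>   }
> }
> {code}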
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
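> For rough scale (back-of-the-envelope, assuming one 4 KB ext4 directory 
> block per directory, which is the minimum): 256 + 256*256 = 65,792 directory 
> blocks, or roughly 257 MB that has to stay resident in the buffer cache just 
> to walk the directory tree without seeking.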
> Inodes and dentries will be cached by Linux, and one can configure how 
> aggressively the system prunes those entries (vfs_cache_pressure). However, 
> ext4 relies on the buffer cache to hold the directory blocks, and I'm not 
> aware of any way to tell Linux to favor buffer cache pages (even if there 
> were, I'm not sure I would want that in general).
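> (For reference, the knob mentioned is the vm.vfs_cache_pressure sysctl; its 
> default of 100 reclaims dentries and inodes at a "fair" rate relative to the 
> page cache, and lowering it makes the kernel hold on to them longer. As 
> noted above, there is no comparable knob that favors buffer cache pages such 
> as ext4 directory blocks.)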
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> which basically means the 64K directory blocks are probably scattered 
> randomly across the entire disk. A du-type scan looks at directories one at 
> a time, so the I/O scheduler can't optimize the corresponding seeks, meaning 
> the seeks will be random and far apart. 
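> To make the access pattern concrete, here is a minimal sketch of the kind of 
> one-directory-at-a-time traversal that du (and checkDirs) amounts to. Each 
> listFiles() call has to read that directory's blocks before the scan can 
> move on, so when the cache is cold the I/O scheduler only ever sees one 
> outstanding random seek at a time:
> {code:java}
> // Illustrative sketch of a serial, depth-first disk-usage scan.
> // Cold directory blocks turn each listFiles() into a random seek,
> // and the seeks happen one after another with nothing to merge.
> import java.io.File;
>
> public class DuSketch {
>   static long du(File dir) {
>     long bytes = 0;
>     File[] entries = dir.listFiles();   // reads this directory's blocks
>     if (entries == null) {
>       return 0;
>     }
>     for (File f : entries) {
>       bytes += f.isDirectory() ? du(f) : f.length();
>     }
>     return bytes;
>   }
>
>   public static void main(String[] args) {
>     System.out.println(du(new File(args[0])) + " bytes");
>   }
> }
> {code}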
> In a system I was using to diagnose this, I had 60K blocks. A du when the 
> caches are hot takes less than 1 second; when they are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout 
> used only a few hundred directory blocks. Even when completely cold, these 
> would only take a few hundred seeks, which means single-digit seconds.  
> - With only a few hundred directories, the odds of any given directory block 
> getting modified are quite high, which keeps those blocks hot and much less 
> likely to be evicted.
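> (A rough sanity check, assuming ~10 ms per cold random seek: a few hundred 
> directory blocks means a few hundred seeks, i.e. a couple of seconds, while 
> 65,792 directory blocks alone already means roughly 11 minutes of seeking 
> before any block-file inodes are read, which is consistent with the 
> ~20 minute cold du above.)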


