[
https://issues.apache.org/jira/browse/HDFS-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256742#comment-13256742
]
Colin Patrick McCabe commented on HDFS-3290:
--------------------------------------------
Hi Kihwal,
The DataNode currently keeps all of the block files for a single BlockPool in
a single directory. Unless you are using federation, this means that all of
the block files on a DataNode are in one directory. This becomes inefficient
as the number of blocks grows, because local filesystems handle very large
directories poorly.
The idea is to make a small, incremental change to the directory structure, so
that the block files are spread across multiple directories rather than all
sitting in the same one. This is similar to how git lays out its object store:
{code}
cmccabe@keter:~/hadoop2> ls .git/objects/
00 09 12 1b 24 2d 36 3f 48 51 5a 63 6c 75 7e 87 90 9a a3 ac b5 be c7 d0 d9 e2 eb f4 fd
01 0a 13 1c 25 2e 37 40 49 52 5b 64 6d 76 7f 88 92 9b a4 ad b6 bf c8 d1 da e3 ec f5 fe
...
{code}
The subdirectories contain the object files:
{code}
cmccabe@keter:~/hadoop2> ls .git/objects/00
005d570a8ba44e314bb33db88499c0d385c66d 517f2a598d935eebac57453fb376c955184d72 fef1d2a78c2d30cede61a734b8fd1ae5c5f28f
41b3f30cb773267de5bb2a47169fae40ea65d7 68451b4b9acb9100fe78efc1d0b0283acc2024
471dd7fbdb1ccd6e7e079cd994b2920c5c93a8 fe8841111800e846bb961308224a826c33971c
{code}
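As a rough illustration of the git layout shown above (a sketch, not git's
actual code): assuming SHA-1 object names of 40 hex digits, as in this
listing, git uses the first two digits (8 bits) as the subdirectory and the
remaining 38 as the file name. The function name git_object_path is made up
for this example.

```python
import posixpath
import string

def git_object_path(object_id, objects_dir=".git/objects"):
    """Return the loose-object path for a 40-hex-digit object name."""
    if len(object_id) != 40 or any(c not in string.hexdigits for c in object_id):
        raise ValueError("expected a 40-character hex object name")
    # The first two hex digits (8 bits) select the subdirectory;
    # the remaining 38 digits are the file name inside it.
    return posixpath.join(objects_dir, object_id[:2], object_id[2:])

# Object name rebuilt from the listing: directory "00" plus the first file.
print(git_object_path("00005d570a8ba44e314bb33db88499c0d385c66d"))
```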
In contrast, the DataNode puts everything in the same directory:
{code}
cmccabe@keter:~/hadoop1> ls /opt/hadoop/run4/data1/current/BP-1579759935-127.0.0.1-1333677135630/current/finalized/
blk_2787038401297968504 blk_4287797753246219082 blk_-7903322996600832353
blk_2787038401297968504_1013.meta blk_4287797753246219082_1005.meta blk_-7903322996600832353_1003.meta
blk_3011630105542771325 blk_-5154897827824037676 blk_-8206000470100252669
blk_3011630105542771325_1007.meta blk_-5154897827824037676_1017.meta blk_-8206000470100252669_1009.meta
blk_3119417112012450397 blk_-6449276351298923965
blk_3119417112012450397_1015.meta blk_-6449276351298923965_1011.meta
{code}
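To make the proposal concrete, here is a minimal sketch of how a block ID
could be mapped to a subdirectory. Everything in it is an assumption for
illustration: the 8-bit fan-out is borrowed from git, the choice of the low
bits is arbitrary, and the names SUBDIR_BITS, subdir_for_block and
block_file_path are hypothetical. It is not the layout HDFS actually uses.

```python
import posixpath

SUBDIR_BITS = 8  # assumption: same fan-out as git, 2**8 = 256 subdirectories

def subdir_for_block(block_id):
    """Map a signed 64-bit block ID to a two-hex-digit subdirectory name."""
    # Masking keeps the low 8 bits. For negative IDs Python's & behaves like
    # two's complement, so the result is still in the range 0..255.
    return format(block_id & ((1 << SUBDIR_BITS) - 1), "02x")

def block_file_path(finalized_dir, block_id):
    """Hypothetical sharded location of a block file under finalized/."""
    return posixpath.join(finalized_dir, subdir_for_block(block_id),
                          "blk_%d" % block_id)

# Two block IDs taken from the listing above, one positive and one negative.
for block_id in (2787038401297968504, -7903322996600832353):
    print(block_file_path("finalized", block_id))
# prints:
# finalized/78/blk_2787038401297968504
# finalized/9f/blk_-7903322996600832353
```

If the low bits of the block IDs are roughly uniform, 256,000 block files
would average about 1,000 per subdirectory instead of all sharing one.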
P.S. Yes, I am aware of the rbw (replica being written) directory. However,
most of the blocks are in finalized, not in that directory.
cheers,
Colin
> Use a better local directory layout for the datanode
> ----------------------------------------------------
>
> Key: HDFS-3290
> URL: https://issues.apache.org/jira/browse/HDFS-3290
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 0.23.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
>
> When the HDFS DataNode stores blocks in a local directory, it currently puts
> all of the block files into one big directory. As the number of files
> increases, this performs poorly: local filesystems are not optimized for
> the case where there are hundreds of thousands of files in the same
> directory. It also makes inspecting directories with standard UNIX tools
> difficult.
> Similar to the git version control system, HDFS should create a few different
> top-level directories keyed off of a few bits in the block ID. Git uses 8
> bits (the first two hex digits of the object name). This substantially cuts
> down on the number of block files in any one directory and should improve
> performance.