On Aug 17, 2010, at 5:44 PM, Stuart Smith wrote:
> I started to break the files into subdirectories out of habit (from working 
> on ntfs/etc), but it occurred to me that maybe (from a performance 
> perspective), it doesn't really matter on hdfs.
> 
> Does it? Is there some recommended limit on the number of files to store in 
> one directory on hdfs? I'm thinking thousands to millions, so we're not 
> talking about INT_MAX or anything, but a lot.
> 
> Or is it only limited by my sanity :) ?

We have a directory with several thousand files in it.

It is always a pain when we hit it because the client heap size needs to be
increased to do anything in it: directory listings, the web UIs, distcp, etc,
etc, etc. Doing any sort of manipulation in that dir is also slower.
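
The heap pain comes from the fact that a listing hands the client the whole
directory in one shot. Roughly (a sketch against the Java FileSystem API; the
path is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListFlatDir {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // listStatus returns the entire directory as one array, so a
        // million files means a million FileStatus objects sitting in
        // the client heap at once.
        FileStatus[] entries = fs.listStatus(new Path("/data/flat"));
        System.out.println(entries.length + " entries");
      }
    }

That's why we end up bumping HADOOP_CLIENT_OPTS just to run an ls in there.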

My recommendation: don't do it.  Directories, AFAIK, are relatively cheap
resource-wise compared to piling lots of files into one.
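
If you want a scheme, hashing the file name into a couple of levels of
buckets keeps any one directory small. A sketch (the base path and the
256x256 fan-out are arbitrary choices on my part, not anything HDFS
requires):

    import org.apache.hadoop.fs.Path;

    public class Buckets {
      // Spread files across 256 x 256 subdirectories keyed on a hash
      // of the name, so each directory stays small even with millions
      // of files overall.
      public static Path bucketFor(String fileName) {
        int h = fileName.hashCode();
        int a = (h >>> 8) & 0xff;
        int b = h & 0xff;
        return new Path(String.format("/data/%02x/%02x/%s", a, b, fileName));
      }
    }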

[Hopefully these files are large.  Otherwise they should be joined together...
every file costs the namenode memory no matter how small it is, and MapReduce
will happily spawn a task per tiny file, so you're going to take a performance
hit processing them *and* storing them...]
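
One common way to join them is to pack them into a SequenceFile keyed by the
original file name (a sketch only -- paths are made up and there's no error
handling):

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One big file on HDFS instead of thousands of tiny ones:
        // key = original name, value = raw bytes.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/data/packed.seq"),
            Text.class, BytesWritable.class);
        for (File f : new File(args[0]).listFiles()) {
          byte[] bytes = Files.readAllBytes(f.toPath());
          writer.append(new Text(f.getName()), new BytesWritable(bytes));
        }
        writer.close();
      }
    }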
