Lohit is right. File creation will be slow if all 100,000 files are in one directory. Directory entries are implemented as a sorted array (an ArrayList), which makes lookup fast (binary search over the table) but makes insertion inefficient, because every entry after the insertion point must be shifted one slot to the right. This should be fixed at some point. For now, if you don't mind the create performance (which is still negligible compared to big-file writes), you can use large directories; otherwise (lots of small files), split them up. --Konstantin
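For illustration, here is a minimal sketch (plain Java, not the actual NameNode code) of the pattern Konstantin describes; the class and method names are made up:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of a directory whose entries live in a sorted ArrayList.
class SortedDirectory {
    private final List<String> entries = new ArrayList<>();

    // Lookup is cheap: binary search over the sorted list, O(log n).
    boolean contains(String name) {
        return Collections.binarySearch(entries, name) >= 0;
    }

    // Creation is expensive: ArrayList.add(index, element) shifts every
    // entry after the insertion point one slot to the right, so each
    // insert costs O(n) array moves on average.
    void create(String name) {
        int pos = Collections.binarySearch(entries, name);
        if (pos < 0) {
            // binarySearch returns -(insertionPoint) - 1 for a missing key
            entries.add(-pos - 1, name);
        }
    }
}

With 100,000 entries, each create moves roughly 50,000 references on average, which is why a flat directory of that size slows file creation even though lookups stay fast.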
Lohit wrote:

The last time I tried to load an image with lots of files in the same directory, it was about ten times slower. This is to do with the data structures. My numbers were in the millions, though. Try to use a directory structure. --Lohit

On Sep 17, 2008, at 11:57 AM, Nathan Marz <[EMAIL PROTECTED]> wrote:

Hello all,

Is it bad to have a lot of files in a single HDFS directory (i.e., on the order of hundreds of thousands)? Or should we split our files into a directory structure of some sort?

Thanks,
Nathan Marz
