Hi

FileInputFormat sub-classes (TextInputFormat and SequenceFileInputFormat) are able to take all files in a folder and split the work of handling them into several sub-jobs (map tasks). I know they can split a very big file into several sub-jobs, but how do they handle many small files in a folder? If there are 10000 small files, each with 100 data records, I would not like my sub-jobs to become too small (due to the overhead of starting a JVM for each sub-job etc.). I would like e.g. 100 sub-jobs, each handling about 10000 data records, or maybe 10 sub-jobs, each handling about 100000 data records, but I would not like 10000 sub-jobs, each handling only 100 data records. For this to be possible, one split (the work to be done by one sub-job) will have to span more than one file. My question is whether FileInputFormat sub-classes are able to make such splits, or whether they always create at least one split (= sub-job = map task) per file?
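To make it concrete, here is roughly what I am hoping is possible. This is just a sketch, assuming the Hadoop 2 mapreduce API and that CombineTextInputFormat (which, as far as I understand, packs multiple small files into one split) is available in my version; the 128 MB cap and the input path are only example values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-files");
            job.setJarByClass(SmallFilesJob.class);
            // CombineTextInputFormat packs many small files into each split,
            // so one map task can cover thousands of files instead of one.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 128 MB (example value).
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            CombineTextInputFormat.addInputPath(job, new Path("/data/small-files"));
            // ... mapper/reducer setup elided ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If TextInputFormat and SequenceFileInputFormat cannot span files themselves, is something like the above the intended way to do it?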

Another thing: I expect that FileInputFormat has to somehow list the files in the folder. How does this listing handle many, many files in the folder? Most OSs are bad at listing files in folders when there are a lot of files - at some point it becomes worse than O(n), where n is the number of files. Windows of course really sucks at it, and even Linux has problems with very high numbers of files. How does HDFS handle listing the files in a folder that contains many, many files? Or maybe I should address this question to the hdfs mailing list?
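To frame the question, this is the kind of client-side listing I have in mind. A minimal sketch, assuming FileSystem.listStatus (which materializes the whole FileStatus[] at once); the path is an example, and my concern is what the NameNode does when that array would contain millions of entries:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListFolder {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Returns the entire directory listing in one array -
            // how does this scale with millions of children?
            FileStatus[] children = fs.listStatus(new Path("/data/small-files"));
            System.out.println("files: " + children.length);
        }
    }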

Regards, Per Steffensen
