Hi
FileInputFormat sub-classes (TextInputFormat and
SequenceFileInputFormat) can take all the files in a folder and split
the work of handling them into several sub-jobs (map-jobs). I know they
can split one very big file into several sub-jobs, but how do they
handle many small files in a folder? If there are 10000 small files,
each with 100 data records, I would not like my sub-jobs to become too
small (due to the overhead of starting a JVM for each sub-job, etc.). I
would like, e.g., 100 sub-jobs each handling about 10000 data records,
or maybe 10 sub-jobs each handling about 100000 data records, but I
would not like 10000 sub-jobs each handling about 100 data records. For
this to be possible, one split (the work to be done by one sub-job)
would have to span more than one file. My question is: are
FileInputFormat sub-classes able to make such splits, or do they always
create at least one split = sub-job = map-job per file?
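
To make the question concrete, here is roughly what I would hope to be
able to write. This is only a sketch, assuming a Hadoop version that
ships CombineTextInputFormat (a CombineFileInputFormat sub-class which,
as far as I understand, packs several small files into one split); the
class name, folder path and the 128 MB cap are just example figures,
not something I have verified:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class ManySmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "many-small-files");
        job.setJarByClass(ManySmallFilesJob.class);

        // CombineTextInputFormat packs several small files into one
        // split, so one sub-job (map task) covers many files at once.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Cap each combined split at 128 MB (example figure) so splits
        // stay large enough to amortize the per-task JVM startup cost.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        CombineTextInputFormat.addInputPath(job, new Path("/input/folder"));
        // ... mapper, reducer and output setup as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If plain TextInputFormat/SequenceFileInputFormat already combine files
like this on their own, that would of course answer my question too.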
Another thing: I expect that FileInputFormat has to somehow list the
files in the folder. How does this listing handle a very large number
of files in the folder? Most OSes are bad at listing files in folders
that contain a lot of them; at some point it becomes worse than O(n),
where n is the number of files. Windows of course really sucks at this,
and even Linux has problems with very high numbers of files. How does
HDFS handle listing the files in a folder with very many files? Or
should I address this question to the hdfs mailing list instead?
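
For what it is worth, the listing I have in mind is something like the
following minimal sketch, assuming the standard FileSystem API (the
folder path is just an example). I am wondering how a call like this
behaves when the folder holds millions of entries:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFolder {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // List every entry in the folder in one call and time it, to
        // see whether the cost grows worse than linearly with the
        // number of files.
        long start = System.currentTimeMillis();
        FileStatus[] entries = fs.listStatus(new Path("/input/folder"));
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(entries.length + " entries listed in "
                + elapsed + " ms");
    }
}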
Regards, Per Steffensen