Hello Per,

On Thu, Sep 1, 2011 at 2:27 PM, Per Steffensen <st...@designware.dk> wrote:
> Hi
>
> FileInputFormat sub-classes (TextInputFormat and SequenceFileInputFormat)
> are able to take all files in a folder and split the work of handling them
> into several sub-jobs (map-jobs). I know it can split a very big file into
> several sub-jobs, but how does it handle many small files in a folder? If
> there are 10000 small files, each with 100 data records, I would not like my
> sub-jobs to become too small (due to the overhead of starting a JVM for each
> sub-job etc.). I would like e.g. 100 sub-jobs each handling about 10000
> data records, or maybe 10 sub-jobs each handling about 100000 data records,
> but I would not like 10000 sub-jobs each handling about 100 data records. For
> this to be possible, one split (the work to be done by one sub-job) will have
> to span more than one file. My question is: are FileInputFormat sub-classes
> able to make such splits, or do they always create at least one
> split=sub-job=map-job per file?
You need CombineFileInputFormat for your case, not the vanilla FileInputFormat, which creates at least one split per file.

> Another thing: I expect that FileInputFormat has to somehow list the
> files in the folder. How does this listing handle many, many files in the
> folder? Most OSs are bad at listing files in folders when there are a lot
> of files - at some point it becomes worse than O(n), where n is the number
> of files. Windows of course really sucks, and even Linux has problems with
> a very high number of files. How does HDFS handle listing of files in a
> folder with many, many files? Or maybe I should address this question to
> the hdfs mailing list?

Listing is a single RPC call in HDFS, so it might take some time for millions of files under a single directory. Although the listing operation itself, being done in the NameNode's memory, is fast enough, transferring the results to the client for a large number of items may take some time -- and there is no way to page through them either. I do not think the complexity is worse than O(n), though, for files under the same directory - only the transfer cost should be your worry. But I have not measured these things, so I cannot give you concrete statements on this. Might be a good exercise?

Also note that the listing is only done on the front end (at job-client submission time) and not later on, so it is a one-time cost.

-- 
Harsh J
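To make the "combine" behaviour concrete, here is a minimal sketch in plain Java of the idea behind CombineFileInputFormat: greedily pack small files into one split until a size cap is reached. This is a simplification and an assumption on my part - the real implementation also groups by node and rack locality (and its cap is configured via its max-split-size setting) - but the packing arithmetic is the part that answers your question about 10000 small files turning into roughly 100 splits:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch of the split-packing idea behind CombineFileInputFormat:
// greedily add small files to the current split until the size cap would be
// exceeded, then start a new split. (The real class also considers locality.)
public class SplitPacker {

    // Each inner list holds the sizes of the files packed into one split.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Flush the current split if this file would push it over the cap.
            if (currentSize + size > maxSplitSize && !current.isEmpty()) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 10000 small files of ~10 KB each,
        // with each combined split capped at ~1 MB.
        long[] sizes = new long[10000];
        Arrays.fill(sizes, 10_240L);
        List<List<Long>> splits = pack(sizes, 1_048_576L);
        // Roughly 100 map tasks instead of 10000.
        System.out.println(splits.size());
    }
}
```

With these (made-up) numbers each split holds 102 files, so the 10000 files collapse into 99 splits - two orders of magnitude fewer map tasks, which is exactly the JVM-startup overhead you are trying to avoid.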