Hi
FileInputFormat sub-classes (TextInputFormat and
SequenceFileInputFormat) can take all the files in a folder and split
the work of handling them into several sub-jobs (map-jobs). I know they
can split one very big file into several sub-jobs, but how do they
handle many small files in a folder? If there are 10000 small files,
each with 100 data records, I would not like my sub-jobs to become too
small (due to the overhead of starting a JVM for each sub-job, etc.). I
would like, e.g., 100 sub-jobs each handling about 10000 data records,
or maybe 10 sub-jobs each handling about 100000 data records, but I
would not like 10000 sub-jobs each handling about 100 data records. For
this to be possible, one split (the work to be done by one sub-job)
would have to span more than one file. My question is: are
FileInputFormat sub-classes able to make such splits, or do they always
create at least one split = sub-job = map-job per file?
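
To make the question concrete, here is roughly what I would hope to be
able to write. This is only a sketch, assuming a Hadoop version that
ships CombineTextInputFormat (a CombineFileInputFormat sub-class which,
as far as I understand, packs several small files into one split); the
class name, folder path and the 128 MB cap are just example figures,
not something I have verified:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class ManySmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "many-small-files");
        job.setJarByClass(ManySmallFilesJob.class);

        // CombineTextInputFormat packs several small files into one
        // split, so one sub-job (map task) covers many files at once.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Cap each combined split at 128 MB (example figure) so splits
        // stay large enough to amortize the per-task JVM startup cost.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        CombineTextInputFormat.addInputPath(job, new Path("/input/folder"));
        // ... mapper, reducer and output setup as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If plain TextInputFormat/SequenceFileInputFormat already combine files
like this on their own, that would of course answer my question too.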
Another thing: I expect that FileInputFormat has to somehow list the
files in the folder. How does this listing handle a very large number
of files in the folder? Most OSes are bad at listing files in folders
that contain a lot of them; at some point it becomes worse than O(n),
where n is the number of files. Windows of course really sucks at this,
and even Linux has problems with very high numbers of files. How does
HDFS handle listing the files in a folder with very many files? Or
should I address this question to the hdfs mailing list instead?
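
For what it is worth, the listing I have in mind is something like the
following minimal sketch, assuming the standard FileSystem API (the
folder path is just an example). I am wondering how a call like this
behaves when the folder holds millions of entries:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFolder {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // List every entry in the folder in one call and time it, to
        // see whether the cost grows worse than linearly with the
        // number of files.
        long start = System.currentTimeMillis();
        FileStatus[] entries = fs.listStatus(new Path("/input/folder"));
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(entries.length + " entries listed in "
                + elapsed + " ms");
    }
}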
Regards, Per Steffensen