Hi Enis & Hadoopers,
thanks for the hint. I created/modified my RecordReader so that it uses
MultiFileInputSplit and reads 30 files at once (by spawning several
threads and using a bounded buffer à la producer/consumer). The
aggregate throughput is now about 1 MB/s on my 30 MB test data (spread
over 300 files).
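
For anyone curious, the reader side looks roughly like the sketch below. It is not my actual RecordReader, just a minimal stand-alone illustration of the producer/consumer idea in plain Java: the class name BoundedMultiFileReader, the readAll method and the use of local files instead of DFS streams are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: several reader threads feed one bounded buffer,
// and a single consumer drains it (as a RecordReader's next() would).
class BoundedMultiFileReader {

    static List<String> readAll(List<Path> files, int readers, int capacity)
            throws InterruptedException {
        // An empty Optional marks "this reader thread is finished".
        BlockingQueue<Optional<String>> buffer = new ArrayBlockingQueue<>(capacity);
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        for (int r = 0; r < readers; r++) {
            final int first = r;
            pool.execute(() -> {
                try {
                    try {
                        // Reader r handles files r, r + readers, r + 2 * readers, ...
                        for (int i = first; i < files.size(); i += readers) {
                            for (String line : Files.readAllLines(files.get(i))) {
                                buffer.put(Optional.of(line)); // blocks while the buffer is full
                            }
                        }
                    } catch (IOException e) {
                        System.err.println("reader " + first + " stopped early: " + e);
                    } finally {
                        buffer.put(Optional.empty()); // always signal completion
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();

        // Consumer: take records until every reader has sent its end marker.
        List<String> records = new ArrayList<>();
        int finished = 0;
        while (finished < readers) {
            Optional<String> item = buffer.take();
            if (item.isPresent()) {
                records.add(item.get());
            } else {
                finished++;
            }
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("multifile");
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            files.add(Files.write(dir.resolve("part-" + i), Arrays.asList("record " + i)));
        }
        System.out.println(readAll(files, 4, 8).size() + " records read");
    }
}
```

The bounded queue is what keeps memory in check: fast readers simply block in put() until the consumer catches up. Records from different files arrive interleaved, so this only fits jobs that do not depend on record order.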
However, I noticed some other bottlenecks during job submission: submitting
a job with 53,000 files spread over 18,150 folders takes about 1 hour
and 45 minutes.
Since all the files are spread over several thousand directories,
listing/traversing those directories using the listpath / globpaths
method generates several thousand RPC calls. I think it would be more
efficient to send the regex/path expression (the parameters) of the
globpaths method to the server and traverse the directory tree on the
server side instead of on the client side. Or is there another way to
retrieve all the file paths?
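
To make the idea concrete, here is a rough sketch of what I mean by a server-side traversal. Nothing in it is existing Hadoop API: ServerSideGlob and listMatching are made-up names, and a local directory tree walked with java.nio stands in for the DFS namespace. The point is only that one request carries the pattern and one response carries every match.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: the "server" receives the glob once and walks
// the tree itself, instead of the client listing every directory.
class ServerSideGlob {

    // Returns all regular files under root whose path relative to root
    // matches the glob, as one sorted list (a single response).
    static List<String> listMatching(Path root, String glob) throws IOException {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> tree = Files.walk(root)) {
            return tree.filter(Files::isRegularFile)
                       .map(root::relativize)
                       .filter(matcher::matches)
                       .map(Path::toString)
                       .sorted()
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny tree: two folders, three files, one of them not matching.
        Path root = Files.createTempDirectory("globdemo");
        Files.createDirectories(root.resolve("d1"));
        Files.createDirectories(root.resolve("d2"));
        Files.write(root.resolve("d1").resolve("a.txt"), Arrays.asList("x"));
        Files.write(root.resolve("d2").resolve("b.txt"), Arrays.asList("y"));
        Files.write(root.resolve("d2").resolve("c.log"), Arrays.asList("z"));

        // One call instead of one listing call per directory.
        System.out.println(listMatching(root, "*/*.txt"));
    }
}
```

With 18,150 folders the client-side version needs at least one round trip per folder, while a call shaped like this needs one in total, at the price of a larger response and more work on the server.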
Also, for each of my thousands of files, a getBlockLocation RPC call is/was
generated. I added a getBlockLocations[] method that accepts an array
of paths and returns a String[][][] matrix instead, which is much more
efficient than generating thousands of RPC calls when calling
getBlockLocation in the MultiFileSplit class.
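
The shape of the batched call is roughly the following. This is a simplified illustration, not my patch: FakeNameNode is an in-memory stand-in that merely counts round trips, and the method signatures are assumptions for the example rather than the real Hadoop interfaces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one batched lookup instead of one call per file.
class BatchedBlockLocations {

    // In-memory stand-in for the remote side; rpcCalls counts round trips.
    static class FakeNameNode {
        private final Map<String, String[][]> blocks = new LinkedHashMap<>();
        int rpcCalls = 0;

        void addFile(String path, String[][] hostsPerBlock) {
            blocks.put(path, hostsPerBlock);
        }

        // One round trip per file: result[b] lists the hosts of block b.
        String[][] getBlockLocation(String path) {
            rpcCalls++;
            return lookup(path);
        }

        // One round trip for the whole batch:
        // result[i][b] lists the hosts of block b of paths[i].
        String[][][] getBlockLocations(String[] paths) {
            rpcCalls++;
            String[][][] result = new String[paths.length][][];
            for (int i = 0; i < paths.length; i++) {
                result[i] = lookup(paths[i]);
            }
            return result;
        }

        // Unknown paths yield an empty block list instead of null.
        private String[][] lookup(String path) {
            String[][] hosts = blocks.get(path);
            return hosts != null ? hosts : new String[0][];
        }
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();
        nn.addFile("/data/a", new String[][] {{"host1", "host2"}});
        nn.addFile("/data/b", new String[][] {{"host2"}, {"host3"}});

        String[][][] all = nn.getBlockLocations(new String[] {"/data/a", "/data/b"});
        System.out.println("files: " + all.length + ", round trips: " + nn.rpcCalls);
    }
}
```

The three array levels are file, block and host, so all[1][0] holds the hosts of the first block of the second path. The order of the result follows the order of the input array, which lets the caller match locations to paths without sending the paths back.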
Any thoughts/comments are much appreciated!
Thanks in advance!
Cu on the 'net,
Bye - bye,
<<<<< André <<<< >>>> èrbnA >>>>>
Enis Soztutar wrote:
Hi,
I think you should try using MultiFileInputFormat/MultiFileInputSplit
rather than FileSplit, since the former is optimized for processing
a large number of files. Could you report your numMaps and numReduces and
the average time the map() function is expected to take?