Hi Enis & Hadoopers,
thanks for the hint. I modified my RecordReader so that it uses MultiFileInputSplit and reads 30 files at once (by spawning several threads and using a bounded buffer à la producer/consumer). The accumulated throughput is now about 1 MB/s on my 30 MB of test data (spread over 300 files).

However, I noticed some other bottlenecks during job submission: submitting a job with 53,000 files spread over 18,150 folders takes about 1 hr 45 min. Since the files are spread over several thousand directories, listing/traversing those directories with the listpath / globpaths methods generates several thousand RPC calls. I think it would be more efficient to send the regex/path expression (the parameters of the globpaths method) to the server and traverse the directory tree on the server side instead of the client side. Or is there another way to retrieve all the file paths?

Also, for each of my thousands of files, a getBlockLocation RPC call was generated. I added a getBlockLocations[] method that accepts an array of paths and returns a String[][][] instead, which is much more efficient than generating thousands of RPC calls when calling getBlockLocation in the MultiFileSplit class.
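
To make the reading scheme concrete, here is a stripped-down sketch of the producer/consumer idea in plain Java. It is only an illustration, not the actual RecordReader, and it deliberately does not use the Hadoop API: the class name BoundedBufferReader, the method readAll, and the line-per-record assumption are all made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: several producer threads read files in parallel and
// hand their lines to a single consumer through a bounded buffer.
class BoundedBufferReader {
    // End-of-stream marker; a distinct object that is compared by identity,
    // so a file line containing the text "END" is not mistaken for it.
    private static final String END = new String("END");

    static List<String> readAll(List<Path> files, int threads, int capacity)
            throws InterruptedException {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(capacity);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger remaining = new AtomicInteger(files.size());

        for (Path file : files) {
            pool.execute(() -> {
                try {
                    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                        buffer.put(line); // blocks while the buffer is full
                    }
                } catch (IOException e) {
                    // Sketch only: an unreadable file is silently skipped.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    // The last producer to finish signals end-of-stream.
                    if (remaining.decrementAndGet() == 0) {
                        try {
                            buffer.put(END);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                }
            });
        }
        pool.shutdown(); // already submitted tasks still run to completion

        List<String> records = new ArrayList<>();
        if (files.isEmpty()) {
            return records; // no producers, so no END marker will arrive
        }
        // Consumer side: drain the buffer until the END marker shows up.
        for (String r = buffer.take(); r != END; r = buffer.take()) {
            records.add(r);
        }
        return records;
    }
}
```

Records from different files arrive interleaved, so this only fits jobs where the order across files does not matter. The bounded capacity keeps fast readers from filling up memory when the consumer (the map() side) is slower.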
Any thoughts/comments are much appreciated!
Thanks in advance!

Cu on the 'net,
                      Bye - bye,

                                 <<<<< André <<<< >>>> èrbnA >>>>>

Enis Soztutar wrote:
Hi,

I think you should try using MultiFileInputFormat/MultiFileInputSplit rather than FileSplit, since the former is optimized for processing large numbers of files. Could you report your numMaps and numReduces, and the average time the map() function is expected to take?
