Hi,

Is there any function/script to do that in Hadoop? Thanks.

On 11/4/07, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> If your larger run is typical of your smaller run, then you have lots and
> lots of small files. This is going to make things slow even without the
> overhead of a distributed computation.
>
> In the sequential case, enumerating the files and inefficient read
> patterns will be what slows you down. The inefficient reads come about
> because the disk has to seek every 100KB of input. That is bad.
>
> In the Hadoop case, things are worse because opening a file takes much
> longer than with local files.
>
> The solution is for you to package your data more efficiently. This fixes
> a multitude of ills. If you don't mind limiting your available parallelism
> a little bit, you could even use tar files (tar isn't usually recommended
> because you can't split a tar file across maps).
>
> If you were to package 1000 files per bundle, you would get average file
> sizes of 100MB instead of 100KB, and your file-opening overhead in the
> parallel case would be decreased by 1000x. Your disk read speed would be
> much higher as well because your disks would mostly be reading contiguous
> sectors.
>
> I have a system similar to yours with lots and lots of little files
> (littler than yours, even). With aggressive file bundling I can routinely
> process data at a sustained rate of 100MB/s on ten really crummy
> storage/compute nodes. Moreover, that rate is probably not even bounded by
> I/O, since my data takes a fair bit of CPU to decrypt and parse.
>
>
> On 11/4/07 4:02 PM, "André Martin" <[EMAIL PROTECTED]> wrote:
>
> > Hi Enis & Hadoopers,
> > thanks for the hint. I created/modified my RecordReader so that it uses
> > MultiFileInputSplit and reads 30 files at once (by spawning several
> > threads and using a bounded buffer à la producer/consumer). The
> > accumulated throughput is now about 1MB/s on my 30 MB test data (spread
> > over 300 files).
> > However, I noticed some other bottlenecks during job submission - a job
> > submission of 53,000 files spread over 18,150 folders takes about 1 hr
> > and 45 mins.
> > Since all the files are spread over several thousand directories,
> > listing/traversing those directories using the listPaths/globPaths
> > methods generates several thousand RPC calls. I think it would be more
> > efficient to send the regex/path expression (the parameters) of the
> > globPaths method to the server and traverse the directory tree on the
> > server side instead of the client side - or is there another way to
> > retrieve all the file paths?
> > Also, for each of my thousand files, a getBlockLocation RPC call is/was
> > generated - I implemented/added a getBlockLocations[] method that
> > accepts an array of paths etc. and returns a String[][][] matrix
> > instead, which is much more efficient than generating thousands of RPC
> > calls when calling getBlockLocation in the MultiFileSplit class...
> > Any thoughts/comments are much appreciated!
> > Thanks in advance!
> >
> > Cu on the 'net,
> >                       Bye - bye,
> >
> >                                 <<<<< André <<<< >>>> èrbnA >>>>>
> >
> > Enis Soztutar wrote:
> >> Hi,
> >>
> >> I think you should try using MultiFileInputFormat/MultiFileInputSplit
> >> rather than FileSplit, since the former is optimized for processing a
> >> large number of files. Could you report your numMaps and numReduces and
> >> the average time the map() function is expected to take?
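
(For reference, a minimal sketch of the bundling Ted describes - packing many
small local files into one SequenceFile keyed by filename - might look like
the code below. There is no stock Hadoop script for this; the class name,
argument handling, and paths here are purely illustrative.)

    // Hypothetical bundler: packs every small file in a local directory
    // (args[0]) into one SequenceFile (args[1]) as filename -> raw bytes,
    // so Hadoop reads one large file instead of thousands of tiny ones.
    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFileBundler {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
            try {
                for (File f : new File(args[0]).listFiles()) {
                    if (!f.isFile()) continue;          // skip subdirectories
                    byte[] buf = new byte[(int) f.length()];
                    FileInputStream in = new FileInputStream(f);
                    try {
                        int off = 0;
                        while (off < buf.length) {       // read the whole small file
                            off += in.read(buf, off, buf.length - off);
                        }
                    } finally {
                        in.close();
                    }
                    // key = original file name, value = the file's raw bytes
                    writer.append(new Text(f.getName()), new BytesWritable(buf));
                }
            } finally {
                writer.close();
            }
        }
    }

A map job can then read the bundle with SequenceFileInputFormat and still see
the original filename as the key, so the file-open cost is paid once per
bundle instead of once per 100KB file.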

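(Similarly, a whole-file-per-record reader along the lines of Enis's
MultiFileInputFormat/MultiFileSplit suggestion could be sketched roughly as
below. This is written against the later, genericized
org.apache.hadoop.mapred API, so exact signatures may differ from the
2007-era release; everything outside the Hadoop types is invented for
illustration.)

    // Hypothetical input format: one MultiFileSplit covers many small files,
    // and the reader emits each whole file as a single (path, bytes) record.
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MultiFileInputFormat;
    import org.apache.hadoop.mapred.MultiFileSplit;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class WholeFileMultiInputFormat
            extends MultiFileInputFormat<Text, BytesWritable> {

        public RecordReader<Text, BytesWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            return new WholeFileReader((MultiFileSplit) split, job);
        }

        static class WholeFileReader implements RecordReader<Text, BytesWritable> {
            private final MultiFileSplit split;
            private final FileSystem fs;
            private int index = 0;        // next file in the split to read
            private long bytesRead = 0;

            WholeFileReader(MultiFileSplit split, JobConf job) throws IOException {
                this.split = split;
                this.fs = FileSystem.get(job);
            }

            public boolean next(Text key, BytesWritable value) throws IOException {
                if (index >= split.getNumPaths()) {
                    return false;                        // no files left in this split
                }
                Path path = split.getPath(index);
                int len = (int) split.getLength(index);
                byte[] buf = new byte[len];
                FSDataInputStream in = fs.open(path);
                try {
                    in.readFully(0, buf);                // whole (small) file as one record
                } finally {
                    in.close();
                }
                key.set(path.toString());
                value.set(buf, 0, len);
                bytesRead += len;
                index++;
                return true;
            }

            public Text createKey() { return new Text(); }
            public BytesWritable createValue() { return new BytesWritable(); }
            public long getPos() throws IOException { return bytesRead; }
            public void close() throws IOException { }
            public float getProgress() throws IOException {
                long total = split.getLength();          // total bytes in this split
                return total == 0 ? 1.0f : bytesRead / (float) total;
            }
        }
    }

A JobConf would then just call setInputFormat(WholeFileMultiInputFormat.class),
and the number of map tasks controls how many small files are grouped into
each MultiFileSplit.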