Hi,

Is there any function/script to do that in Hadoop? Thanks.

On 11/4/07, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> If your larger run is typical of your smaller run, then you have lots and
> lots of small files. This is going to make things slow even without the
> overhead of a distributed computation.
>
> In the sequential case, enumerating the files and inefficient read
> patterns will be what slows you down. The inefficient reads come about
> because the disk has to seek every 100KB of input. That is bad.
>
> In the Hadoop case, things are worse because opening a file takes much
> longer than with local files.
>
> The solution is for you to package your data more efficiently. This fixes
> a multitude of ills. If you don't mind limiting your available parallelism
> a little bit, you could even use tar files (tar isn't usually recommended
> because you can't split a tar file across maps).
>
> If you were to package 1000 files per bundle, you would get average file
> sizes of 100MB instead of 100KB, and your file-opening overhead in the
> parallel case would be decreased by 1000x. Your disk read speed would be
> much higher as well because your disks would mostly be reading contiguous
> sectors.
>
> I have a system similar to yours with lots and lots of little files
> (littler than yours, even). With aggressive file bundling I can routinely
> process data at a sustained rate of 100MB/s on ten really crummy
> storage/compute nodes. Moreover, that rate is probably not even bounded by
> I/O, since my data takes a fair bit of CPU to decrypt and parse.
>
>
> On 11/4/07 4:02 PM, "André Martin" <[EMAIL PROTECTED]> wrote:
>
> > Hi Enis & Hadoopers,
> > thanks for the hint. I created/modified my RecordReader so that it uses
> > MultiFileInputSplit and reads 30 files at once (by spawning several
> > threads and using a bounded buffer à la producer/consumer). The
> > accumulated throughput is now about 1MB/s on my 30 MB test data (spread
> > over 300 files).
> > However, I noticed some other bottlenecks during job submission - a job
> > submission of 53,000 files spread over 18,150 folders takes about 1 hr
> > and 45 mins.
> > Since all the files are spread over several thousand directories,
> > listing/traversing those directories using the listPaths/globPaths
> > methods generates several thousand RPC calls. I think it would be more
> > efficient to send the regex/path expression (the parameters) of the
> > globPaths method to the server and traverse the directory tree on the
> > server side instead of the client side - or is there another way to
> > retrieve all the file paths?
> > Also, for each of my thousand files, a getBlockLocation RPC call is/was
> > generated - I implemented/added a getBlockLocations[] method that
> > accepts an array of paths etc. and returns a String[][][] matrix
> > instead, which is much more efficient than generating thousands of RPC
> > calls when calling getBlockLocation in the MultiFileSplit class...
> > Any thoughts/comments are much appreciated!
> > Thanks in advance!
> >
> > Cu on the 'net,
> >                       Bye - bye,
> >
> >                                 <<<<< André <<<< >>>> èrbnA >>>>>
> >
> > Enis Soztutar wrote:
> >> Hi,
> >>
> >> I think you should try using MultiFileInputFormat/MultiFileInputSplit
> >> rather than FileSplit, since the former is optimized for processing a
> >> large number of files. Could you report your numMaps and numReduces and
> >> the average time the map() function is expected to take?
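
(For reference, a minimal sketch of the bundling Ted describes - packing many
small local files into one SequenceFile keyed by filename - might look like
the code below. There is no stock Hadoop script for this; the class name,
argument handling, and paths here are purely illustrative.)

    // Hypothetical bundler: packs every small file in a local directory
    // (args[0]) into one SequenceFile (args[1]) as filename -> raw bytes,
    // so Hadoop reads one large file instead of thousands of tiny ones.
    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFileBundler {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
            try {
                for (File f : new File(args[0]).listFiles()) {
                    if (!f.isFile()) continue;          // skip subdirectories
                    byte[] buf = new byte[(int) f.length()];
                    FileInputStream in = new FileInputStream(f);
                    try {
                        int off = 0;
                        while (off < buf.length) {       // read the whole small file
                            off += in.read(buf, off, buf.length - off);
                        }
                    } finally {
                        in.close();
                    }
                    // key = original file name, value = the file's raw bytes
                    writer.append(new Text(f.getName()), new BytesWritable(buf));
                }
            } finally {
                writer.close();
            }
        }
    }

A map job can then read the bundle with SequenceFileInputFormat and still see
the original filename as the key, so the file-open cost is paid once per
bundle instead of once per 100KB file.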

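(Similarly, a whole-file-per-record reader along the lines of Enis's
MultiFileInputFormat/MultiFileSplit suggestion could be sketched roughly as
below. This is written against the later, genericized
org.apache.hadoop.mapred API, so exact signatures may differ from the
2007-era release; everything outside the Hadoop types is invented for
illustration.)

    // Hypothetical input format: one MultiFileSplit covers many small files,
    // and the reader emits each whole file as a single (path, bytes) record.
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MultiFileInputFormat;
    import org.apache.hadoop.mapred.MultiFileSplit;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class WholeFileMultiInputFormat
            extends MultiFileInputFormat<Text, BytesWritable> {

        public RecordReader<Text, BytesWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            return new WholeFileReader((MultiFileSplit) split, job);
        }

        static class WholeFileReader implements RecordReader<Text, BytesWritable> {
            private final MultiFileSplit split;
            private final FileSystem fs;
            private int index = 0;        // next file in the split to read
            private long bytesRead = 0;

            WholeFileReader(MultiFileSplit split, JobConf job) throws IOException {
                this.split = split;
                this.fs = FileSystem.get(job);
            }

            public boolean next(Text key, BytesWritable value) throws IOException {
                if (index >= split.getNumPaths()) {
                    return false;                        // no files left in this split
                }
                Path path = split.getPath(index);
                int len = (int) split.getLength(index);
                byte[] buf = new byte[len];
                FSDataInputStream in = fs.open(path);
                try {
                    in.readFully(0, buf);                // whole (small) file as one record
                } finally {
                    in.close();
                }
                key.set(path.toString());
                value.set(buf, 0, len);
                bytesRead += len;
                index++;
                return true;
            }

            public Text createKey() { return new Text(); }
            public BytesWritable createValue() { return new BytesWritable(); }
            public long getPos() throws IOException { return bytesRead; }
            public void close() throws IOException { }
            public float getProgress() throws IOException {
                long total = split.getLength();          // total bytes in this split
                return total == 0 ? 1.0f : bytesRead / (float) total;
            }
        }
    }

A JobConf would then just call setInputFormat(WholeFileMultiInputFormat.class),
and the number of map tasks controls how many small files are grouped into
each MultiFileSplit.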