Hi Ted & Hadoopers,
I was thinking of a similar solution/optimization, but I have the following problem: we have a large distributed system that consists of several spider/crawler nodes - pretty much like a web crawler system - and every node writes its gathered data directly to the DFS. So there is no real possibility of bundling the data while it is written to the DFS, since two spiders may write data for the same logical unit concurrently - if the DFS supported synchronized append writes, it would make our lives a little bit easier. However, our files are still organized in thousands of directories in a pretty large directory tree, since I only need certain branches for a mapred operation in order to do some data mining...
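
For illustration, picking out only certain branches of such a tree as job input can be done with glob patterns on the input paths. This is just a minimal sketch against the classic mapred API; the paths, class name, and job setup are made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Sketch: run a MapReduce job over selected branches of the crawl
// directory tree only. All paths here are illustrative.
public class BranchJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(BranchJob.class);
    conf.setJobName("mine-selected-branches");

    // Glob patterns select just the branches of interest, e.g. what the
    // spiders wrote for two sites in 2008; everything else is ignored.
    FileInputFormat.addInputPath(conf, new Path("/crawl/2008/*/siteA/*"));
    FileInputFormat.addInputPath(conf, new Path("/crawl/2008/*/siteB/*"));

    FileOutputFormat.setOutputPath(conf, new Path("/mining/output"));

    // Mapper/reducer classes, key/value types, etc. would be set here.
    JobClient.runJob(conf);
  }
}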

Cu on the 'net,
                      Bye - bye,

                                 <<<<< André <<<< >>>> èrbnA >>>>>

Ted Dunning wrote:
If your larger run is typical of your smaller run then you have lots and
lots of small files.  This is going to make things slow even without the
overhead of a distributed computation.

In the sequential case, enumerating the files and the inefficient read pattern
will be what slows you down.  The inefficient reads come about because the
disk has to seek every 100KB of input.  That is bad.

In the hadoop case, things are worse because opening a file takes much
longer than with local files.

The solution is for you to package your data more efficiently.  This fixes a
multitude of ills.  If you don't mind limiting your available parallelism a
little bit, you could even use tar files (tar isn't usually recommended
because you can't split a tar file across maps).
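
For concreteness, one common way to do this kind of bundling (an alternative to tar, not necessarily what Ted has in mind, and it stays splittable across maps) is to pack the small files into a SequenceFile with the file name as key and the raw bytes as value. A rough sketch against the classic Hadoop API; the class name and argument handling are made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: bundle every small file directly under args[0] into one
// SequenceFile at args[1]; key = file name, value = raw file contents.
public class Bundler {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);   // directory full of small files
    Path bundle = new Path(args[1]);   // output SequenceFile

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, bundle, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(srcDir)) {
        if (stat.isDir()) continue;    // subdirectories skipped in this sketch
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(buf);
        } finally {
          in.close();
        }
        writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

A map over the resulting bundle then reads many records from one large, mostly contiguous file instead of opening thousands of tiny ones.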

If you were to package 1000 files per bundle, you would get average file
sizes of 100MB instead of 100KB and your file opening overhead in the
parallel case would be decreased by 1000x.  Your disk read speed would be
much higher as well because your disks would mostly be reading contiguous
sectors.

I have a system similar to yours with lots and lots of little files (littler
than yours even).  With aggressive file bundling I can routinely process
data at a sustained rate of 100MB/s on ten really crummy storage/compute
nodes.  Moreover, that rate is probably not even bounded by I/O since my
data takes a fair bit of CPU to decrypt and parse.

