André Martin wrote:
I was thinking of a similar solution/optimization, but I have the following problem: we have a large distributed system that consists of several spider/crawler nodes, much like a web crawler system, and every node writes its gathered data directly to the DFS. So there is no real possibility of bundling the data while it is written, since two spiders may write data for the same logical unit concurrently; if the DFS supported synchronized append writes, it would make our lives a little easier. However, our files are still organized in thousands of directories, a pretty large directory tree, since I need only certain branches for a mapred operation in order to do some data mining...

Instead of organizing output into many directories, you might consider using keys that encode the directory structure. MapReduce can then use those keys to partition output. If you wish to mine only a subset of your data, you can process just those partitions that contain the portions of the keyspace you're interested in.
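To illustrate, here is a minimal sketch of that idea using Hadoop's org.apache.hadoop.mapreduce API. It assumes a hypothetical key layout of the form "branch/sub/leaf\tunit-id", where the former directory path is encoded as a tab-delimited prefix of the key; the class name and key format are illustrative, not part of Hadoop itself.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys look like "branch/sub/leaf\tunit-id",
// i.e. the old directory path is encoded as a prefix of the key.
public class PathPrefixPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition on the encoded path prefix only, so all records
        // belonging to one logical branch land in the same partition.
        String k = key.toString();
        int sep = k.indexOf('\t');
        String prefix = (sep >= 0) ? k.substring(0, sep) : k;
        // Mask off the sign bit so negative hash codes don't yield
        // a negative partition index.
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

With a scheme like this, a later mining job can select just the partitions whose key prefixes match the branches it cares about, rather than walking a large directory tree.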

Doug
