André Martin wrote:
I was thinking of a similar solution/optimization, but I have the following problem: we have a large distributed system that consists of several spider/crawler nodes, much like a web crawler system, and every node writes its gathered data directly to the DFS. So there is no real possibility of bundling the data while it is written, since two spiders may write data for the same logical unit concurrently; if the DFS supported synchronized append writes, it would make our lives a little easier. However, our files are still organized in thousands of directories, a pretty large directory tree, since I need only certain branches for a mapred operation in order to do some data mining...

Instead of organizing output into many directories, you might consider using keys that encode the directory structure. MapReduce can then use those keys to partition output. If you wish to mine only a subset of your data, you can process just those partitions that contain the portions of the keyspace you're interested in.
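To illustrate, here is a minimal sketch of that idea using Hadoop's org.apache.hadoop.mapreduce API. It assumes a hypothetical key layout of the form "branch/sub/leaf\tunit-id", where the former directory path is encoded as a tab-delimited prefix of the key; the class name and key format are illustrative, not part of Hadoop itself.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys look like "branch/sub/leaf\tunit-id",
// i.e. the old directory path is encoded as a prefix of the key.
public class PathPrefixPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition on the encoded path prefix only, so all records
        // belonging to one logical branch land in the same partition.
        String k = key.toString();
        int sep = k.indexOf('\t');
        String prefix = (sep >= 0) ? k.substring(0, sep) : k;
        // Mask off the sign bit so negative hash codes don't yield
        // a negative partition index.
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

With a scheme like this, a later mining job can select just the partitions whose key prefixes match the branches it cares about, rather than walking a large directory tree.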

Doug
