Binary tree reduction

Martin Nilsson Tue, 12 Feb 2008 12:53:43 -0800


Hello,

I'm looking at a problem where we need to determine the number of unique entities for a specific period of time. As an example, consider that we log all outgoing URLs in a set of proxies. We can easily create a mapper that turns every log file or slice thereof into a sorted list of URLs. Unix sort scales very well, but we are only operating on output from at most one log file for a day anyway.

Now for reduction I would like to take two (or more?) of these files, simply containing a line-based list of sorted URLs, and merge them into a single file, removing any duplicates. This is a fast operation and takes constant memory, but requires that the complete files are operated on by the same reducer. Also the key-value paradigm doesn't apply.
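
To make the merge step concrete, here is a minimal sketch in Python of what I have in mind for two inputs. It is only an illustration, not anything Hadoop provides: the name merge_unique is made up, and it assumes both inputs are sorted in the same order that Python uses to compare strings (plain byte/code point order, as produced by sort under LC_ALL=C).

```python
from typing import Iterable, Iterator, Optional


def merge_unique(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Merge two sorted line sequences, yielding each distinct line once.

    Hypothetical helper: only the current line of each input and the
    last emitted line are held, so memory use is constant.
    """
    it_a, it_b = iter(a), iter(b)
    x: Optional[str] = next(it_a, None)
    y: Optional[str] = next(it_b, None)
    last: Optional[str] = None
    while x is not None or y is not None:
        # Take the smaller head; when one input is exhausted, drain the other.
        if y is None or (x is not None and x <= y):
            cur, x = x, next(it_a, None)
        else:
            cur, y = y, next(it_b, None)
        # Inputs are sorted, so duplicates are always adjacent in the output.
        if cur != last:
            yield cur
            last = cur
```

Since open file objects iterate line by line, the two arguments could be two open sorted files and the result could be written straight to a third, without loading either input into memory.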

The end product would be a big file with URLs for that day. When URLs for e.g. a week or a month are available, those should be merged into aggregates. I'm really only interested in the final row count, but I need to keep all the URLs to be able to add the statistics properly.
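
The aggregation over a week or a month could be sketched the same way, as a k-way merge instead of a pairwise one. This version leans on heapq.merge from the Python standard library; the function names are made up for illustration, and it again assumes every input is already sorted in Python's string order.

```python
import heapq
from typing import Iterable, Iterator, Optional


def merge_many_unique(sorted_inputs: Iterable[Iterable[str]]) -> Iterator[str]:
    """Merge any number of sorted line sequences, dropping duplicates.

    Hypothetical helper: heapq.merge keeps one pending line per input,
    so memory grows with the number of inputs, not with their size.
    """
    last: Optional[str] = None
    for line in heapq.merge(*sorted_inputs):
        if line != last:
            yield line
            last = line


def count_unique(sorted_inputs: Iterable[Iterable[str]]) -> int:
    """Return the number of distinct lines across all sorted inputs."""
    return sum(1 for _ in merge_many_unique(sorted_inputs))
```

Feeding seven daily files to merge_many_unique would give the weekly aggregate, and count_unique over the same inputs gives the row count I am after.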

Is what I've described readily available within Hadoop (I did some looking but didn't find anything)? If not, do you have any pointers for how to achieve this type of processing?


/Martin Nilsson
