Hello,

I'm looking at a problem where we need to determine the number of unique entities for a specific period of time. As an example, consider that we log all outgoing URLs in a set of proxies. We can easily create a mapper that turns every log file or slice thereof into a sorted list of URLs. Unix sort scales very well, but we are only operating on output from at most one log file for a day anyway.

Now for reduction I would like to take two (or more?) of these files, simply containing a line based list of sorted URLs, and merge them into a single file, removing any duplicates. This is a fast operation and takes constant memory, but requires that the complete files are operated on by the same reducer. Also the key-value paradigm doesn't apply.

The end product would be a big file with URLs for that day. When URLs for e.g. a week or a month are available, those should be merged into aggregates. I'm really only interested in the final row count, but I need to keep all the URLs to be able to add the statistics properly.

Is what I've described readily available within Hadoop (I did some looking but didn't find anything)? If not, do you have any pointers for how to achieve this type of processing?

/Martin Nilsson

Reply via email to