Normally I am loath to link stuff from Slashdot, but the article on MapReduce <http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html> caught my attention because I had just read an article on the very same subject in "Communications of the ACM", January 2008, pages 107-113, by Jeffrey Dean and Sanjay Ghemawat.

The linked article contains several cleverly hidden fallacies. One that I see is the appeal to indexes to support queries. An index can certainly help with repeated queries against the same dataset, but the authors neglect to say that building the index takes time and space, and for a one-off query on an ad-hoc dataset it gains you nothing over a single pass through the data, since building the index requires that full pass anyway.
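To make that concrete, here is a toy sketch in Python (my own illustration, not code from either article): answering a one-off query by direct scan takes one pass, and building even the simplest inverted index takes that same pass, plus storage, before it can answer anything.

    # Toy illustration: one-off scan vs. building an index first.
    # Both must touch every record once, so for a single ad-hoc query
    # the index saves nothing; it only pays off on repeated queries.
    from collections import defaultdict

    records = ["the quick fox", "lazy dog", "quick brown dog"]

    def scan_query(records, term):
        # Direct scan: one pass, answers the query immediately.
        return [i for i, rec in enumerate(records) if term in rec.split()]

    def build_index(records):
        # Index build: also one full pass, before any query is answered.
        index = defaultdict(list)
        for i, rec in enumerate(records):
            for term in rec.split():
                index[term].append(i)
        return index

    print(scan_query(records, "quick"))   # [0, 2] after one pass
    index = build_index(records)          # one pass, plus index storage
    print(index["quick"])                 # [0, 2], fast only *after* that pass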

Perhaps the biggest problem with the linked article is its total silence on the question of what you do in the event of failure. The ACM article addresses this specifically, showing how MapReduce recovers both from failed nodes and from merely slow ones ("stragglers"), for which the master schedules backup executions of the same task.
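Roughly what that looks like, as a toy sketch of my own (the names and scheduling policy here are my invention; Google's real master is far more involved): the master hands out tasks, re-runs any attempt whose worker dies, and takes whichever attempt finishes.

    import random

    def attempt(task_id):
        # One simulated worker attempt: returns output, or None if the worker dies.
        return None if random.random() < 0.2 else "output-%d" % task_id

    pending = list(range(8))   # map task ids still needing a result
    results = {}

    while pending:
        task_id = pending.pop(0)
        first = attempt(task_id)
        if first is not None:
            results[task_id] = first
            continue
        # Worker failed: launch a backup attempt on another worker,
        # and re-queue the task if that one dies too.  (The real master
        # also launches backups for tasks that are merely slow near the
        # end of a job, not just for dead workers.)
        backup = attempt(task_id)
        if backup is not None:
            results[task_id] = backup
        else:
            pending.append(task_id)

    print(sorted(results.items()))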

As an example of what MapReduce at Google (where the ACM article's authors work) can do, the article reports a distributed grep over 10^10 (10 billion) 100-byte records, looking for a relatively rare three-character pattern. The whole job is farmed out to 1800 machines and completes in 180 seconds from start to finish. They have some nice little graphs to go along with it.
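For scale, a quick back-of-envelope calculation (my arithmetic, not a figure from the article): that input is a full terabyte, so the cluster is sweeping several gigabytes per second in aggregate.

    # Back-of-envelope scale check for the grep example above.
    records = 10**10        # 10 billion records
    record_bytes = 100
    machines = 1800
    seconds = 180

    total_bytes = records * record_bytes        # 1e12 bytes = 1 TB
    aggregate = total_bytes / seconds           # ~5.6 GB/s across the cluster
    per_machine = aggregate / machines          # ~3.1 MB/s per machine

    print("total input:  %.1f TB" % (total_bytes / 1e12))
    print("aggregate:    %.2f GB/s" % (aggregate / 1e9))
    print("per machine:  %.2f MB/s" % (per_machine / 1e6))

The per-machine rate looks modest, but grepping a terabyte end to end in three minutes, with failure handling thrown in for free, is exactly the point.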

Gus


