Normally I am loath to link stuff from Slashdot, but the article on
MapReduce
<http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html>
caught my attention because I had just read an article on exactly this
subject by Jeffrey Dean and Sanjay Ghemawat in "Communications of the
ACM", January 2008, pages 107-113.
There are several fallacies in the linked article that are cleverly
hidden. One of them is its reliance on indexes to support queries. An
index can certainly help with repeated queries on the same dataset, but
the authors neglect to mention that building the index costs time and
space, and that for a one-off query on an ad-hoc dataset an index gives
you no advantage over a single pass through the data, since building
the index requires that same full pass anyway.
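Here is a toy Python sketch of that point (my own illustration, with a
made-up record list, not anything from either article). Both the direct
scan and the index build touch every record once, so for a single query
the index buys you nothing and costs extra space:

def scan_matches(records, pattern):
    # Single pass over the data: O(n), no extra space.
    return [r for r in records if pattern in r]

def build_index(records):
    # Also a full O(n) pass, plus O(n) extra space for the postings.
    index = {}
    for pos, record in enumerate(records):
        for token in record.split():
            index.setdefault(token, []).append(pos)
    return index

records = ["error disk7 timeout", "ok disk3", "error disk3 retry"]
print(scan_matches(records, "error"))    # one pass and we are done
idx = build_index(records)               # the same pass, paid up front
print([records[i] for i in idx["error"]])  # only now is lookup cheap

Only if you run many queries against the same dataset does the up-front
pass amortize out, which is exactly the repeated-query case above.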
Perhaps the biggest problem with the linked article is its total
silence on what you do in the event of failure. The ACM article
addresses this specifically and shows how MapReduce recovers from
failed nodes, and even from nodes that are merely slow.
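The trick for slow nodes is what the ACM article calls backup tasks:
near the end of a job, the master launches a duplicate of any
still-running task and takes whichever copy finishes first, so one
straggler cannot stall the whole job. A toy single-machine sketch of
the idea in Python (my own, not code from the article):

import concurrent.futures as cf
import random, time

def work(task_id):
    time.sleep(random.uniform(0.1, 0.3))   # a healthy node
    return task_id, "done"

def slow_work(task_id):
    time.sleep(5.0)                        # a straggler node
    return task_id, "done"

with cf.ThreadPoolExecutor(max_workers=4) as pool:
    primary = pool.submit(slow_work, 7)    # original attempt is slow
    backup = pool.submit(work, 7)          # speculative duplicate
    done, _ = cf.wait([primary, backup],
                      return_when=cf.FIRST_COMPLETED)
    print(done.pop().result())             # first finisher wins
    primary.cancel()  # best effort; in this toy the straggler thread
                      # still runs to completion in the background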
As an example of what MapReduce at Google (where the ACM article's
authors work) can do, the article describes running a grep over 10^10
(ten billion) 100-byte records, looking for a relatively rare
three-character pattern. The whole job is farmed out to 1800 nodes and
completes in 180 seconds from start to finish. They have some nice
little graphs to go along with it.
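The program itself is almost trivially small, which is much of the
point. A single-machine Python sketch of that grep (my own toy, not
Google's code; "xyz" stands in for the rare pattern): the map phase
emits every record containing the pattern, and the reduce phase is just
the identity.

def map_grep(record, pattern="xyz"):
    # Emit the record if it matches; emit nothing otherwise.
    if pattern in record:
        yield record

def reduce_identity(records):
    # Distributed grep needs no real reduction step.
    yield from records

# In the real system, map_grep runs in parallel on chunks of the 10^10
# records spread across the 1800 nodes; here we just chain the phases.
data = ["aaaa", "bxyzb", "cccc", "xyz-tail"]
mapped = (m for rec in data for m in map_grep(rec))
for out in reduce_identity(mapped):
    print(out)

All the hard parts, the partitioning, scheduling, and the failure
recovery discussed above, live in the framework rather than in the
twenty lines the programmer writes.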
Gus