Gus Wirth wrote:
> Normally I am loath to link stuff from Slashdot, but the article on MapReduce <http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html> caught my attention because I had just read an article in "Communications of the ACM" January 2008, pages 107-113 by Jeffrey Dean and Sanjay Ghemawat on just this subject.
I agree. It reads like the desperate ramblings of RDBMS guys trying to hold onto slipping mindshare.
The problem is that there are good points to be made against MapReduce, and they miss them all.
MapReduce has few guarantees. It gets *most* of the records *most* of the time, *normally* with acceptable performance. MapReduce works very well over data which has an unknown structure, or when you are hunting through a known structure in a new way. Thus, MapReduce is good for data mining. MapReduce scales *extremely* well.
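For anyone who hasn't looked at the model, here is a minimal single-process sketch of the map / group-by-key / reduce shape in Python. This is only an illustration, not Google's implementation: the names map_reduce, word_mapper, and count_reducer are made up for this example, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy in-memory map/reduce (hypothetical helper, not a real library API)."""
    groups = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # "Shuffle": collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # The mapper needs no schema; it just emits (word, 1) for every token.
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    # Sum the per-word counts emitted by the mapper.
    return sum(values)


lines = ["MapReduce scales well", "MapReduce eats bandwidth", "scales"]
counts = map_reduce(lines, word_mapper, count_reducer)
print(counts)
```

The mapper only needs to know how to pull keys out of one record at a time, which is why the model suits data with loose or unknown structure. In a real cluster the grouping step is where the inter-node traffic comes from.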
The moment you need a guarantee, MapReduce falls over. There is no guarantee MapReduce will find a particular record. There is no guarantee MapReduce will not *lose* a record. There is no guarantee MapReduce will return in a reasonable time. And MapReduce eats bandwidth and storage for breakfast, lunch, and dinner.
I'm betting this is one of the reasons Gmail sucked for so long. They probably threw the mail store into the MapReduce cluster. Well, that's nice, but they probably needed to replicate it *way* too much to meet end-user guarantees. I'm betting that Gmail is now off the MapReduce cluster for functionality and just copies mail messages into the MapReduce cluster for searching and mining.
What I would really like to see is MapReduce folded into a peer-to-peer system like BitTorrent. The problem there is inter-node bandwidth.
-a
