Hi,

I have just found an interesting video called "Scalability and Efficiency on Data Mining Applied to Internet Applications". The link is: http://video.google.com/videoplay?docid=2980110657131275963
It touches on the MapReduce paradigm, and a large portion of the presentation is devoted to the classical data mining task of Frequent Itemset Mining (experimental results for other tasks are presented as well).

If I understood correctly, one of the main points of this presentation is that the current MapReduce is great for stateless computations, but it can be less effective when a stateful approach is needed. For their needs they created MapReduce-derived implementations where each Reduce phase can store its results and other metadata in an external repository, so that other tasks can learn about them very quickly (a subsequent Map task can then start earlier if it already has all the information it needs, and does not have to wait until the whole Map phase finishes).

Would this be possible in the current Hadoop implementation? Or would such a modification go far beyond the current Hadoop architectural concepts? (I noticed that the question from the audience at the end of the presentation was about node failures, so maybe even the big guys at Google haven't been using this approach yet.)

Regards,
Lukas
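P.S. To make it concrete what I mean, here is a minimal sketch of how I imagine the Reduce side could look in today's Hadoop API. The ExternalRepository class and the "repo.address" configuration key are purely hypothetical placeholders; a real implementation would need some shared store behind them (HDFS files, HBase, a database, ...):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Sketch of a Reducer that, besides emitting its normal output, publishes
 * every finished result to an external repository, so that tasks of a later
 * job could pick results up without waiting for the whole phase to finish.
 */
public class PublishingReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private ExternalRepository repo;

    @Override
    protected void setup(Context context) {
        // "repo.address" is a made-up configuration key.
        repo = ExternalRepository.connect(
                context.getConfiguration().get("repo.address"));
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
                          Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        // The normal MapReduce output path stays untouched.
        context.write(key, new LongWritable(sum));
        // Additionally publish the result right away, so another job's Map
        // tasks could already see it while this Reduce phase is still running.
        repo.put(key.toString(), sum);
    }

    @Override
    protected void cleanup(Context context) {
        repo.close();
    }

    /** Hypothetical external store client; not part of Hadoop. */
    static class ExternalRepository {
        static ExternalRepository connect(String address) {
            return new ExternalRepository(); // stub: would open a real connection
        }
        void put(String key, long value) { /* stub: write to the shared store */ }
        void close() { /* stub: release the connection */ }
    }
}

I suppose the audience question about node failures is exactly where this gets hard: if a Reduce task dies after publishing some keys and is re-executed, the repository would see those keys twice, so the writes would probably have to be idempotent.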
