Hi,

I have just found an interesting video called "Scalability and Efficiency on Data Mining Applied to Internet Applications". The link is: http://video.google.com/videoplay?docid=2980110657131275963
It touches on the MapReduce paradigm, and a large portion of the presentation is devoted to the classical data mining task of Frequent Itemset Mining (experimental results for other tasks are presented as well).

If I understood correctly, one of the main points of this presentation is that the current MapReduce is great for stateless computations, but it can be less effective when a stateful approach is needed. For their needs they created MapReduce-derived implementations where each Reduce phase can store its results and other metadata in an external repository, so that other tasks can learn about them very quickly (a subsequent Map task can then start earlier if it already has all the information it needs, and does not have to wait until the whole Map phase finishes).

Would this be possible in the current Hadoop implementation? Or would such a modification go far beyond the current Hadoop architectural concepts? (I noticed that the question from the audience at the end of the presentation was about node failures, so maybe even the big guys at Google haven't been using this approach yet.)

Regards,
Lukas
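P.S. To make it concrete what I mean, here is a minimal sketch of how I imagine the Reduce side could look in today's Hadoop API. The ExternalRepository class and the "repo.address" configuration key are purely hypothetical placeholders; a real implementation would need some shared store behind them (HDFS files, HBase, a database, ...):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Sketch of a Reducer that, besides emitting its normal output, publishes
 * every finished result to an external repository, so that tasks of a later
 * job could pick results up without waiting for the whole phase to finish.
 */
public class PublishingReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private ExternalRepository repo;

    @Override
    protected void setup(Context context) {
        // "repo.address" is a made-up configuration key.
        repo = ExternalRepository.connect(
                context.getConfiguration().get("repo.address"));
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
                          Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        // The normal MapReduce output path stays untouched.
        context.write(key, new LongWritable(sum));
        // Additionally publish the result right away, so another job's Map
        // tasks could already see it while this Reduce phase is still running.
        repo.put(key.toString(), sum);
    }

    @Override
    protected void cleanup(Context context) {
        repo.close();
    }

    /** Hypothetical external store client; not part of Hadoop. */
    static class ExternalRepository {
        static ExternalRepository connect(String address) {
            return new ExternalRepository(); // stub: would open a real connection
        }
        void put(String key, long value) { /* stub: write to the shared store */ }
        void close() { /* stub: release the connection */ }
    }
}

I suppose the audience question about node failures is exactly where this gets hard: if a Reduce task dies after publishing some keys and is re-executed, the repository would see those keys twice, so the writes would probably have to be idempotent.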
