An observation... this whole thread is about limits caused by type
safety. Interestingly, the other implementation of map-reduce does
not support types at all. Everything is a string.
So I agree that our departure from the paper is the problem. ;-)
I'm comfortable letting this lie for a while. But I predict we've
not heard the last of it.
On Apr 2, 2006, at 10:29 PM, Doug Cutting wrote:
Runping Qi wrote:
The argument for using local combiners is interesting. To me, the
combiner class is just another layer of transformer. It does not
mean that the combiner class has to be the same as the reducer
class. The only criterion is that they satisfy the associativity
rule: let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of
S; then Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln)))
and Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are
the same.
A special (maybe very common) scenario is that the combiner and the
reducer are the same class and the reduce function is associative.
However, this need not be the case in general. And the class of the
reducer's outputs need not be the same as that of the combiner's,
if the combiner and the reducer are not the same class.
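(A minimal sketch of such an asymmetric combiner/reducer pair, for
illustration only: computing a mean. This is plain Java rather than
the Hadoop API, and the SumCount and MeanExample names here are
hypothetical.)

    import java.util.List;

    // Partial aggregate produced by the combiner; note it is a
    // different type than the reducer's final output.
    class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    class MeanExample {
        // Combiner: collapses one partition of raw values into a
        // partial (sum, count) aggregate.
        static SumCount combine(List<Long> partition) {
            long sum = 0;
            for (long v : partition) sum += v;
            return new SumCount(sum, partition.size());
        }

        // Reducer: merges the partial aggregates and emits a value
        // (a double mean) of a different class than the combiner's.
        static double reduce(List<SumCount> partials) {
            long sum = 0, count = 0;
            for (SumCount p : partials) {
                sum += p.sum;
                count += p.count;
            }
            return (double) sum / count;
        }
    }

Because addition is associative, reduce over the combined partitions
L1, ..., Ln yields the same mean as over K1, ..., Km, which is
exactly the rule stated above, even though the combiner and the
reducer are different classes with different output types.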
This indeed may be an intriguing generalization of the MapReduce
model. But it does add more possible failure modes. At present we
have far too few unit tests for the existing, simpler MapReduce
model, and the platform is still shaky. Thus I am reluctant to
spend a lot of time extending the model in ways that are not
absolutely essential.
My goal is for Hadoop to be widely used. I do not feel that the
power of the MapReduce model is currently a primary bottleneck to
wider adoption. The larger issues we face are performance,
reliability, scalability and documentation.
If I am to commit a patch, then I must feel that I can support and
maintain it, and that it fits within my priorities. Otherwise, if
it causes problems that I don't have time to attend to (even if
this only means reviewing and testing fixes submitted by others),
then the quality of the system will decrease, a direction we must
avoid.
Currently we have just four committers on Hadoop. For Mike and
Andrzej, Hadoop is a secondary effort. Owen has been voted in as a
Hadoop committer, but his paperwork is not yet complete. So I am
the bottleneck. I spend a lot of time on annoying yet critical
issues like making sure that recent extensions to Hadoop don't
break Nutch running in pseudo-distributed mode on Windows.
I don't particularly like things this way, but that's where we are
right now. The best way to get out of here is for folks who'd like
to be committers to submit high-quality, well-documented,
well-formatted, non-disruptive, unit-test-bearing patches that are
easy
for me to apply and make Hadoop easier to use and more reliable,
thus earning points towards becoming committers. If we have more
committers then we should be able to advance with confidence on
more fronts in parallel.
Doug