Runping Qi wrote:
The argument for using local combiners is interesting. To me, the combiner
class is just another layer of transformer. It does not mean that the combiner
class has to be the same as the reducer class. The only criterion is that
they satisfy the following associativity rule:
Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S. Then
Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
same.
A special (and probably very common) scenario is that the combiner and reducer
are the same class and the reduce function is associative. However, this need
not be the case in general. And the output class of the reducer need not be
the same as that of the combiner, if the combiner and the reducer are not the
same class.
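
As a purely illustrative sketch (not from this thread) of a combiner that is
not the reducer class, consider computing a per-key mean. The class names
below (MeanCombiner, MeanReducer) are hypothetical, and the code is written
against the org.apache.hadoop.mapred Reducer/OutputCollector API, assuming the
mapper emits Text values of the form "x,1". The combiner folds values into
partial "sum,count" records of type Text, while the reducer emits a
DoubleWritable mean, so the reduce output class differs from the combiner's:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** Combiner: folds a key's values into one partial "sum,count" record. */
class MeanCombiner extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      sum += Double.parseDouble(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    // Emit a partial record; its class (Text) matches the map output value
    // class, so combiner output can feed the reducer (or another combiner).
    output.collect(key, new Text(sum + "," + count));
  }
}

/** Reducer: merges partial records and emits the final per-key mean.
    Its output value class (DoubleWritable) differs from the combiner's. */
class MeanReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, DoubleWritable> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, DoubleWritable> output,
                     Reporter reporter) throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      sum += Double.parseDouble(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    output.collect(key, new DoubleWritable(sum / count));
  }
}

In the JobConf one would then call setCombinerClass(MeanCombiner.class) and
setReducerClass(MeanReducer.class); the rule above holds because adding up
partial sums and counts is associative, however the input is partitioned.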
This may indeed be an intriguing generalization of the MapReduce
model. But it does add more possible failure modes. At present we have
far too few unit tests for the existing, simpler MapReduce model, and
the platform is still shaky. Thus I am reluctant to spend a lot of effort
extending the model in ways that are not absolutely essential.
My goal is for Hadoop to be widely used. I do not feel that the power
of the MapReduce model is currently a primary bottleneck to wider
adoption. The larger issues we face are performance, reliability,
scalability and documentation.
If I am to commit a patch, then I must feel that I can support and
maintain it, and that it fits within my priorities. Otherwise, if it causes
problems that I don't have time to attend to (even if this only means
reviewing and testing fixes submitted by others), then the quality of the
system will decrease, a direction we must avoid.
Currently we have just four committers on Hadoop. For Mike and Andrzej,
Nutch is a secondary effort. Owen has been voted in as a Hadoop
committer, but his paperwork is not yet complete. So I am the
bottleneck. I spend a lot of time on annoying yet critical issues like
making sure that recent extensions to Hadoop don't break Nutch running
in pseudo-distributed mode on Windows.
I don't particularly like things this way, but that's where we are right
now. The best way to get out of this is for folks who'd like to be
committers to submit high-quality, well-documented, well-formatted,
non-disruptive, unit-test-bearing patches that are easy for me to apply
and that make Hadoop easier to use and more reliable, thus earning points
towards becoming committers. If we have more committers then we should
be able to advance with confidence on more fronts in parallel.
Doug