Runping Qi wrote:
The argument for using local combiners is interesting. To me, the combiner
class is just another layer of transformer. It does not mean that the combiner
class has to be the same as the reducer class. The only criterion is that
they satisfy the following associativity rule:
Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S. Then
Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
same.
A special (and probably very common) scenario is that the combiner and reducer
are the same class and the reduce function is associative. However, this need
not be the case in general. And the output class of the reducer need not be
the same as that of the combiner, if the combiner and the reducer are not the
same class.
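
As a purely illustrative sketch (not from this thread) of a combiner that is
not the reducer class, consider computing a per-key mean. The class names
below (MeanCombiner, MeanReducer) are hypothetical, and the code is written
against the org.apache.hadoop.mapred Reducer/OutputCollector API, assuming the
mapper emits Text values of the form "x,1". The combiner folds values into
partial "sum,count" records of type Text, while the reducer emits a
DoubleWritable mean, so the reduce output class differs from the combiner's:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** Combiner: folds a key's values into one partial "sum,count" record. */
class MeanCombiner extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      sum += Double.parseDouble(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    // Emit a partial record; its class (Text) matches the map output value
    // class, so combiner output can feed the reducer (or another combiner).
    output.collect(key, new Text(sum + "," + count));
  }
}

/** Reducer: merges partial records and emits the final per-key mean.
    Its output value class (DoubleWritable) differs from the combiner's. */
class MeanReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, DoubleWritable> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, DoubleWritable> output,
                     Reporter reporter) throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      sum += Double.parseDouble(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    output.collect(key, new DoubleWritable(sum / count));
  }
}

In the JobConf one would then call setCombinerClass(MeanCombiner.class) and
setReducerClass(MeanReducer.class); the rule above holds because adding up
partial sums and counts is associative, however the input is partitioned.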
This may indeed be an intriguing generalization of the MapReduce
model. But it does add more possible failure modes. At present we have
far too few unit tests for the existing, simpler MapReduce model, and
the platform is still shaky. Thus I am reluctant to spend a lot of effort
extending the model in ways that are not absolutely essential.
My goal is for Hadoop to be widely used. I do not feel that the power
of the MapReduce model is currently a primary bottleneck to wider
adoption. The larger issues we face are performance, reliability,
scalability and documentation.
If I am to commit a patch, then I must feel that I can support and
maintain it, and that it fits within my priorities. Otherwise, if it causes
problems that I don't have time to attend to (even if this only means
reviewing and testing fixes submitted by others), then the quality of the
system will decrease, a direction we must avoid.
Currently we have just four committers on Hadoop. For Mike and Andrzej,
Nutch is a secondary effort. Owen has been voted in as a Hadoop
committer, but his paperwork is not yet complete. So I am the
bottleneck. I spend a lot of time on annoying yet critical issues like
making sure that recent extensions to Hadoop don't break Nutch running
in pseudo-distributed mode on Windows.
I don't particularly like things this way, but that's where we are right
now. The best way to get out of this is for folks who'd like to be
committers to submit high-quality, well-documented, well-formatted,
non-disruptive, unit-test-bearing patches that are easy for me to apply
and that make Hadoop easier to use and more reliable, thus earning points
towards becoming committers. If we have more committers then we should
be able to advance with confidence on more fronts in parallel.
Doug