An observation... this whole thread is about limits caused by type
safety. Interestingly, the other implementation of map-reduce does
not support types at all. Everything is a string.
So I agree that our departure from the paper is the problem. ;-)
I'm comfortable letting this lie for a while. But I predict we've
not heard the last of it.
On Apr 2, 2006, at 10:29 PM, Doug Cutting wrote:
Runping Qi wrote:
The argument for using local combiners is interesting. To me, the
combiner class is just another layer of transformer. It does not
mean that the combiner class has to be the same as the reducer
class. The only criterion is that they satisfy the associativity
rule: let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of
S; then Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln)))
and Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are
the same.
A special (maybe very common) scenario is that the combiner and the
reducer are the same class and the reduce function is associative.
However, this need not be the case in general. And the class of the
reducer's outputs need not be the same as that of the combiner's,
if the combiner and the reducer are not the same class.
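(A minimal sketch of such an asymmetric combiner/reducer pair, for
illustration only: computing a mean. This is plain Java rather than
the Hadoop API, and the SumCount and MeanExample names here are
hypothetical.)

    import java.util.List;

    // Partial aggregate produced by the combiner; note it is a
    // different type than the reducer's final output.
    class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    class MeanExample {
        // Combiner: collapses one partition of raw values into a
        // partial (sum, count) aggregate.
        static SumCount combine(List<Long> partition) {
            long sum = 0;
            for (long v : partition) sum += v;
            return new SumCount(sum, partition.size());
        }

        // Reducer: merges the partial aggregates and emits a value
        // (a double mean) of a different class than the combiner's.
        static double reduce(List<SumCount> partials) {
            long sum = 0, count = 0;
            for (SumCount p : partials) {
                sum += p.sum;
                count += p.count;
            }
            return (double) sum / count;
        }
    }

Because addition is associative, reduce over the combined partitions
L1, ..., Ln yields the same mean as over K1, ..., Km, which is
exactly the rule stated above, even though the combiner and the
reducer are different classes with different output types.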
This indeed may be an intriguing generalization of the MapReduce
model. But it does add more possible failure modes. At present we
have far too few unit tests for the existing, simpler MapReduce
model, and the platform is still shaky. Thus I am reluctant to
spend a lot of time extending the model in ways that are not
absolutely essential.
My goal is for Hadoop to be widely used. I do not feel that the
power of the MapReduce model is currently a primary bottleneck to
wider adoption. The larger issues we face are performance,
reliability, scalability and documentation.
If I am to commit a patch, then I must feel that I can support and
maintain it, and that it fits within my priorities. Otherwise, if
it causes problems that I don't have time to attend to (even if
this only means reviewing and testing fixes submitted by others),
then the quality of the system will decrease, a direction we must
avoid.
Currently we have just four committers on Hadoop. For Mike and
Andrzej, Hadoop is a secondary effort. Owen has been voted in as a
Hadoop committer, but his paperwork is not yet complete. So I am
the bottleneck. I spend a lot of time on annoying yet critical
issues like making sure that recent extensions to Hadoop don't
break Nutch running in pseudo-distributed mode on Windows.
I don't particularly like things this way, but that's where we are
right now. The best way to get out of here is for folks who'd like
to be committers to submit high-quality, well-documented,
well-formatted, non-disruptive, unit-test-bearing patches that are
easy
for me to apply and make Hadoop easier to use and more reliable,
thus earning points towards becoming committers. If we have more
committers then we should be able to advance with confidence on
more fronts in parallel.
Doug