[ https://issues.apache.org/jira/browse/HADOOP-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635515#action_12635515 ]
Owen O'Malley commented on HADOOP-4304:
---------------------------------------

I've only looked very casually at this, but:
  * I'd suggest that run should take either methods or classes for map and reduce. That would remove the need for the mapconf and reduceconf parameters that are doing initialization. Basically, I'd like to see:

{code}
class Tokenizer(Mapper):
    def __init__(self):
        # Load the excluded words once, when the mapper is constructed.
        with open("excludes.txt", "r") as excludes_file:
            self.excludes = set(line.strip() for line in excludes_file)

    def map(self, key, value, context):
        for word in value.split():
            if word not in self.excludes:
                yield word, 1

class Summer(Reducer):
    def reduce(self, key, values, context):
        yield key, sum(values)
{code}

Of course, I'd suggest keeping your current map and reduce methods as well, so that you could do either:

{code}
dumbo.run(Tokenizer, Summer, combiner=Summer)
# - or -
dumbo.run(my_map, my_reduce, combiner=my_reduce)
{code}

Personally, I'd prefer that you use Swig and the Pipes C++ interface instead of streaming, but I'm biased. *Smile* (It would, though, give you better performance and allow binary data to be processed.)

> Add Dumbo to contrib
> --------------------
>
>                 Key: HADOOP-4304
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4304
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Klaas Bosteels
>            Priority: Minor
>
> Originally, Dumbo was a simple Python module developed at Last.fm to make
> writing and running Hadoop Streaming programs very easy, but now it also
> consists of some (up till now unreleased) helper code in Java (although it
> can still be used without the Java code). We propose to add Dumbo to
> "src/contrib" such that the Java classes get built/installed together with
> the rest of Hadoop, and the Python module can be installed separately at
> will.
> A tar.gz of the directory that would have to be added to "src/contrib"
> is available at
>   http://static.last.fm/dumbo/dumbo-contrib.tar.gz
> and more info about Dumbo can be found here:
> * Basic documentation: http://github.com/klbostee/dumbo/wikis
> * Presentation at HUG (where it was first suggested to add Dumbo to contrib):
>   http://skillsmatter.com/podcast/home/dumbo-hadoop-streaming-made-elegant-and-easy
> * Initial announcement:
>   http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant
> For some of the more advanced features of Dumbo (in particular the ones for
> which the Java classes are needed) there is no public documentation yet, but
> we could easily fill that gap by moving some of the internal Last.fm
> documentation to the Hadoop wiki.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.