[ 
https://issues.apache.org/jira/browse/HADOOP-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaas Bosteels updated HADOOP-4304:
-----------------------------------

    Attachment: hadoop-4304.patch

The comments from Doug and Owen are addressed in the attached patch. An example 
of a Dumbo script that uses Owen's suggestion can be found here 
(oowordcount.py):

http://github.com/klbostee/dumbo/wikis/example-programs


BTW: The main reason we used Streaming instead of SWIG and the Pipes C++ API is 
that the ability to run scripts locally using UNIX pipes is very valuable in 
practice, mostly because it makes debugging very easy. Speed was not our main 
concern (we don't mind sacrificing some speed in favour of programmer 
productivity), and non-text input can be handled with special InputFormats 
that convert it to text first.
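
To illustrate the local-debugging point, here is a minimal sketch of a plain 
Streaming-style word count. It does not use Dumbo's API; the file name 
(wc.py), the function names, and the "map"/"reduce" argument convention are 
all assumptions made for this example. It relies only on the usual Streaming 
contract of tab-separated key/value lines on stdin and stdout.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda rec: rec[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


# Hypothetical CLI convention: the first argument selects the phase.
# Nothing runs unless a phase is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

With a script like this, the whole job can be tried out without a cluster, 
with sort standing in for the shuffle phase:

    cat input.txt | python wc.py map | sort | python wc.py reduce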

> Add Dumbo to contrib
> --------------------
>
>                 Key: HADOOP-4304
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4304
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Klaas Bosteels
>            Priority: Minor
>         Attachments: hadoop-4304.patch
>
>
> Originally, Dumbo was a simple Python module developed at Last.fm to make 
> writing and running Hadoop Streaming programs very easy, but now it also 
> consists of some (up till now unreleased) helper code in Java (although it 
> can still be used without the Java code). We propose to add Dumbo to 
> "src/contrib" so that the Java classes get built/installed together with 
> the rest of Hadoop, and the Python module can be installed separately at 
> will. A tar.gz of the directory that would have to be added to "src/contrib" 
> is available at
> http://static.last.fm/dumbo/dumbo-contrib.tar.gz
> and more info about Dumbo can be found here:
> * Basic documentation: http://github.com/klbostee/dumbo/wikis
> * Presentation at HUG (where it was first suggested to add Dumbo to contrib): 
> http://skillsmatter.com/podcast/home/dumbo-hadoop-streaming-made-elegant-and-easy
> * Initial announcement: 
> http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant
> For some of the more advanced features of Dumbo (in particular the ones for 
> which the Java classes are needed) there is no public documentation yet, but 
> we could easily fill that gap by moving some of the internal Last.fm 
> documentation to the Hadoop wiki.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
