[ https://issues.apache.org/jira/browse/HADOOP-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647488#action_12647488 ]

Klaas Bosteels commented on HADOOP-4304:
----------------------------------------

It actually is necessary that the Dumbo Python module gets installed in the 
system directory of the machine from which you run your jobs, because when you 
run e.g.

python wordcount.py -hadoop ~hadoop/hadoop-0.17.2.1 -input brian.txt -output 
brian-wc -file excludes.txt

the "dumbo.run()" call gets executed on that machine, which makes Dumbo 
generate and execute the Streaming command

/home/hadoop/hadoop-0.17.2.1/bin/hadoop jar 
/home/hadoop/hadoop-0.17.2.1/contrib/streaming/hadoop-0.17.2.1-streaming.jar 
-input 'brian.txt' -output 'brian-wc' -file 'excludes.txt' -mapper 'python 
wordcount.py map 0' -reducer 'python wordcount.py red 0' -file 'wordcount.py' 
-file '/usr/lib/python2.4/site-packages/dumbo.py' -jobconf 
'mapred.job.name=wordcount.py'
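Roughly speaking, the command assembly could be sketched as follows. This is a hypothetical illustration only, not Dumbo's actual internal code; the helper name "build_streaming_command" and its parameters are made up for this sketch:

```python
import os

def build_streaming_command(hadoop, version, opts):
    # Hypothetical sketch: join the hadoop binary, the Streaming jar,
    # and the option pairs into one shell command string.
    jar = os.path.join(hadoop, "contrib", "streaming",
                       "hadoop-%s-streaming.jar" % version)
    parts = [os.path.join(hadoop, "bin", "hadoop"), "jar", jar]
    for key, value in opts:
        parts.append("-%s '%s'" % (key, value))
    return " ".join(parts)

# The wordcount example from above, expressed as option pairs:
cmd = build_streaming_command(
    "/home/hadoop/hadoop-0.17.2.1", "0.17.2.1",
    [("input", "brian.txt"), ("output", "brian-wc"),
     ("file", "excludes.txt"),
     ("mapper", "python wordcount.py map 0"),
     ("reducer", "python wordcount.py red 0"),
     ("file", "wordcount.py"),
     ("file", "/usr/lib/python2.4/site-packages/dumbo.py"),
     ("jobconf", "mapred.job.name=wordcount.py")])
```

The point is just that every "-file" and "-mapper"/"-reducer" option in the generated command is derived from the arguments the user gave plus the location of the installed dumbo.py module.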

under the hood. The part 

-file '/usr/lib/python2.4/site-packages/dumbo.py'

of this automatically generated command makes sure that the Dumbo module from 
the system dir is put in the working dir on each cluster node (i.e. it is not 
necessary to install the Dumbo Python module in the system dir on the cluster 
nodes), and the extra args supplied to "python wordcount.py" in

-mapper 'python wordcount.py map 0'
-reducer 'python wordcount.py red 0' 

allow Dumbo to know that it has to run the actual map or reduce instead of 
generating and executing a Streaming command on the nodes. Hence, we need the 
ability to install Dumbo in the system dir in order to make it very easy to 
start jobs, which is one of the main features of Dumbo (more info about this 
can be found at http://github.com/klbostee/dumbo/wikis/running-programs).
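To make that dispatching concrete, here is a rough, hypothetical sketch of what a run() function along these lines could look like. It is not Dumbo's actual implementation (see the wiki linked above for the real API); the helper names are made up for this sketch:

```python
import sys
from itertools import groupby
from operator import itemgetter

def run_map(mapper, lines):
    # Apply the mapper to raw input lines, emitting tab-separated pairs.
    for line in lines:
        for key, value in mapper(None, line.rstrip("\n")):
            yield "%s\t%s" % (key, value)

def run_red(reducer, lines):
    # Group sorted tab-separated pairs by key and apply the reducer.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=itemgetter(0)):
        for k, v in reducer(key, (int(val) for _, val in group)):
            yield "%s\t%s" % (k, v)

def run(mapper, reducer):
    # Dispatch on the extra argument ("map" or "red") that gets appended
    # to the -mapper/-reducer commands; with no such argument we are on
    # the submitting machine, so the Streaming command would be generated
    # and executed instead (elided in this sketch).
    phase = sys.argv[1] if len(sys.argv) > 1 else None
    if phase == "map":
        sys.stdout.writelines(l + "\n" for l in run_map(mapper, sys.stdin))
    elif phase == "red":
        sys.stdout.writelines(l + "\n" for l in run_red(reducer, sys.stdin))
    else:
        raise NotImplementedError("generate and execute Streaming command")
```

So the same wordcount.py script plays two roles: submitted once without a phase argument, and re-invoked on each task with "map 0" or "red 0".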


I also do not think that the "classes" and "test" dirs are superfluous (but I 
might very well be wrong). As mentioned in the description of this ticket, 
Dumbo also consists of some helper Java code, e.g. a special input format that 
makes it easy to use sequence files as input for Dumbo programs (there is 
currently no public documentation available for these features, but we are 
planning to change that once this patch gets approved), and there are also 
some unit tests for this helper code.


Concerning the NFS mounting, I'm not really sure what you mean. Maybe that 
"excludes.txt" has to be in the cwd? This is handled by adding the option 
"-file excludes.txt" to the command (as in the command above), so there is no 
need for any NFS mounts.



> Add Dumbo to contrib
> --------------------
>
>                 Key: HADOOP-4304
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4304
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>            Priority: Minor
>         Attachments: hadoop-4304-v2.patch, hadoop-4304-v3.patch, 
> hadoop-4304.patch
>
>
> Originally, Dumbo was a simple Python module developed at Last.fm to make 
> writing and running Hadoop Streaming programs very easy, but now it also 
> consists of some (up till now unreleased) helper code in Java (although it 
> can still be used without the Java code). We propose to add Dumbo to 
> "src/contrib" such that the Java classes get built/installed together with 
> the rest of Hadoop, and the Python module can be installed separately at 
> will. A tar.gz of the directory that would have to be added to "src/contrib" 
> is available at
> http://static.last.fm/dumbo/dumbo-contrib.tar.gz
> and more info about Dumbo can be found here:
> * Basic documentation: http://github.com/klbostee/dumbo/wikis
> * Presentation at HUG (where it was first suggested to add Dumbo to contrib): 
> http://skillsmatter.com/podcast/home/dumbo-hadoop-streaming-made-elegant-and-easy
> * Initial announcement: 
> http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant
> For some of the more advanced features of Dumbo (in particular the ones for 
> which the Java classes are needed) there is no public documentation yet, but 
> we could easily fill that gap by moving some of the internal Last.fm 
> documentation to the Hadoop wiki.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
