Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "HadoopStreaming" page has been changed by WimDepoorter.
The comment on this change is: changed the location of streaming jar from 
"build/hadoop-streaming.jar" to 
"$HADOOP_HOME/mapred/contrib/streaming/hadoop-0.xx.y-streaming.jar".
http://wiki.apache.org/hadoop/HadoopStreaming?action=diff&rev1=11&rev2=12

--------------------------------------------------

  Hadoop Streaming is a utility which allows users to create and run jobs with 
any executables (e.g. shell utilities) as the mapper and/or the reducer.
  
  {{{
- 
- Usage: $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar [options]
+ Usage: $HADOOP_HOME/bin/hadoop jar 
$HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
  Options:
    -input    <path>                   DFS input file(s) for the Map step
    -output   <path>                   DFS output directory for the Reduce step
@@ -55, +54 @@

     -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
  
  Shortcut to run from any directory:
-    setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar 
$HADOOP_HOME/build/hadoop-streaming.jar"
+    setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar 
$HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar"
  
  Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
             -file /local/filter.pl -input "/logs/0604*/*" [...]
    Ships a script, invokes the non-shipped perl interpreter
    Shipped files go to the working directory so filter.pl is found by perl
    Input files are all the daily logs for days in month 2006-04
- 
  }}}
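  The setenv shortcut above is csh syntax.  A rough bash equivalent (assuming the 
same jar location as above; adjust the path for your Hadoop version) is shown 
below, after which $HSTREAMING can be used exactly as in the Example:
  
  {{{
  # bash equivalent of the csh setenv shortcut shown above
  export HSTREAMING="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar"
  }}}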
- 
- 
  == Practical Help ==
  Using the streaming system you can develop working Hadoop jobs with 
''extremely'' limited knowledge of Java.  At its simplest, your development task 
is to write two shell scripts that work well together; let's call them 
'''shellMapper.sh''' and '''shellReducer.sh'''.  On a machine that doesn't even 
have Hadoop installed, you can get first drafts of these working by writing them 
so that they run like this:
  
  {{{
  cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile
  }}}
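  For concreteness, here is one way such a pair of scripts might look for a 
simple word count.  The word-count logic is only an illustration (it is not part 
of Hadoop itself); any two commands that read stdin and write stdout will do:
  
  {{{
  #!/bin/sh
  # shellMapper.sh: emit one "word<TAB>1" record for every word read from stdin
  awk '{ for (i = 1; i <= NF; i++) print $i "\t" 1 }'
  }}}
  
  {{{
  #!/bin/sh
  # shellReducer.sh: sum up the counts for each distinct word arriving on stdin
  awk -F '\t' '{ count[$1] += $2 } END { for (w in count) print w "\t" count[w] }'
  }}}
  
  Because this reducer accumulates counts in memory rather than relying on sorted 
input, the same two files work both in the local pipeline above and, unchanged, 
under streaming.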
- 
  With streaming, Hadoop essentially becomes a system for making shell-script 
pipelines work (with some fudging) on a cluster.  There is a strong logical 
correspondence between the Unix shell-scripting environment and Hadoop streaming 
jobs.  The Hadoop version of the above example has somewhat less elegant syntax, 
but this is what it looks like:
  
  {{{
- stream -input /dfsInputDir/someInputData -file shellMapper.sh -mapper 
"shellMapper.sh" -file shellReducer.sh  -reducer "shellReducer.sh" -output 
/dfsOutputDir/myResults  
+ stream -input /dfsInputDir/someInputData -file shellMapper.sh -mapper 
"shellMapper.sh" -file shellReducer.sh  -reducer "shellReducer.sh" -output 
/dfsOutputDir/myResults
  }}}
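  Spelled out with the hadoop jar invocation and the streaming jar path used at 
the top of this page, the same job looks roughly like this (the exact jar 
location varies between Hadoop versions):
  
  {{{
  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar \
      -input   /dfsInputDir/someInputData \
      -output  /dfsOutputDir/myResults \
      -mapper  "shellMapper.sh"  -file shellMapper.sh \
      -reducer "shellReducer.sh" -file shellReducer.sh
  }}}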
- 
- The real place the logical correspondence breaks down is that in a one 
machine scripting environment shellMapper.sh and shellReducer.sh will each run 
as a single process and data will flow directly from one process to the other.  
With Hadoop the shellMapper.sh file will be sent to every machine on the 
cluster that has data chunks and each such machine will run it's own chunk 
through the shellMapper.sh process on each machine.  The output from those 
scripts ''doesn't'' run a reduce on each of those machines.  Instead the output 
is sorted so that different lines from various mapping jobs are streamed across 
the network to different machines (Hadoop defaults to four machines) where the 
reduce(s) can be performed.  
+ The real place the logical correspondence breaks down is that in a one-machine 
scripting environment shellMapper.sh and shellReducer.sh each run as a single 
process, and data flows directly from one process to the other.  With Hadoop, the 
shellMapper.sh file is sent to every machine on the cluster that holds data 
chunks, and each such machine runs its own chunks through its copy of 
shellMapper.sh.  The output from those scripts is ''not'' reduced on each of 
those machines.  Instead, the output is sorted so that different lines from the 
various mapping jobs are streamed across the network to different machines 
(Hadoop defaults to four machines) where the reduce(s) can be performed.
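  If you want a different number of reduces, the reduce count can be set on the 
streaming command line.  Exactly how depends on the Hadoop version; with the 
HSTREAMING shortcut from above it is typically something like:
  
  {{{
  # older streaming releases take -jobconf; newer ones also accept -numReduceTasks
  $HSTREAMING -jobconf mapred.reduce.tasks=2 \
      -input /dfsInputDir/someInputData -output /dfsOutputDir/myResults \
      -mapper "shellMapper.sh" -file shellMapper.sh \
      -reducer "shellReducer.sh" -file shellReducer.sh
  }}}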
  
  Here are practical tips for getting things working well:
  
