I would really appreciate any help people can offer on the following matters.
When running a streaming job, -D, -files, -libjars, and -archives don't seem
to work, but -jobconf, -file, -cacheFile, and -cacheArchive do. With any of the
first four parameters anywhere in the command, I always get a "Streaming
Command Failed!" error. The last four work, though. Note that some of these
parameters (-files, for instance) do work when I run a Hadoop job in the normal
framework, just not when I specify the streaming jar.
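For concreteness, an invocation along these lines fails for me (the paths and
program names below are just placeholders):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -files my_native_mapper \
        -input /user/kwiley/input \
        -output /user/kwiley/output \
        -mapper my_native_mapper \
        -reducer NONE

whereas the same command with "-file my_native_mapper" in place of
"-files my_native_mapper" runs fine.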
How do I specify a Java class as the reducer? I have found examples online,
but they always reference "built-in" classes. If I try to use my own class,
the job tracker produces this error: "Cannot run program
"org.uw.astro.coadd.Reducer2": java.io.IOException: error=2, No such file or
directory". As my first question suggests, I have been trying to find ways to
include the .jar file containing the class in the distributed cache, but
-libjars and -archives don't work, and if I upload the .jar to the cluster and
use -cacheArchive, the command runs but I still get the "No such file" error.
I can use natively compiled programs for the mapper and reducer just fine, but
not a Java class. I want a native mapper and a Java reducer; my native mapper
runs, but then the Java reducer fails as described.
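The relevant fragment of my -cacheArchive attempt looks roughly like this (the
HDFS path is a placeholder for our actual one):

    -cacheArchive hdfs://namenode/user/kwiley/coadd.jar#coadd \
    -reducer org.uw.astro.coadd.Reducer2

As I said, the command runs with this, but the reducer still fails with the
"No such file or directory" error above.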
How do I force a single record (input file) to be processed by a single mapper
to get maximum parallelism? All I found online was this terse description (of
an example that gzips files, not my application):
• Generate a file containing the full HDFS path of the input files.
Each map task would get one file name as input.
• Create a mapper script which, given a filename, will get the file to
  local disk, gzip the file, and put it back in the desired output directory.
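If I'm reading that right, the mapper script would be something like the
following (my untested sketch, in Python; the output directory is made up):

    #!/usr/bin/env python
    # Untested sketch of the gzip example above: each line streaming feeds
    # this script is assumed to be a full HDFS path.  The script pulls the
    # file to local disk, gzips it, and puts the result back into HDFS.
    import os
    import sys

    OUTPUT_DIR = "/user/kwiley/gzipped"  # hypothetical destination

    for line in sys.stdin:
        path = line.strip()  # one HDFS path per input record
        if not path:
            continue
        name = os.path.basename(path)
        os.system("hadoop fs -get %s %s" % (path, name))  # fetch to local disk
        os.system("gzip %s" % name)                       # compress locally
        os.system("hadoop fs -put %s.gz %s/%s.gz" % (name, OUTPUT_DIR, name))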
Even assuming that guess is right, I don't understand exactly how to go about
doing this in my case. In the normal Hadoop framework I have achieved this
goal by setting mapred.max.split.size small enough that only one input record
fits (about 6 MB). I tried the same thing with my streaming job via "-jobconf
mapred.max.split.size=X", where X is a very low number, about the size of a
single streaming input record (which in the streaming case is not 6 MB but
merely ~100 bytes, just a filename referenced via -cacheFile), but it didn't
work: multiple records were still sent to each mapper. Achieving 1-to-1
parallelism between map tasks, nodes, and input records is very important
because my map tasks take a very long time to run, upwards of an hour each. I
cannot have them queueing up on a small number of nodes while numerous unused
nodes (task slots) are available to do work.
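In case it helps, the command I tried was along these lines (paths and the
split size are illustrative, and I've omitted the -cacheFile references for
the actual data files):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -jobconf mapred.max.split.size=200 \
        -input /user/kwiley/filelist.txt \
        -output /user/kwiley/output \
        -mapper my_native_mapper \
        -file my_native_mapper \
        -reducer org.uw.astro.coadd.Reducer2

where filelist.txt contains one ~100-byte record (a filename) per line, yet
each mapper still received several of those records.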
I realize I'm asking a lot of questions here, but I would greatly appreciate
any assistance on these issues.
Thanks.
________________________________________________________________________________
Keith Wiley [email protected] keithwiley.com music.keithwiley.com
"Luminous beings are we, not this crude matter."
-- Yoda
________________________________________________________________________________