I would really appreciate any help people can offer on the following matters.
When running a streaming job, -D, -files, -libjars, and -archives don't seem
to work, but -jobconf, -file, -cacheFile, and -cacheArchive do. With any of the
first four parameters anywhere in the command, I always get a "Streaming
Command Failed!" error. The last four work, though. Note that some of these
parameters (-files, for instance) do work when I run a Hadoop job in the normal
framework, just not when I specify the streaming jar.
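For concreteness, an invocation along these lines fails for me (the paths and
program names below are just placeholders):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -files my_native_mapper \
        -input /user/kwiley/input \
        -output /user/kwiley/output \
        -mapper my_native_mapper \
        -reducer NONE

whereas the same command with "-file my_native_mapper" in place of
"-files my_native_mapper" runs fine.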
How do I specify a Java class as the reducer? I have found examples online,
but they always reference "built-in" classes. If I try to use my own class,
the job tracker produces this error: "Cannot run program
"org.uw.astro.coadd.Reducer2": java.io.IOException: error=2, No such file or
directory". As my first question suggests, I have been trying to find ways to
include the .jar file containing the class in the distributed cache, but
-libjars and -archives don't work, and if I upload the .jar to the cluster and
use -cacheArchive, the command runs but I still get the "No such file" error.
I can use natively compiled programs for the mapper and reducer just fine, but
not a Java class. I want a native mapper and a Java reducer; my native mapper
runs, but then the Java reducer fails as described.
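The relevant fragment of my -cacheArchive attempt looks roughly like this (the
HDFS path is a placeholder for our actual one):

    -cacheArchive hdfs://namenode/user/kwiley/coadd.jar#coadd \
    -reducer org.uw.astro.coadd.Reducer2

As I said, the command runs with this, but the reducer still fails with the
"No such file or directory" error above.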
How do I force a single record (input file) to be processed by a single mapper
to get maximum parallelism? All I found online was this terse description (of
an example that gzips files, not my application):
• Generate a file containing the full HDFS path of the input files.
Each map task would get one file name as input.
• Create a mapper script which, given a filename, will get the file to
  local disk, gzip the file, and put it back in the desired output directory.
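If I'm reading that right, the mapper script would be something like the
following (my untested sketch, in Python; the output directory is made up):

    #!/usr/bin/env python
    # Untested sketch of the gzip example above: each line streaming feeds
    # this script is assumed to be a full HDFS path.  The script pulls the
    # file to local disk, gzips it, and puts the result back into HDFS.
    import os
    import sys

    OUTPUT_DIR = "/user/kwiley/gzipped"  # hypothetical destination

    for line in sys.stdin:
        path = line.strip()  # one HDFS path per input record
        if not path:
            continue
        name = os.path.basename(path)
        os.system("hadoop fs -get %s %s" % (path, name))  # fetch to local disk
        os.system("gzip %s" % name)                       # compress locally
        os.system("hadoop fs -put %s.gz %s/%s.gz" % (name, OUTPUT_DIR, name))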
Even assuming that guess is right, I don't understand exactly how to go about
doing this in my case. In the normal Hadoop framework I have achieved this
goal by setting mapred.max.split.size small enough that only one input record
fits (about 6 MB). I tried the same thing with my streaming job via "-jobconf
mapred.max.split.size=X", where X is a very low number, about the size of a
single streaming input record (which in the streaming case is not 6 MB but
merely ~100 bytes, just a filename referenced via -cacheFile), but it didn't
work: multiple records were still sent to each mapper. Achieving 1-to-1
parallelism between map tasks, nodes, and input records is very important
because my map tasks take a very long time to run, upwards of an hour each. I
cannot have them queueing up on a small number of nodes while numerous unused
nodes (task slots) are available to do work.
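In case it helps, the command I tried was along these lines (paths and the
split size are illustrative, and I've omitted the -cacheFile references for
the actual data files):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -jobconf mapred.max.split.size=200 \
        -input /user/kwiley/filelist.txt \
        -output /user/kwiley/output \
        -mapper my_native_mapper \
        -file my_native_mapper \
        -reducer org.uw.astro.coadd.Reducer2

where filelist.txt contains one ~100-byte record (a filename) per line, yet
each mapper still received several of those records.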
I realize I'm asking a lot of questions here, but I would greatly appreciate
any assistance on these issues.
Thanks.
________________________________________________________________________________
Keith Wiley [email protected] keithwiley.com music.keithwiley.com
"Luminous beings are we, not this crude matter."
-- Yoda
________________________________________________________________________________