This is one of the poster-child use cases for the -libjars flag to the hadoop shell script. Have you tried that to see if it works?
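Untested sketch of what I mean (this assumes SparseVectorsFromSequenceFiles runs through ToolRunner so the generic options get parsed, and the jar names and paths below are placeholders for yours):

  # -libjars ships the extra jar to the cluster via the distributed cache
  # and puts it on the task classpath; as a generic option it has to come
  # right after the main class name, before the job's own arguments.
  hadoop jar mahout-examples-0.5-job.jar \
    org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
    -libjars /path/to/my-analyzers.jar \
    -i seqfiles -o vectors --analyzerName my.great.Analyzer

One caveat: depending on the Hadoop version, -libjars may or may not also put the jar on the client-side classpath, so if the class gets instantiated in the client JVM at submit time you may still need it on HADOOP_CLASSPATH as well.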
  -jake

On Thu, Jul 21, 2011 at 5:15 AM, Grant Ingersoll <[email protected]> wrote:

> Yeah, I ended up creating an alternate Jar, but I also don't know that our
> script is doing what it is supposed to here. Or, better said, it would be
> desirable if we were able to make this easier for people.
>
> -Grant
>
> On Jul 20, 2011, at 11:58 PM, Elmer Garduno wrote:
>
> > I have faced this problem in the past; the solution was to add the
> > analyzer jar to the job's jar [1] in order to have the analyzer
> > installed on the cluster nodes.
> >
> > [1] http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
> >
> > On Wed, Jul 20, 2011 at 10:53 AM, Grant Ingersoll <[email protected]> wrote:
> >
> >> I'm trying to understand a bit what our preferred mechanism is for
> >> users to add custom libraries to the Mahout classpath when running on
> >> Hadoop. The obvious case that comes to mind is adding your own Lucene
> >> Analyzer, which is what I am trying to do.
> >>
> >> Looking at bin/mahout, we define CLASSPATH, in the non-core case, to be:
> >>
> >>   # add release dependencies to CLASSPATH
> >>   for f in $MAHOUT_HOME/mahout-*.jar; do
> >>     CLASSPATH=${CLASSPATH}:$f;
> >>   done
> >>
> >>   # add dev targets if they exist
> >>   for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
> >>     CLASSPATH=${CLASSPATH}:$f;
> >>   done
> >>
> >>   # add release dependencies to CLASSPATH
> >>   for f in $MAHOUT_HOME/lib/*.jar; do
> >>     CLASSPATH=${CLASSPATH}:$f;
> >>   done
> >>
> >> From the looks of it, I could, on trunk, add a lib directory and just
> >> shove my dependency into that dir.
> >>
> >> However, further down, we don't seem to use that CLASSPATH except in
> >> LOCAL mode or "hadoop" mode:
> >>
> >>   if [ "$1" = "hadoop" ]; then
> >>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
> >>     exec "$HADOOP_HOME/bin/$@"
> >>   else
> >>     echo "MAHOUT-JOB: $MAHOUT_JOB"
> >>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
> >>     exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar $MAHOUT_JOB $CLASS "$@"
> >>   fi
> >>
> >> So this means I should force "hadoop" mode by doing:
> >>
> >>   ./bin/mahout hadoop org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ... --analyzerName my.great.Analyzer
> >>
> >> instead of:
> >>
> >>   ./bin/mahout seq2sparse ...
> >>
> >> However, I still get Class Not Found, even though when I echo
> >> $HADOOP_CLASSPATH my jar is in there and the jar contains my Analyzer.
> >>
> >> Any insight?
> >>
> >> --------------------------
> >> Grant Ingersoll
>
> --------------------------
> Grant Ingersoll
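P.S. For completeness, the job-jar route from Elmer's link amounts to repacking the job jar with the extra dependency under lib/: Hadoop unpacks a job jar on each task node and adds its lib/*.jar to the task classpath. A rough sketch (jar names and paths are placeholders):

  # Repack the Mahout job jar so it carries the analyzer jar along;
  # Hadoop adds lib/*.jar inside a job jar to the task classpath.
  mkdir lib
  cp /path/to/my-analyzers.jar lib/
  jar uf mahout-examples-0.5-job.jar lib/my-analyzers.jar

Note this only covers the task side; like -libjars, it doesn't help a client-side Class Not Found at submit time.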
