This is one of the poster-child use cases for the -libjars flag to Hadoop's
shell script.  Have you tried it to see if that works?
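
Unlike HADOOP_CLASSPATH, which only affects the client-side JVM that
submits the job, -libjars ships the listed jars to the cluster and puts
them on the task classpath.  A rough, untested sketch (the jar path is
hypothetical, and this assumes the driver goes through
ToolRunner/GenericOptionsParser so the generic options actually get
parsed):

# -libjars must come after the main class and before the tool's own args
hadoop jar mahout-examples-*-job.jar \
  org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
  -libjars /path/to/my-analyzers.jar \
  ... --analyzerName my.great.Analyzer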

  -jake

On Thu, Jul 21, 2011 at 5:15 AM, Grant Ingersoll <[email protected]> wrote:

> Yeah, I ended up creating an alternate jar, but I also don't know that our
> script is doing what it is supposed to here.  Or, better said, it would be
> desirable if we were able to make this easier for people.
>
> -Grant
>
> On Jul 20, 2011, at 11:58 PM, Elmer Garduno wrote:
>
> > I have faced this problem in the past; the solution was to add the
> > analyzer jar to the job's jar [1] in order to have the analyzer
> > available on the cluster nodes.
> >
> > [1] http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
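> >
> > A rough sketch of that repacking step (jar names and paths here are
> > hypothetical): Hadoop puts anything under lib/ inside the job jar on
> > the task classpath, so it is enough to update the job jar in place:
> >
> > # stage the dependency under lib/, then add it to the existing job jar
> > mkdir -p lib
> > cp /path/to/my-analyzers.jar lib/
> > jar uf mahout-examples-*-job.jar lib/my-analyzers.jar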
> >
> > On Wed, Jul 20, 2011 at 10:53 AM, Grant Ingersoll <[email protected]> wrote:
> >
> >> I'm trying to understand a bit what our preferred mechanism is for
> >> users to add custom libraries to the Mahout classpath when running on
> >> Hadoop.  The obvious case that comes to mind is adding your own Lucene
> >> Analyzer, which is what I am trying to do.
> >>
> >> In looking at bin/mahout, we define CLASSPATH in the non-core case to be:
> >> # add release dependencies to CLASSPATH
> >> for f in $MAHOUT_HOME/mahout-*.jar; do
> >>   CLASSPATH=${CLASSPATH}:$f;
> >> done
> >>
> >> # add dev targets if they exist
> >> for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
> >>   CLASSPATH=${CLASSPATH}:$f;
> >> done
> >>
> >> # add release dependencies to CLASSPATH
> >> for f in $MAHOUT_HOME/lib/*.jar; do
> >>   CLASSPATH=${CLASSPATH}:$f;
> >> done
> >>
> >> From the looks of it, I could, on trunk, add a lib directory and just
> >> shove my dependency into that dir.
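> >> (i.e., something like: cp my-analyzers.jar $MAHOUT_HOME/lib/, given
> >> that third loop over $MAHOUT_HOME/lib/*.jar above)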
> >>
> >> However, further down, we don't seem to use that CLASSPATH, except
> >> when in LOCAL mode or "hadoop" mode:
> >> if [ "$1" = "hadoop" ]; then
> >>     export
> >> HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
> >>     exec "$HADOOP_HOME/bin/$@"
> >> else
> >>     echo "MAHOUT-JOB: $MAHOUT_JOB"
> >>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
> >>     exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar
> >> $MAHOUT_JOB $CLASS "$@"
> >> fi
> >>
> >> So this means I should force "hadoop" mode by doing:
> >> ./bin/mahout hadoop \
> >>   org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ... \
> >>   --analyzerName my.great.Analyzer
> >>
> >> instead of:
> >> ./bin/mahout seq2sparse ...
> >>
> >> However, I still get a ClassNotFoundException, even though when I echo
> >> $HADOOP_CLASSPATH my jar is in there and the jar contains my Analyzer.
> >>
> >> Any insight?
> >>
> >> --------------------------
> >> Grant Ingersoll
>
> --------------------------
> Grant Ingersoll
