We respect hadoop args, right?  Or only the -D ones?  We should support
those in bin/mahout, yes.
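(For illustration, a minimal sketch of what that could look like from the user's side, assuming bin/mahout forwarded the remaining arguments unchanged to hadoop jar and the driver class ran through ToolRunner; the jar path and analyzer class below are placeholders:

  ./bin/mahout seq2sparse \
    -libjars /path/to/custom-analyzer.jar \
    -Dmapred.reduce.tasks=10 \
    -i reuters-seqfiles -o reuters-vectors \
    --analyzerName com.example.MyAnalyzer

The generic options (-D, -libjars, -files, -archives) have to appear before the tool's own arguments for GenericOptionsParser to pick them up.)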

On Thu, Jul 21, 2011 at 11:06 AM, Grant Ingersoll <[email protected]> wrote:

> I can try it, but more importantly, should we hook it into bin/mahout?
>
> -Grant
>
> On Jul 21, 2011, at 12:29 PM, Jake Mannix wrote:
>
> > This is one of the poster-child use cases for the -libjars flag to
> > hadoop's shell script.  Have you tried to see if that works?
> >
> >  -jake
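(For reference, a rough sketch of the -libjars route described above, going through hadoop's shell script directly; the job jar path and analyzer names are placeholders, and -libjars is only honored when the driver class is run through ToolRunner/GenericOptionsParser:

  hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.5-job.jar \
    org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
    -libjars /path/to/custom-analyzer.jar \
    -i reuters-seqfiles -o reuters-vectors \
    --analyzerName com.example.MyAnalyzer

Hadoop copies the -libjars entries into the distributed cache and adds them to the task classpath on the cluster nodes.)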
> >
> > On Thu, Jul 21, 2011 at 5:15 AM, Grant Ingersoll <[email protected]> wrote:
> >
> >> Yeah, I ended up creating an alternate Jar, but I also don't know that
> >> our script is doing what it is supposed to here.  Or, better said, it
> >> would be desirable if we were able to make this easier for people.
> >>
> >> -Grant
> >>
> >> On Jul 20, 2011, at 11:58 PM, Elmer Garduno wrote:
> >>
> >>> I have faced this problem in the past; the solution was to add the
> >>> analyzer jar to the job's jar [1] in order to have the analyzer
> >>> installed on the cluster nodes.
> >>>
> >>> [1]
> >>> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
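(A rough sketch of the approach described above, bundling the analyzer jar inside the job jar so the task JVMs can see it; the jar names are placeholders:

  mkdir -p bundle/lib
  cp /path/to/custom-analyzer.jar bundle/lib/
  cd bundle
  jar uf ../mahout-examples-0.5-job.jar lib/custom-analyzer.jar

When Hadoop unpacks the job jar on the task nodes, everything under its lib/ directory is added to the task classpath.)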
> >>>
> >>> On Wed, Jul 20, 2011 at 10:53 AM, Grant Ingersoll <[email protected]> wrote:
> >>>
> >>>> I'm trying to understand a bit what our preferred mechanism is for
> >>>> users to add custom libraries to the Mahout classpath when running on
> >>>> Hadoop.  The obvious case that comes to mind is adding your own Lucene
> >>>> Analyzer, which is what I am trying to do.
> >>>>
> >>>> In looking at bin/mahout, we define CLASSPATH, in the non-core case, to be:
> >>>> # add release dependencies to CLASSPATH
> >>>> for f in $MAHOUT_HOME/mahout-*.jar; do
> >>>>  CLASSPATH=${CLASSPATH}:$f;
> >>>> done
> >>>>
> >>>> # add dev targets if they exist
> >>>> for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
> >>>>  CLASSPATH=${CLASSPATH}:$f;
> >>>> done
> >>>>
> >>>> # add release dependencies to CLASSPATH
> >>>> for f in $MAHOUT_HOME/lib/*.jar; do
> >>>>  CLASSPATH=${CLASSPATH}:$f;
> >>>> done
> >>>>
> >>>> From the looks of it, I could, on trunk, add in a lib directory and
> >>>> just shove my dependency into that dir.
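(As a sketch, dropping the jar into that directory is just a copy; the path and jar name are placeholders:

  mkdir -p $MAHOUT_HOME/lib
  cp /path/to/custom-analyzer.jar $MAHOUT_HOME/lib/

Note that this only affects the client-side CLASSPATH that bin/mahout builds; it does not by itself ship the jar to the cluster nodes.)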
> >>>>
> >>>> However, further down, we don't seem to use that CLASSPATH, except
> >>>> when in LOCAL mode or "hadoop" mode:
> >>>> if [ "$1" = "hadoop" ]; then
> >>>>    export
> >>>> HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
> >>>>    exec "$HADOOP_HOME/bin/$@"
> >>>> else
> >>>>    echo "MAHOUT-JOB: $MAHOUT_JOB"
> >>>>    export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
> >>>>    exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar
> >>>> $MAHOUT_JOB $CLASS "$@"
> >>>> fi
> >>>>
> >>>> So this means I should force "hadoop" mode by doing:
> >>>> ./bin/mahout hadoop
> >>>> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ...
> >>>> --analyzerName my.great.Analyzer
> >>>>
> >>>> instead of:
> >>>> ./bin/mahout seq2sparse ...
> >>>>
> >>>> However, I still get Class Not Found even though, when I echo
> >>>> $HADOOP_CLASSPATH, my jar is in there and the jar contains my Analyzer.
> >>>>
> >>>> Any insight?
> >>>>
> >>>> --------------------------
> >>>> Grant Ingersoll
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >> --------------------------
> >> Grant Ingersoll
> >>
> >>
> >>
> >>
>
> --------------------------
> Grant Ingersoll
>
>
>
>
