I can try it, but more importantly, should we hook it into bin/mahout?
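Something like the following in the job-launch branch of the script might do it. This is just an untested sketch: MAHOUT_LIBJARS is a variable name I'm inventing here, and it only helps if the driver class runs through ToolRunner so that GenericOptionsParser actually sees -libjars:

  # Hypothetical hook: let users hand extra jars to the job via an env
  # variable, e.g. MAHOUT_LIBJARS=/path/to/a.jar,/path/to/b.jar
  # (comma-separated, as -libjars expects). Generic options have to come
  # before the job's own options, hence the placement ahead of "$@".
  LIBJARS_OPT=""
  if [ -n "$MAHOUT_LIBJARS" ]; then
    # left unquoted below on purpose so it splits into "-libjars <jars>"
    LIBJARS_OPT="-libjars $MAHOUT_LIBJARS"
  fi
  exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar \
    $MAHOUT_JOB $CLASS $LIBJARS_OPT "$@"

Users could then do something like:

  MAHOUT_LIBJARS=/path/to/my-analyzers.jar ./bin/mahout seq2sparse ... \
    --analyzerName my.great.Analyzer

-Grant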
On Jul 21, 2011, at 12:29 PM, Jake Mannix wrote:

> This is one of the poster-child use cases for the -libjars flag to hadoop's
> shell script. Have you tried to see if that works?
>
> -jake
>
> On Thu, Jul 21, 2011 at 5:15 AM, Grant Ingersoll <[email protected]> wrote:
>
>> Yeah, I ended up creating an alternate Jar, but I also don't know that our
>> script is doing what it is supposed to here. Or, better said, it would be
>> desirable if we could make this easier for people.
>>
>> -Grant
>>
>> On Jul 20, 2011, at 11:58 PM, Elmer Garduno wrote:
>>
>>> I have faced this problem in the past; the solution was to add the
>>> analyzer jar to the job's jar [1] in order to have the analyzer
>>> installed on the cluster nodes.
>>>
>>> [1] http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>>>
>>> On Wed, Jul 20, 2011 at 10:53 AM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> I'm trying to understand what our preferred mechanism is for users to
>>>> add custom libraries to the Mahout classpath when running on Hadoop.
>>>> The obvious case that comes to mind is adding your own Lucene
>>>> Analyzer, which is what I am trying to do.
>>>>
>>>> Looking at bin/mahout, we define CLASSPATH, in the non-core case, to be:
>>>>
>>>>   # add release dependencies to CLASSPATH
>>>>   for f in $MAHOUT_HOME/mahout-*.jar; do
>>>>     CLASSPATH=${CLASSPATH}:$f;
>>>>   done
>>>>
>>>>   # add dev targets if they exist
>>>>   for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
>>>>     CLASSPATH=${CLASSPATH}:$f;
>>>>   done
>>>>
>>>>   # add release dependencies to CLASSPATH
>>>>   for f in $MAHOUT_HOME/lib/*.jar; do
>>>>     CLASSPATH=${CLASSPATH}:$f;
>>>>   done
>>>>
>>>> From the looks of it, I could, on trunk, add a lib directory and just
>>>> shove my dependency into that dir.
>>>>
>>>> However, further down, we don't seem to use that CLASSPATH except when
>>>> in LOCAL mode or "hadoop" mode:
>>>>
>>>>   if [ "$1" = "hadoop" ]; then
>>>>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
>>>>     exec "$HADOOP_HOME/bin/$@"
>>>>   else
>>>>     echo "MAHOUT-JOB: $MAHOUT_JOB"
>>>>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
>>>>     exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar \
>>>>       $MAHOUT_JOB $CLASS "$@"
>>>>   fi
>>>>
>>>> So this means I should force "hadoop" mode by doing:
>>>>
>>>>   ./bin/mahout hadoop \
>>>>     org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ... \
>>>>     --analyzerName my.great.Analyzer
>>>>
>>>> instead of:
>>>>
>>>>   ./bin/mahout seq2sparse ...
>>>>
>>>> However, I still get Class Not Found, even though when I echo
>>>> $HADOOP_CLASSPATH my jar is in there and the jar contains my Analyzer.
>>>>
>>>> Any insight?
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>
>> --------------------------
>> Grant Ingersoll

--------------------------
Grant Ingersoll
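PS: One thing worth noting on the Class Not Found: HADOOP_CLASSPATH only affects the client-side JVM that submits the job, not the task JVMs on the worker nodes, which would explain why the jar shows up when I echo the variable but the Analyzer still can't be loaded at runtime. -libjars, by contrast, ships the jar to the workers through the distributed cache, so it may be worth trying the driver directly, roughly like this (the jar path is a placeholder, and it assumes the driver parses the Hadoop generic options):

  $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-*-job.jar \
    org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
    -libjars /path/to/my-analyzers.jar ... --analyzerName my.great.Analyzer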
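PPS: For completeness, the repackaging route Elmer points to [1] would look roughly like this. Hadoop unpacks the job jar on each task node and puts anything under its lib/ directory on the task classpath, so the analyzer travels with the job (the jar names below are placeholders):

  # drop the custom analyzer into a lib/ dir and fold it into the job jar
  mkdir lib
  cp /path/to/my-analyzers.jar lib/
  jar uf mahout-examples-0.5-job.jar lib/my-analyzers.jar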
