Yeah, I ended up creating an alternate Jar, but I'm also not sure our script is doing what it's supposed to here. Or, better said, it would be desirable if we could make this easier for people.
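For the archives, here's a rough sketch of the repacking I mean, along the lines of the Cloudera post Elmer links below (jar and path names are illustrative, not from a real build; untested):

  # Hadoop unpacks the submitted job jar on each task node and adds any
  # lib/*.jar inside it to the task classpath, so the analyzer travels
  # with the job instead of needing to be installed on the cluster.
  mkdir -p lib
  cp my-analyzer.jar lib/
  jar uf mahout-examples-0.5-job.jar lib/my-analyzer.jar
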
-Grant

On Jul 20, 2011, at 11:58 PM, Elmer Garduno wrote:

> I have faced this problem in the past; the solution was to add the analyzer
> jar to the job's jar [1] in order to have the analyzer installed on the
> cluster nodes.
>
> [1]
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>
> On Wed, Jul 20, 2011 at 10:53 AM, Grant Ingersoll <[email protected]> wrote:
>
>> I'm trying to understand a bit what our preferred mechanism is for users
>> to add custom libraries to the Mahout classpath when running on Hadoop.
>> The obvious case that comes to mind is adding your own Lucene Analyzer,
>> which is what I am trying to do.
>>
>> Looking at bin/mahout, we define CLASSPATH, in the non-core case, to be:
>>
>>   # add release dependencies to CLASSPATH
>>   for f in $MAHOUT_HOME/mahout-*.jar; do
>>     CLASSPATH=${CLASSPATH}:$f;
>>   done
>>
>>   # add dev targets if they exist
>>   for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
>>     CLASSPATH=${CLASSPATH}:$f;
>>   done
>>
>>   # add release dependencies to CLASSPATH
>>   for f in $MAHOUT_HOME/lib/*.jar; do
>>     CLASSPATH=${CLASSPATH}:$f;
>>   done
>>
>> From the looks of it, I could, on trunk, add a lib directory and just
>> shove my dependency into that dir.
>>
>> However, further down, we don't seem to use that CLASSPATH except in
>> LOCAL mode or "hadoop" mode:
>>
>>   if [ "$1" = "hadoop" ]; then
>>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
>>     exec "$HADOOP_HOME/bin/$@"
>>   else
>>     echo "MAHOUT-JOB: $MAHOUT_JOB"
>>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
>>     exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar \
>>       $MAHOUT_JOB $CLASS "$@"
>>   fi
>>
>> So this means I should force "hadoop" mode by doing:
>>
>>   ./bin/mahout hadoop \
>>     org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ... \
>>     --analyzerName my.great.Analyzer
>>
>> instead of:
>>
>>   ./bin/mahout seq2sparse ...
>>
>> However, I still get Class Not Found, even though when I echo
>> $HADOOP_CLASSPATH my jar is in there and the jar contains my Analyzer.
>>
>> Any insight?
>>
>> --------------------------
>> Grant Ingersoll
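
PS: one thing worth noting is that HADOOP_CLASSPATH only affects the
client-side JVM, not the map/reduce task JVMs on the worker nodes, which
would explain why the jar shows up in the echo but the tasks still can't
find the Analyzer. Besides repacking the job jar, Hadoop's
GenericOptionsParser also understands -libjars, which copies the listed
jars to the cluster and adds them to the task classpath. A sketch, assuming
the driver in question is run through ToolRunner (I haven't verified that
for SparseVectorsFromSequenceFiles), with illustrative paths:

  ./bin/mahout seq2sparse -libjars /path/to/my-analyzer.jar \
    -i /path/to/seqfiles -o /path/to/vectors \
    --analyzerName my.great.Analyzer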
