I'm trying to understand what our preferred mechanism is for users to add 
custom libraries to the Mahout classpath when running on Hadoop.  The obvious 
case that comes to mind is adding your own Lucene Analyzer, which is what I am 
trying to do.

In looking at bin/mahout, we define CLASSPATH, in the non-core case, to be:
# add release dependencies to CLASSPATH
  for f in $MAHOUT_HOME/mahout-*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # add dev targets if they exist
  for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # add release dependencies to CLASSPATH
  for f in $MAHOUT_HOME/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done

From the looks of it, I could, on trunk, add in a lib directory and just shove 
my dependency into that dir.
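That part seems to work: simulating the lib/ scan against a scratch directory 
(the jar name below is just a stand-in for a custom dependency) shows the jar 
landing on CLASSPATH:

```shell
# Simulate bin/mahout's lib/*.jar scan with a throwaway MAHOUT_HOME
MAHOUT_HOME=$(mktemp -d)
mkdir -p "$MAHOUT_HOME/lib"
touch "$MAHOUT_HOME/lib/my-analyzer.jar"   # stand-in for a custom dependency

CLASSPATH=""
for f in "$MAHOUT_HOME"/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done

# The custom jar should now appear on the path
echo "$CLASSPATH"
```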

However, further down, we don't seem to use that CLASSPATH, except when in 
LOCAL mode or "hadoop" mode:
if [ "$1" = "hadoop" ]; then
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
      exec "$HADOOP_HOME/bin/$@"
else
      echo "MAHOUT-JOB: $MAHOUT_JOB"
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
      exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar $MAHOUT_JOB $CLASS "$@"
fi

So this means I should force "hadoop" mode by doing:
./bin/mahout hadoop org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ... --analyzerName my.great.Analyzer

instead of:
./bin/mahout seq2sparse ...

However, I still get a Class Not Found error, even though when I echo 
$HADOOP_CLASSPATH my jar is on it, and the jar does contain my Analyzer.

Any insight?

--------------------------
Grant Ingersoll


