I'm trying to understand a bit what our preferred mechanism is for users to add
custom libraries to the Mahout classpath when running on Hadoop. The obvious
case that comes to mind is adding your own Lucene Analyzer, which is what I am
trying to do.
In looking at bin/mahout, we define CLASSPATH (in the non-core case) to be:
  # add release dependencies to CLASSPATH
  for f in $MAHOUT_HOME/mahout-*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # add dev targets if they exist
  for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # add release dependencies to CLASSPATH
  for f in $MAHOUT_HOME/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done
From the looks of it, I could, on trunk, add in a lib directory and just shove
my dependency into that dir.
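If that's the intended mechanism, the glob should pick up anything dropped into lib/. A quick self-contained check of that loop, using a throwaway directory standing in for MAHOUT_HOME and a dummy jar name (my-analyzers.jar is hypothetical):

```shell
# Simulate $MAHOUT_HOME with a throwaway directory and a dummy custom jar
MAHOUT_HOME=$(mktemp -d)
mkdir -p "$MAHOUT_HOME/lib"
touch "$MAHOUT_HOME/lib/my-analyzers.jar"

# Same loop as bin/mahout: every jar in lib/ lands on CLASSPATH
CLASSPATH=
for f in "$MAHOUT_HOME"/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done
echo "$CLASSPATH"   # ends with .../lib/my-analyzers.jar
```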
However, further down, we don't seem to use that CLASSPATH, except when in
LOCAL mode or "hadoop" mode:
  if [ "$1" = "hadoop" ]; then
    export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
    exec "$HADOOP_HOME/bin/$@"
  else
    echo "MAHOUT-JOB: $MAHOUT_JOB"
    export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
    exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar $MAHOUT_JOB \
      $CLASS "$@"
  fi
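To make the difference concrete, here's what each branch would export, sketched with illustrative stand-in paths (my-analyzers.jar is a hypothetical custom jar, not a real file in the tree):

```shell
# Stand-in values, purely illustrative
MAHOUT_CONF_DIR=/opt/mahout/conf
HADOOP_CLASSPATH=/opt/hadoop/extra.jar
CLASSPATH=/opt/mahout/lib/my-analyzers.jar

# "hadoop" branch: the lib jars in $CLASSPATH are appended
HADOOP_CP_HADOOP_MODE=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH

# default branch: $CLASSPATH is dropped entirely
HADOOP_CP_DEFAULT=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}

echo "$HADOOP_CP_HADOOP_MODE"   # includes my-analyzers.jar
echo "$HADOOP_CP_DEFAULT"       # does not
```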
So this means I should be able to force "hadoop" mode by doing:

  ./bin/mahout hadoop org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
    ... --analyzerName my.great.Analyzer

instead of:

  ./bin/mahout seq2sparse ...
However, I still get a ClassNotFoundException, even though when I echo
$HADOOP_CLASSPATH my jar is in there, and the jar contains my Analyzer.
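One sanity check I've been doing is confirming the jar entry name actually matches the class: the entry should be the fully qualified name with dots turned into slashes, plus .class (my.great.Analyzer is the placeholder name from above):

```shell
# Translate the analyzer's class name into the jar entry to look for
CLASS=my.great.Analyzer
ENTRY=$(echo "$CLASS" | tr . /).class
echo "$ENTRY"   # my/great/Analyzer.class

# then, against the real jar:
#   jar tf my-analyzers.jar | grep "$ENTRY"
```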
Any insight?
--------------------------
Grant Ingersoll