Yeah, I ended up creating an alternate Jar, but I'm also not sure our 
script is doing what it is supposed to here.  Or, better said, it would 
be desirable if we could make this easier for people.

-Grant

On Jul 20, 2011, at 11:58 PM, Elmer Garduno wrote:

> I have faced this problem in the past. The solution was to add the analyzer
> jar to the job's jar [1] in order to have the analyzer available on the
> cluster nodes.
> 
> [1]
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
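> 
> Note that HADOOP_CLASSPATH only affects the client-side JVM; the task JVMs
> on the worker nodes never see it, which is why the class can't be found at
> map/reduce time even though it shows up on the client classpath. A minimal
> sketch of repacking the job jar (the jar names and paths here are
> placeholders):
> 
>   # put the analyzer jar into the job jar's lib/ subdirectory;
>   # Hadoop adds jars under lib/ to the task classpath on each node
>   mkdir -p lib
>   cp /path/to/my-analyzer.jar lib/
>   jar uf mahout-examples-0.5-job.jar lib/my-analyzer.jar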
> 
> On Wed, Jul 20, 2011 at 10:53 AM, Grant Ingersoll <[email protected]> wrote:
> 
>> I'm trying to understand a bit what our preferred mechanism is for users to
>> add custom libraries to the Mahout classpath when running on Hadoop.  The
>> obvious case that comes to mind is adding your own Lucene Analyzer, which is
>> what I am trying to do.
>> 
>> In looking at bin/mahout, we define CLASSPATH, in the non-core case, to be:
>> # add release dependencies to CLASSPATH
>> for f in $MAHOUT_HOME/mahout-*.jar; do
>>   CLASSPATH=${CLASSPATH}:$f;
>> done
>> 
>> # add dev targets if they exist
>> for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
>>   CLASSPATH=${CLASSPATH}:$f;
>> done
>> 
>> # add release dependencies to CLASSPATH
>> for f in $MAHOUT_HOME/lib/*.jar; do
>>   CLASSPATH=${CLASSPATH}:$f;
>> done
>> 
>> From the looks of it, I could, on trunk, add a lib directory and just
>> shove my dependency into that dir.
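>> 
>> As a quick sketch (the jar path is hypothetical):
>> 
>>   # drop the dependency where the third loop above will find it
>>   cp /path/to/my-analyzer.jar $MAHOUT_HOME/lib/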
>> 
>> However, further down, we don't seem to use that CLASSPATH, except when in
>> LOCAL mode or "hadoop" mode:
>> if [ "$1" = "hadoop" ]; then
>>     export
>> HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
>>     exec "$HADOOP_HOME/bin/$@"
>> else
>>     echo "MAHOUT-JOB: $MAHOUT_JOB"
>>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
>>     exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar
>> $MAHOUT_JOB $CLASS "$@"
>> fi
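>> 
>> (A side note: Hadoop's generic -libjars option is another way to ship extra
>> jars to the task nodes via the distributed cache, but it only takes effect
>> when the driver class runs through ToolRunner/GenericOptionsParser, which is
>> an assumption worth verifying for any given Mahout job. The jar path below
>> is hypothetical:
>> 
>>   $HADOOP_HOME/bin/hadoop jar $MAHOUT_JOB \
>>       org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
>>       -libjars /path/to/my-analyzer.jar ... --analyzerName my.great.Analyzer
>> 
>> Generic options such as -libjars must come before the job-specific
>> arguments.)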
>> 
>> So this means I should force "hadoop" mode by doing:
>> ./bin/mahout hadoop org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ... --analyzerName my.great.Analyzer
>> 
>> instead of:
>> ./bin/mahout seq2sparse ...
>> 
>> However, I still get a ClassNotFoundException, even though when I echo
>> $HADOOP_CLASSPATH my jar is in there and the jar contains my Analyzer.
>> 
>> Any insight?
>> 
>> --------------------------
>> Grant Ingersoll
>> 

--------------------------
Grant Ingersoll


