I can try it, but more importantly, should we hook it into bin/mahout?
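
If we did, something like the following rough, untested sketch could go in the non-local branch of bin/mahout (MAHOUT_EXTRA_JARS is a hypothetical, user-supplied comma-separated list of jars; this also assumes the driver class runs through ToolRunner so that -libjars gets parsed):

# rough, untested sketch; MAHOUT_EXTRA_JARS is hypothetical
if [ -n "$MAHOUT_EXTRA_JARS" ]; then
    exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar \
        $MAHOUT_JOB $CLASS -libjars "$MAHOUT_EXTRA_JARS" "$@"
fi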

-Grant

On Jul 21, 2011, at 12:29 PM, Jake Mannix wrote:

> This is one of the poster-child use cases for the -libjars flag to hadoop's
> shell script.  Have you tried to see if that works?
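> 
> Untested, but assuming the driver goes through ToolRunner (so the generic
> options are parsed), the invocation would look something like this, with
> the analyzer jar path as a placeholder:
> 
> $HADOOP_HOME/bin/hadoop jar $MAHOUT_JOB \
>   org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
>   -libjars /path/to/my-analyzer.jar \
>   --analyzerName my.great.Analyzer ...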
> 
>  -jake
> 
> On Thu, Jul 21, 2011 at 5:15 AM, Grant Ingersoll <[email protected]> wrote:
> 
>> Yeah, I ended up creating an alternate Jar, but I also don't know that
>> our script is doing what it is supposed to here.  Or, better said, it
>> would be desirable if we could make this easier for people.
>> 
>> -Grant
>> 
>> On Jul 20, 2011, at 11:58 PM, Elmer Garduno wrote:
>> 
>>> I have faced this problem in the past; the solution was to add the
>>> analyzer jar to the job's jar [1] in order to have the analyzer
>>> installed on the cluster nodes.
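>>> 
>>> For example (untested; the jar names here are placeholders), the
>>> dependency can be repacked under the job jar's lib/ directory, which
>>> Hadoop unpacks onto the task classpath:
>>> 
>>> mkdir lib
>>> cp my-analyzer.jar lib/
>>> jar uf mahout-examples-0.5-job.jar lib/my-analyzer.jar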
>>> 
>>> [1]
>>> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>>> 
>>> On Wed, Jul 20, 2011 at 10:53 AM, Grant Ingersoll <[email protected]> wrote:
>>> 
>>>> I'm trying to understand a bit what our preferred mechanism is for
>>>> users to add custom libraries to the Mahout classpath when running on
>>>> Hadoop.  The obvious case that comes to mind is adding your own Lucene
>>>> Analyzer, which is what I am trying to do.
>>>> 
>>>> Looking at bin/mahout, in the non-core case we define CLASSPATH to be:
>>>> # add release dependencies to CLASSPATH
>>>> for f in $MAHOUT_HOME/mahout-*.jar; do
>>>>  CLASSPATH=${CLASSPATH}:$f;
>>>> done
>>>> 
>>>> # add dev targets if they exist
>>>> for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
>>>>  CLASSPATH=${CLASSPATH}:$f;
>>>> done
>>>> 
>>>> # add lib directory dependencies to CLASSPATH
>>>> for f in $MAHOUT_HOME/lib/*.jar; do
>>>>  CLASSPATH=${CLASSPATH}:$f;
>>>> done
>>>> 
>>>> From the looks of it, I could, on trunk, add a lib directory and just
>>>> shove my dependency into that dir.
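>>>> 
>>>> e.g. (untested; the jar name is a placeholder):
>>>> 
>>>> mkdir -p $MAHOUT_HOME/lib
>>>> cp my-analyzer.jar $MAHOUT_HOME/lib/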
>>>> 
>>>> However, further down, we don't seem to use that CLASSPATH except when
>>>> in LOCAL mode or "hadoop" mode:
>>>> if [ "$1" = "hadoop" ]; then
>>>>    export
>>>> HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
>>>>    exec "$HADOOP_HOME/bin/$@"
>>>> else
>>>>    echo "MAHOUT-JOB: $MAHOUT_JOB"
>>>>    export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
>>>>    exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar
>>>> $MAHOUT_JOB $CLASS "$@"
>>>> fi
>>>> 
>>>> So this means I should force "hadoop" mode by doing:
>>>> 
>>>> ./bin/mahout hadoop \
>>>>   org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ... \
>>>>   --analyzerName my.great.Analyzer
>>>> 
>>>> instead of:
>>>> ./bin/mahout seq2sparse ...
>>>> 
>>>> However, I still get a ClassNotFoundException, even though when I echo
>>>> $HADOOP_CLASSPATH my jar is in there, and the jar contains my Analyzer.
>>>> 
>>>> Any insight?
>>>> 
>>>> --------------------------
>>>> Grant Ingersoll
>>>> 
>> 
>> --------------------------
>> Grant Ingersoll
>> 

--------------------------
Grant Ingersoll