We respect Hadoop's generic args, right? Or only the -D ones? We should support those in bin/mahout, yes.
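Roughly, I'm picturing something like the sketch below for the final invocation in bin/mahout: forward any user-supplied jars as Hadoop generic options and let ToolRunner/GenericOptionsParser ship them to the task nodes. (MAHOUT_EXTRA_JARS is just an illustrative name here, not something the script defines today.)

  # rough sketch only -- MAHOUT_EXTRA_JARS is a hypothetical, illustrative
  # variable (comma-separated list of jars), not something bin/mahout has today
  if [ -n "$MAHOUT_EXTRA_JARS" ]; then
    LIBJARS="-libjars $MAHOUT_EXTRA_JARS"
  fi

  echo "MAHOUT-JOB: $MAHOUT_JOB"
  # the jars also go on HADOOP_CLASSPATH (colon-separated) so the client-side
  # driver can load the class before the job is submitted
  export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:${MAHOUT_EXTRA_JARS//,/:}
  # generic options must come before the driver's own options, and they are
  # only honored by drivers that run through ToolRunner/GenericOptionsParser
  exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar \
    $MAHOUT_JOB $CLASS $LIBJARS "$@"

The caveat is that -libjars (and -D) are only picked up by drivers that actually go through ToolRunner; anything that parses its own args will silently ignore them.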
On Thu, Jul 21, 2011 at 11:06 AM, Grant Ingersoll <[email protected]> wrote:

> I can try it, but more importantly, should we hook it into bin/mahout?
>
> -Grant
>
> On Jul 21, 2011, at 12:29 PM, Jake Mannix wrote:
>
>> This is one of the poster-child use cases for the -libjars flag to
>> hadoop's shell script. Have you tried to see if that works?
>>
>> -jake
>>
>> On Thu, Jul 21, 2011 at 5:15 AM, Grant Ingersoll <[email protected]> wrote:
>>
>>> Yeah, I ended up creating an alternate jar, but I also don't know that
>>> our script is doing what it is supposed to here. Or, I guess better
>>> said, it would be desirable if we were able to make this easier for
>>> people.
>>>
>>> -Grant
>>>
>>> On Jul 20, 2011, at 11:58 PM, Elmer Garduno wrote:
>>>
>>>> I have faced this problem in the past; the solution was to add the
>>>> analyzer jar to the job's jar [1] in order to have the analyzer
>>>> installed on the cluster nodes.
>>>>
>>>> [1] http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>>>>
>>>> On Wed, Jul 20, 2011 at 10:53 AM, Grant Ingersoll <[email protected]> wrote:
>>>>
>>>>> I'm trying to understand a bit what our preferred mechanism is for
>>>>> users to add custom libraries to the Mahout classpath when running on
>>>>> Hadoop. The obvious case that comes to mind is adding your own Lucene
>>>>> Analyzer, which is what I am trying to do.
>>>>>
>>>>> In looking at bin/mahout, we define CLASSPATH, in the non-core case, to be:
>>>>>
>>>>>   # add release dependencies to CLASSPATH
>>>>>   for f in $MAHOUT_HOME/mahout-*.jar; do
>>>>>     CLASSPATH=${CLASSPATH}:$f;
>>>>>   done
>>>>>
>>>>>   # add dev targets if they exist
>>>>>   for f in $MAHOUT_HOME/*/target/mahout-examples-*-job.jar; do
>>>>>     CLASSPATH=${CLASSPATH}:$f;
>>>>>   done
>>>>>
>>>>>   # add release dependencies to CLASSPATH
>>>>>   for f in $MAHOUT_HOME/lib/*.jar; do
>>>>>     CLASSPATH=${CLASSPATH}:$f;
>>>>>   done
>>>>>
>>>>> From the looks of it, I could, on trunk, add a lib directory and just
>>>>> shove my dependency into that dir.
>>>>>
>>>>> However, further down we don't seem to use that CLASSPATH except when
>>>>> in LOCAL mode or "hadoop" mode:
>>>>>
>>>>>   if [ "$1" = "hadoop" ]; then
>>>>>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
>>>>>     exec "$HADOOP_HOME/bin/$@"
>>>>>   else
>>>>>     echo "MAHOUT-JOB: $MAHOUT_JOB"
>>>>>     export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
>>>>>     exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar $MAHOUT_JOB $CLASS "$@"
>>>>>   fi
>>>>>
>>>>> So this means I should force "hadoop" mode by doing:
>>>>>
>>>>>   ./bin/mahout hadoop org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles ... --analyzerName my.great.Analyzer
>>>>>
>>>>> instead of:
>>>>>
>>>>>   ./bin/mahout seq2sparse ...
>>>>>
>>>>> However, I still get a ClassNotFoundException even though, when I echo
>>>>> $HADOOP_CLASSPATH, my jar is in there and the jar contains my Analyzer.
>>>>>
>>>>> Any insight?
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>
>>> --------------------------
>>> Grant Ingersoll
>
> --------------------------
> Grant Ingersoll
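(For reference, the -libjars experiment discussed above would look roughly like the following from the command line. Paths are illustrative, and it assumes the seq2sparse driver honors Hadoop's generic options via ToolRunner; if it does its own option parsing, -libjars is ignored.)

  # illustrative paths; the jar goes on HADOOP_CLASSPATH for the client side
  # and through -libjars for the task nodes
  export HADOOP_CLASSPATH=/path/to/my-analyzers.jar:$HADOOP_CLASSPATH
  # same job-jar glob that bin/mahout uses for dev builds
  $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/*/target/mahout-examples-*-job.jar \
    org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
    -libjars /path/to/my-analyzers.jar \
    ... --analyzerName my.great.Analyzer
    # (other seq2sparse options elided, as in Grant's command)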
