[jira] [Updated] (MAHOUT-800) bin/mahout attempts cluster mode if HADOOP_CONF_DIR is set plausibly (and hence appended to classpath), even with MAHOUT_LOCAL set and no HADOOP_HOME

Dan Brickley (JIRA) Fri, 02 Sep 2011 01:58:46 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dan Brickley updated MAHOUT-800:
--------------------------------

    Attachment: MAHOUT-800.patch

The simplest thing that made the problem go away.

> bin/mahout attempts cluster mode  if HADOOP_CONF_DIR is set plausibly (and 
> hence appended to classpath), even with MAHOUT_LOCAL set and no HADOOP_HOME
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-800
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-800
>             Project: Mahout
>          Issue Type: Bug
>          Components: Examples, Integration
>         Environment: OSX; java version "1.6.0_26"
>            Reporter: Dan Brickley
>            Priority: Minor
>         Attachments: MAHOUT-800.patch
>
>
> (This began as a build-reuters.sh bug report, but the problem seemed deeper; 
> please excuse the narrative format here)
> Summary: both examples/bin/build-reuters.sh and bin/mahout will attempt 
> cluster mode if HADOOP_CONF_DIR env variable points at a Hadoop conf/ 
> directory, because bin/mahout appends it to Java's classpath. This seems to 
> trigger something in Mahout Java that will try to use the cluster, without 
> this being explicitly requested.
> There have been reports (Jeff Eastman, myself; 
> http://mail-archives.apache.org/mod_mbox/mahout-user/201108.mbox/%3CCAFNgM+Y4twNVL_RSyNb+hGhoAu0xW917YfUTW3a5-m=z0dy...@mail.gmail.com%3E
>  ) of build-reuters.sh attempting cluster mode, even while claiming 
> "MAHOUT_LOCAL is set, running locally" (or, for that matter, in slightly 
> different conditions, "no HADOOP_HOME set, running locally").
> Experimenting here with a fresh trunk install, clean ~/.m2/ on a laptop with 
> a pseudo-cluster Hadoop configuration available, I find HADOOP_CONF_DIR seems 
> to be the key.
> When HADOOP_CONF_DIR is set to a working value (regardless of whether cluster 
> is running), and regardless of HADOOP_HOME and MAHOUT_LOCAL, build-reuters.sh 
> tries to use the cluster. Aside: this is not the same as it using 
> non-clustering local Hadoop, since I see errors such as "11/09/02 09:27:10 
> INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. 
> Already tried 1 time(s)." unless the cluster is up. If the cluster is up and 
> accessible, I'll see java.io.IOException instead, presumably since the files 
> aren't there.
> If I do 'export HADOOP_CONF_DIR=' then build-reuters.sh (both kmeans and lda 
> modes) runs OK without real Hadoop.
> If I retry with a bogus value for HADOOP_CONF_DIR e.g. /foo, this also seems 
> fine. Only when it finds a Hadoop installation does it get confused.
> Minimally I'd consider this a documentation issue. Nothing in the 
> build-reuters.sh script mentions the role of HADOOP_CONF_DIR. Reading 
> build-reuters.sh I get the impression both clustered and local modes are 
> possible; however, mailing list discussion leaves me unsure whether clustered 
> mode is still supposed to work in trunk.
> Tests: (with no HADOOP_HOME set)
> Running these extracts from build-reuters.sh in examples/bin/ after having 
> previously run build-reuters.sh to fetch data...
> #this one runs OK
> MAHOUT_LOCAL=true HADOOP_CONF_DIR=/foo ../../bin/mahout seqdirectory \
>         -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir \
>         -c UTF-8 -chunk 5
> # this fails (assuming there's a Hadoop there) by attempting clustered mode:
> # 'Call to localhost/127.0.0.1:9000 failed...'
> MAHOUT_LOCAL=true HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf \
>         ../../bin/mahout seqdirectory \
>         -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir \
>         -c UTF-8 -chunk 5
> Same thing with seq2sparse
> # fails, localhost:9000
> HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf MAHOUT_LOCAL=true \
>     ../../bin/mahout seq2sparse \
>     -i mahout-work/reuters-out-seqdir/ \
>     -o mahout-work/reuters-out-seqdir-sparse-kmeans
> # runs locally just fine (because of bad hadoop conf path)
> HADOOP_CONF_DIR=$HOME/bad/path/working/hadoop/hadoop-0.20.2/conf \
>     MAHOUT_LOCAL=true ../../bin/mahout seq2sparse \
>     -i mahout-work/reuters-out-seqdir/ \
>     -o mahout-work/reuters-out-seqdir-sparse-kmeans
> I get same behaviour from '../../bin/mahout kmeans' too, so the problem seems 
> general, not driver-specific. 
> All this seems to contradict the notes in ../../bin/mahout, i.e.
> #   MAHOUT_LOCAL       set to anything other than an empty string to force
> #                      mahout to run locally even if
> #                      HADOOP_CONF_DIR and HADOOP_HOME are set
> Digging into bin/mahout it seems the accidental clustering happens deeper 
> into java-land, not in the .sh; it's not invoking hadoop directly there. We 
> get this far:
>   exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
> I compared the Java commandlines generated by successful vs 
> accidentally-cluster-invoking runs of bin/mahout ...it seems the only 
> difference is whether a hadoop conf directory is on the classpath that's 
> passed to Java.
> If I blank out both with 'HADOOP_CONF_DIR=' and 'HADOOP_HOME=' and then run 
> MAHOUT_LOCAL=true ../../bin/mahout kmeans \
>     -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
>     -c mahout-work/reuters-kmeans-clusters \
>     -o mahout-work/reuters-kmeans \
>     -x 10 -k 20 -ow
> ...against an edited version of bin/mahout that appends a hadoop conf dir to 
> the classpath, i.e.
>   exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS \
>     -classpath "$CLASSPATH:/Users/danbri/working/hadoop/hadoop-0.20.2/conf" \
>     $CLASS "$@"
> This is enough to get "Exception in thread "main" java.io.IOException: Call 
> to localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException"
> (...and if I remove the /conf path from classpath, we're back to expected 
> behaviours).
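A plausible explanation for the classpath experiment above, offered as an assumption rather than something this report confirms: Hadoop's Configuration class loads core-site.xml as a classpath resource, so a conf directory on the classpath can supply a cluster filesystem address (localhost:9000 here). The sketch below lists classpath entries that are directories containing core-site.xml. The function name find_hadoop_conf_on_classpath is hypothetical and not part of bin/mahout.

```shell
# Hypothetical diagnostic helper (not part of bin/mahout).
# Prints each entry of a colon-separated classpath that is a directory
# containing core-site.xml, i.e. a likely Hadoop conf directory.
find_hadoop_conf_on_classpath() {
  old_ifs=$IFS
  IFS=:
  # $1 is deliberately unquoted so it is split on ':' into entries.
  for entry in $1; do
    if [ -d "$entry" ] && [ -f "$entry/core-site.xml" ]; then
      echo "$entry"
    fi
  done
  IFS=$old_ifs
}
```

Calling find_hadoop_conf_on_classpath "$CLASSPATH" just before the exec line would print nothing when no Hadoop conf directory is on the classpath, and the offending directory otherwise.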
> Not sure whether it's best to patch this in bin/mahout, or in the Java 
> (perhaps the former might mask issues that'll cause later confusion?)
> Perhaps only do 
>   CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
> if we're not seeing MAHOUT_LOCAL?
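The guard suggested above could look roughly like this. It is a minimal sketch and not necessarily what the attached MAHOUT-800.patch does. The variable names CLASSPATH, HADOOP_CONF_DIR and MAHOUT_LOCAL are the ones bin/mahout already uses; the function name maybe_append_hadoop_conf is made up for illustration.

```shell
# Hypothetical helper sketching the proposed guard.
# Appends HADOOP_CONF_DIR to CLASSPATH only when MAHOUT_LOCAL is empty or
# unset and HADOOP_CONF_DIR is non-empty.
maybe_append_hadoop_conf() {
  if [ -z "${MAHOUT_LOCAL:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    CLASSPATH="${CLASSPATH}:${HADOOP_CONF_DIR}"
  fi
}
```

With MAHOUT_LOCAL set to any non-empty value the classpath is left untouched, which matches the documented intent that MAHOUT_LOCAL forces local mode even if HADOOP_CONF_DIR and HADOOP_HOME are set.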

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
