[
https://issues.apache.org/jira/browse/MAHOUT-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dan Brickley updated MAHOUT-800:
--------------------------------
Attachment: MAHOUT-800.patch
The simplest thing that made the problem go away.
> bin/mahout attempts cluster mode if HADOOP_CONF_DIR is set plausibly (and
> hence appended to classpath), even with MAHOUT_LOCAL set and no HADOOP_HOME
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-800
> URL: https://issues.apache.org/jira/browse/MAHOUT-800
> Project: Mahout
> Issue Type: Bug
> Components: Examples, Integration
> Environment: OSX; java version "1.6.0_26"
> Reporter: Dan Brickley
> Priority: Minor
> Attachments: MAHOUT-800.patch
>
>
> (This began as a build-reuters.sh bug report, but the problem seemed deeper;
> please excuse the narrative format here)
> Summary: both examples/bin/build-reuters.sh and bin/mahout will attempt
> cluster mode if HADOOP_CONF_DIR env variable points at a Hadoop conf/
> directory, because bin/mahout appends it to Java's classpath. This seems to
> trigger something in Mahout Java that will try to use the cluster, without
> this being explicitly requested.
> There have been reports (Jeff Eastman, myself;
> http://mail-archives.apache.org/mod_mbox/mahout-user/201108.mbox/%3CCAFNgM+Y4twNVL_RSyNb+hGhoAu0xW917YfUTW3a5-m=z0dy...@mail.gmail.com%3E
> ) of build-reuters.sh attempting cluster mode even while claiming
> "MAHOUT_LOCAL is set, running locally" (or, in slightly different
> conditions, "no HADOOP_HOME set, running locally").
> Experimenting here with a fresh trunk install, clean ~/.m2/ on a laptop with
> a pseudo-cluster Hadoop configuration available, I find HADOOP_CONF_DIR seems
> to be the key.
> When HADOOP_CONF_DIR is set to a working value (regardless of whether cluster
> is running), and regardless of HADOOP_HOME and MAHOUT_LOCAL, build-reuters.sh
> tries to use the cluster. Aside: this is not the same as it using
> non-clustering local Hadoop, since I see errors such as "11/09/02 09:27:10
> INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000.
> Already tried 1 time(s)." unless the cluster is up. If the cluster is up and
> accessible, I'll see java.io.IOException instead, presumably since the files
> aren't there.
> If I do 'export HADOOP_CONF_DIR=' then build-reuters.sh (both kmeans and lda
> modes) runs OK without real Hadoop.
> If I retry with a bogus value for HADOOP_CONF_DIR e.g. /foo, this also seems
> fine. Only when it finds a Hadoop installation does it get confused.
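A minimal sketch of a check that separates the three cases above (empty, bogus, and working HADOOP_CONF_DIR). It assumes, and the report does not confirm this, that the trigger is a core-site.xml in the conf directory that sets fs.default.name. conf_dir_looks_live is a hypothetical helper name, not something in bin/mahout.

```shell
# Hypothetical helper: succeed only if the given directory looks like a
# usable Hadoop conf/ directory. Assumption: a core-site.xml that sets
# fs.default.name is what steers the job toward the cluster.
conf_dir_looks_live() {
  [ -n "$1" ] || return 1                      # empty value: nothing to pick up
  [ -f "$1/core-site.xml" ] || return 1        # bogus path such as /foo
  grep -qF 'fs.default.name' "$1/core-site.xml"
}

if conf_dir_looks_live "${HADOOP_CONF_DIR:-}"; then
  echo "HADOOP_CONF_DIR looks like a live Hadoop config; cluster mode is likely"
else
  echo "HADOOP_CONF_DIR is empty or not a Hadoop conf dir; expect a local run"
fi
```

With HADOOP_CONF_DIR empty or set to /foo the check fails, which matches the runs that stayed local.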
> Minimally I'd consider this a documentation issue. Nothing in the
> build-reuters.sh script mentions the role of HADOOP_CONF_DIR. Reading
> build-reuters.sh I get the impression both clustered and local modes are
> possible; however, mailing list discussion leaves me unsure whether clustered
> mode is still supposed to work in trunk.
> Tests: (with no HADOOP_HOME set)
> Running these extracts from build-reuters.sh in examples/bin/ after having
> previously run build-reuters.sh to fetch data...
> #this one runs OK
> MAHOUT_LOCAL=true HADOOP_CONF_DIR=/foo ../../bin/mahout seqdirectory \
> -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir -c UTF-8 \
> -chunk 5
> # this fails (assuming there's a Hadoop there) by attempting clustered mode:
> # 'Call to localhost/127.0.0.1:9000 failed...'
> MAHOUT_LOCAL=true HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf \
> ../../bin/mahout seqdirectory \
> -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir -c UTF-8 \
> -chunk 5
> Same thing with seq2sparse
> #fails, localhost:9000
> HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf MAHOUT_LOCAL=true \
> ../../bin/mahout seq2sparse \
> -i mahout-work/reuters-out-seqdir/ \
> -o mahout-work/reuters-out-seqdir-sparse-kmeans
> # runs locally just fine (because of bad hadoop conf path)
> HADOOP_CONF_DIR=$HOME/bad/path/working/hadoop/hadoop-0.20.2/conf \
> MAHOUT_LOCAL=true ../../bin/mahout seq2sparse \
> -i mahout-work/reuters-out-seqdir/ \
> -o mahout-work/reuters-out-seqdir-sparse-kmeans
> I get the same behaviour from '../../bin/mahout kmeans' too, so the problem seems
> general, not driver-specific.
> All this seems to contradict the notes in ../../bin/mahout, i.e.
> # MAHOUT_LOCAL set to anything other than an empty string to force
> # mahout to run locally even if
> # HADOOP_CONF_DIR and HADOOP_HOME are set
> Digging into bin/mahout it seems the accidental clustering happens deeper
> into java-land, not in the .sh; it's not invoking hadoop directly there. We
> get this far:
> exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
> I compared the Java commandlines generated by successful vs
> accidentally-cluster-invoking runs of bin/mahout ...it seems the only
> difference is whether a hadoop conf directory is on the classpath that's
> passed to Java.
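The comparison described above can be sketched as follows: split two -classpath values on ':' and print the entries that appear only in the second. classpath_only_in_second is a hypothetical helper name, and the two sample classpaths are made up for illustration, not copied from an actual run.

```shell
# Hypothetical helper: print each entry of classpath $2 that is absent
# from classpath $1, one per line.
classpath_only_in_second() {
  printf '%s\n' "$2" | tr ':' '\n' | while IFS= read -r entry; do
    case ":$1:" in
      *":$entry:"*) ;;                 # entry is also in the first classpath
      *) printf '%s\n' "$entry" ;;     # entry is unique to the second
    esac
  done
}

# Illustrative values only; substitute the classpaths from two real runs.
good_cp="lib/a.jar:lib/b.jar"
bad_cp="lib/a.jar:lib/b.jar:/path/to/hadoop/conf"
classpath_only_in_second "$good_cp" "$bad_cp"
```

If the only line printed is the Hadoop conf directory, that supports the observation that the conf directory on the classpath is the sole difference.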
> If I blank out with 'HADOOP_CONF_DIR=', and 'HADOOP_HOME=' and then run
> MAHOUT_LOCAL=true ../../bin/mahout kmeans \
> -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
> -c mahout-work/reuters-kmeans-clusters \
> -o mahout-work/reuters-kmeans \
> -x 10 -k 20 -ow
> ...against an edited version of bin/mahout that appends a hadoop conf dir to
> the classpath, i.e.
> exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath \
> "$CLASSPATH:/Users/danbri/working/hadoop/hadoop-0.20.2/conf" $CLASS "$@"
> This is enough to get "Exception in thread "main" java.io.IOException: Call
> to localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException"
> (...and if I remove the /conf path from classpath, we're back to expected
> behaviours).
> Not sure whether it's best to patch this in bin/mahout, or in the Java
> (perhaps the former might mask issues that'll cause later confusion?)
> Perhaps only do
> CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
> if we're not seeing MAHOUT_LOCAL?
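A minimal sketch of the guard suggested above, under the assumption that skipping the append is enough to keep Java out of cluster mode. build_classpath is a hypothetical helper for illustration; it is not the contents of the attached patch, and bin/mahout may structure this differently.

```shell
# Hypothetical helper: return the classpath given as $1, with
# HADOOP_CONF_DIR appended only when MAHOUT_LOCAL is unset or empty.
build_classpath() {
  cp_result="$1"
  # bin/mahout documents any non-empty MAHOUT_LOCAL as forcing local mode.
  if [ -z "${MAHOUT_LOCAL:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    cp_result="${cp_result}:${HADOOP_CONF_DIR}"
  fi
  printf '%s\n' "$cp_result"
}

# Example: with MAHOUT_LOCAL set, the conf dir is left off the classpath.
MAHOUT_LOCAL=true HADOOP_CONF_DIR=/path/to/hadoop/conf build_classpath "lib/mahout.jar"
```

Whether the fix belongs here or in the Java code is still the open question raised above; this only shows the shell-side option.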