bin/mahout attempts cluster mode if HADOOP_CONF_DIR is set plausibly (and
hence appended to classpath), even with MAHOUT_LOCAL set and no HADOOP_HOME
------------------------------------------------------------------------------------------------------------------------------------------------------
Key: MAHOUT-800
URL: https://issues.apache.org/jira/browse/MAHOUT-800
Project: Mahout
Issue Type: Bug
Components: Examples, Integration
Environment: OSX; java version "1.6.0_26"
Reporter: Dan Brickley
Priority: Minor
(This began as a build-reuters.sh bug report, but the problem seemed deeper;
please excuse the narrative format here)
Summary: both examples/bin/build-reuters.sh and bin/mahout will attempt cluster
mode if the HADOOP_CONF_DIR env variable points at a Hadoop conf/ directory,
because bin/mahout appends it to Java's classpath. This seems to trigger
something in Mahout's Java code that tries to use the cluster, without this
being explicitly requested.
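My working theory, an assumption rather than something I've traced through the
Mahout source: Hadoop's Configuration loads core-site.xml and friends from
whatever conf directory appears on the classpath, so fs.default.name silently
flips from file:/// to hdfs://localhost:9000. A rough way to check, assuming
Configuration's debugging main() is still around in 0.20.2 (jar names below
are from my local layout and may differ):
# dump the config Hadoop resolves from the classpath; with the conf dir
# appended I'd expect fs.default.name to be hdfs://localhost:9000, not the
# file:/// default (output is XML, possibly as one long line)
java -classpath "$HADOOP_HOME/hadoop-0.20.2-core.jar:$HADOOP_HOME/lib/commons-logging-1.0.4.jar:$HADOOP_CONF_DIR" \
  org.apache.hadoop.conf.Configuration | grep fs.default.name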
There have been reports (Jeff Eastman, myself;
http://mail-archives.apache.org/mod_mbox/mahout-user/201108.mbox/%3CCAFNgM+Y4twNVL_RSyNb+hGhoAu0xW917YfUTW3a5-m=z0dy...@mail.gmail.com%3E
) of build-reuters.sh attempting cluster mode even while claiming
"MAHOUT_LOCAL is set, running locally" (or, under slightly different
conditions, "no HADOOP_HOME set, running locally").
Experimenting here with a fresh trunk install and a clean ~/.m2/, on a laptop
with a pseudo-cluster Hadoop configuration available, I find HADOOP_CONF_DIR
seems to be the key.
When HADOOP_CONF_DIR is set to a working value (regardless of whether the
cluster is running), and regardless of HADOOP_HOME and MAHOUT_LOCAL,
build-reuters.sh tries to use the cluster. Aside: this is not the same as it
using non-clustering local Hadoop, since unless the cluster is up I see errors
such as "11/09/02 09:27:10 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 1 time(s).". If the cluster is up and
accessible, I see java.io.IOException instead, presumably because the files
aren't there.
If I do 'export HADOOP_CONF_DIR=' then build-reuters.sh (both kmeans and lda
modes) runs OK without real Hadoop.
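So the practical workaround, for anyone hitting the same thing, is just to
blank the variable before running (a sketch; run from examples/bin/, adding
whichever mode argument you normally pass):
# blanking HADOOP_CONF_DIR keeps everything genuinely local
export HADOOP_CONF_DIR=
export MAHOUT_LOCAL=true
./build-reuters.sh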
If I retry with a bogus value for HADOOP_CONF_DIR, e.g. /foo, this also seems
fine. Only when it points at a real Hadoop conf directory does it get confused.
Minimally I'd consider this a documentation issue: nothing in the
build-reuters.sh script mentions the role of HADOOP_CONF_DIR. Reading
build-reuters.sh I get the impression both clustered and local modes are
possible; however, mailing list discussion leaves me unsure whether clustered
mode is still supposed to work in trunk.
Tests (with no HADOOP_HOME set): running these extracts from build-reuters.sh
in examples/bin/, after having previously run build-reuters.sh to fetch the
data...
# this one runs OK
MAHOUT_LOCAL=true HADOOP_CONF_DIR=/foo ../../bin/mahout seqdirectory \
  -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir \
  -c UTF-8 -chunk 5
# this fails (assuming there's a Hadoop there) by attempting clustered mode:
# 'Call to localhost/127.0.0.1:9000 failed...'
MAHOUT_LOCAL=true HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf \
  ../../bin/mahout seqdirectory \
  -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir \
  -c UTF-8 -chunk 5
Same thing with seq2sparse:
# fails, localhost:9000
HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf MAHOUT_LOCAL=true \
  ../../bin/mahout seq2sparse \
  -i mahout-work/reuters-out-seqdir/ \
  -o mahout-work/reuters-out-seqdir-sparse-kmeans
# runs locally just fine (because of the bad hadoop conf path)
HADOOP_CONF_DIR=$HOME/bad/path/working/hadoop/hadoop-0.20.2/conf \
  MAHOUT_LOCAL=true ../../bin/mahout seq2sparse \
  -i mahout-work/reuters-out-seqdir/ \
  -o mahout-work/reuters-out-seqdir-sparse-kmeans
I get the same behaviour from '../../bin/mahout kmeans' too, so the problem
seems general rather than driver-specific.
All this seems to contradict the notes in ../../bin/mahout, i.e.
# MAHOUT_LOCAL set to anything other than an empty string to force
# mahout to run locally even if
# HADOOP_CONF_DIR and HADOOP_HOME are set
Digging into bin/mahout, it seems the accidental clustering happens deeper in
java-land, not in the .sh itself; the script isn't invoking hadoop directly.
We get this far:
exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
I compared the Java command lines generated by successful
vs. accidentally-cluster-invoking runs of bin/mahout; it seems the only
difference is whether a hadoop conf directory is on the classpath that's
passed to Java.
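(For anyone wanting to reproduce that comparison: one crude way is to
temporarily turn the exec above into an echo, so bin/mahout prints the
generated command line instead of running it:
# hypothetical one-line debugging edit to bin/mahout: print, don't run
echo "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
The two runs can then be diffed directly.)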
If I blank both out with 'HADOOP_CONF_DIR=' and 'HADOOP_HOME=', and then run
MAHOUT_LOCAL=true ../../bin/mahout kmeans \
-i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
-c mahout-work/reuters-kmeans-clusters \
-o mahout-work/reuters-kmeans \
-x 10 -k 20 -ow
...against an edited version of bin/mahout that appends a hadoop conf dir to
the classpath, i.e.
exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath \
  "$CLASSPATH:/Users/danbri/working/hadoop/hadoop-0.20.2/conf" $CLASS "$@"
This is enough to get "Exception in thread "main" java.io.IOException: Call to
localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException"
(...and if I remove the /conf path from the classpath, we're back to expected
behaviour).
Not sure whether it's best to patch this in bin/mahout or in the Java
(perhaps the former might mask issues that'll cause confusion later?).
Perhaps only do
CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
if MAHOUT_LOCAL isn't set?
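A minimal sketch of that guard, assuming the script currently appends
HADOOP_CONF_DIR unconditionally (variable names as in bin/mahout):
# only put the Hadoop conf dir on the classpath when MAHOUT_LOCAL is unset/empty
if [ -z "$MAHOUT_LOCAL" ] && [ -n "$HADOOP_CONF_DIR" ]; then
  CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
fi
That would at least make the script's own comment, quoted above, match what
actually reaches the JVM.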