[ 
https://issues.apache.org/jira/browse/MAHOUT-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Brickley updated MAHOUT-800:
--------------------------------

    Description: 
(This began as a build-reuters.sh bug report, but the problem seemed deeper; 
please excuse the narrative format here)

Summary: both examples/bin/build-reuters.sh and bin/mahout will attempt cluster 
mode if HADOOP_CONF_DIR env variable points at a Hadoop conf/ directory, 
because bin/mahout appends it to Java's classpath. This seems to trigger 
something in the Mahout Java code that tries to use the cluster, without this
being explicitly requested.

There have been reports (Jeff Eastman, myself; 
http://mail-archives.apache.org/mod_mbox/mahout-user/201108.mbox/%3CCAFNgM+Y4twNVL_RSyNb+hGhoAu0xW917YfUTW3a5-m=z0dy...@mail.gmail.com%3E
 ) of build-reuters.sh attempting cluster mode, even while claiming 
"MAHOUT_LOCAL is set, running locally" (or, in slightly different conditions, 
"no HADOOP_HOME set, running locally").

Experimenting here with a fresh trunk install and a clean ~/.m2/, on a laptop 
with a pseudo-cluster Hadoop configuration available, I find HADOOP_CONF_DIR 
seems to be the key.

When HADOOP_CONF_DIR is set to a working value (regardless of whether cluster 
is running), and regardless of HADOOP_HOME and MAHOUT_LOCAL, build-reuters.sh 
tries to use the cluster. Aside: this is not the same as it using 
non-clustering local Hadoop, since I see errors such as "11/09/02 09:27:10 INFO 
ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 
1 time(s)." unless the cluster is up. If the cluster is up and accessible, I'll 
see java.io.IOException instead, presumably since the files aren't there.

If I do 'export HADOOP_CONF_DIR=' then build-reuters.sh (both kmeans and lda 
modes) runs OK without real Hadoop.

If I retry with a bogus value for HADOOP_CONF_DIR e.g. /foo, this also seems 
fine. Only when it finds a Hadoop installation does it get confused.
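Presumably (my assumption, not verified against the Hadoop sources) this is 
because a working conf dir contains a core-site.xml that Hadoop picks up from 
the classpath; a pseudo-cluster setup would have something like:

```xml
<!-- Hypothetical pseudo-cluster core-site.xml; the hdfs://localhost:9000
     value matches the connection errors above. With this on the classpath,
     Hadoop resolves the default filesystem to HDFS instead of file:/// -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

That would explain why a bogus HADOOP_CONF_DIR is harmless: there's no 
core-site.xml to find there.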

Minimally I'd consider this a documentation issue: nothing in the 
build-reuters.sh script mentions the role of HADOOP_CONF_DIR. Reading 
build-reuters.sh I get the impression both clustered and local modes are 
possible; however, mailing list discussion leaves me unsure whether clustered 
mode is still supposed to work in trunk.


Tests: (with no HADOOP_HOME set)

Running these extracts from build-reuters.sh in examples/bin/ after having 
previously run build-reuters.sh to fetch data...

# this one runs OK
MAHOUT_LOCAL=true HADOOP_CONF_DIR=/foo ../../bin/mahout seqdirectory \
        -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir \
        -c UTF-8 -chunk 5

# this fails (assuming there's a Hadoop there) by attempting clustered mode:
# 'Call to localhost/127.0.0.1:9000 failed...'
MAHOUT_LOCAL=true HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf \
        ../../bin/mahout seqdirectory \
        -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir \
        -c UTF-8 -chunk 5


Same thing with seq2sparse:

# fails, localhost:9000
HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf MAHOUT_LOCAL=true \
    ../../bin/mahout seq2sparse \
    -i mahout-work/reuters-out-seqdir/ \
    -o mahout-work/reuters-out-seqdir-sparse-kmeans

# runs locally just fine (because of the bad hadoop conf path)
HADOOP_CONF_DIR=$HOME/bad/path/working/hadoop/hadoop-0.20.2/conf \
    MAHOUT_LOCAL=true ../../bin/mahout seq2sparse \
    -i mahout-work/reuters-out-seqdir/ \
    -o mahout-work/reuters-out-seqdir-sparse-kmeans

I get the same behaviour from '../../bin/mahout kmeans' too, so the problem 
seems general, not driver-specific.

All this seems to contradict the notes in ../../bin/mahout, i.e.

#   MAHOUT_LOCAL       set to anything other than an empty string to force
#                      mahout to run locally even if
#                      HADOOP_CONF_DIR and HADOOP_HOME are set



Digging into bin/mahout, it seems the accidental clustering happens deeper 
down in java-land, not in the .sh; the script isn't invoking hadoop directly. 
We get as far as:

  exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"

I compared the Java command lines generated by successful vs. 
accidentally-cluster-invoking runs of bin/mahout; it seems the only 
difference is whether a hadoop conf directory is on the classpath that's 
passed to Java.

If I blank these out with 'HADOOP_CONF_DIR=' and 'HADOOP_HOME=', and then run

MAHOUT_LOCAL=true ../../bin/mahout kmeans \
    -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
    -c mahout-work/reuters-kmeans-clusters \
    -o mahout-work/reuters-kmeans \
    -x 10 -k 20 -ow

...against an edited version of bin/mahout that appends a hadoop conf dir to 
the classpath, i.e.

  exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath \
    "$CLASSPATH:/Users/danbri/working/hadoop/hadoop-0.20.2/conf" $CLASS "$@"

This is enough to get "Exception in thread "main" java.io.IOException: Call to 
localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException"

(...and if I remove the /conf path from classpath, we're back to expected 
behaviours).

Not sure whether it's best to patch this in bin/mahout, or in the Java (perhaps 
the former might mask issues that'll cause later confusion?)

Perhaps only do 

  CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR

if we're not seeing MAHOUT_LOCAL?
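One way to sketch that guard (hypothetical shape; the variable names mirror 
the ones bin/mahout already uses, wrapped in a function here only so it can 
be exercised standalone):

```shell
# Proposed guard, as a sketch: only append HADOOP_CONF_DIR to the classpath
# when MAHOUT_LOCAL is empty/unset, so MAHOUT_LOCAL=true really stays local.
append_hadoop_conf() {
  if [ -z "$MAHOUT_LOCAL" ] && [ -n "$HADOOP_CONF_DIR" ]; then
    CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
  fi
}
```

In bin/mahout itself this would just be the if-block wrapped around the 
existing CLASSPATH line.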


> bin/mahout attempts cluster mode  if HADOOP_CONF_DIR is set plausibly (and 
> hence appended to classpath), even with MAHOUT_LOCAL set and no HADOOP_HOME
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-800
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-800
>             Project: Mahout
>          Issue Type: Bug
>          Components: Examples, Integration
>         Environment: OSX; java version "1.6.0_26"
>            Reporter: Dan Brickley
>            Priority: Minor
>         Attachments: MAHOUT-800.patch
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
