[ 
https://issues.apache.org/jira/browse/MAHOUT-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240031#comment-13240031
 ] 

Roman Shaposhnik commented on MAHOUT-994:
-----------------------------------------

@Dmitriy

bq. I am not sure we use hadoop executable to launch Mahout stuff. 

Seems like you do: 
{noformat}
    (*)
      echo "MAHOUT-JOB: $MAHOUT_JOB"
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
      exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar $MAHOUT_JOB $CLASS "$@"
{noformat}

bq. I mean, we are not bound by having to launch any MR stuff.

Understood. Pig also has a local mode in which it does not need any access to
Hadoop at all.

bq. I think it's a little bit disconcerting as we have a lot of utility classes 
that may be using hadoop classes even locally (e.g. reading or writing local 
sequence files). Can you suggest a strategy for Mahout local mode and still be 
able to bind to hadoop I/O classes?

I believe there are two issues at play here:
  # what to do for the local mode
  # what to do for the MapReduce mode

In both cases you'll have to segregate the bundled Hadoop dependencies into
a separate location in your installation tree so that you can easily add
them to or remove them from your CLASSPATH. Let's call it $HADOOP_DEPS_DIR
(it must be different from your usual lib subdir).

1. Local mode:

Whether you're running in local mode will be governed solely by $MAHOUT_LOCAL
(regardless of whether HADOOP_HOME is empty or not).

When $MAHOUT_LOCAL is set, you'll check whether you can run 'hadoop classpath'
(see below for how HADOOP_HOME/HADOOP_PREFIX factors into this). If you can,
you'll add the value it returns to the CLASSPATH. If it is not available,
you'll add $HADOOP_DEPS_DIR to the CLASSPATH.
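
A minimal sketch of the local-mode logic described above, to make the fallback order concrete. The function name resolve_local_classpath is hypothetical and not part of the existing mahout script, and appending /* to $HADOOP_DEPS_DIR assumes that directory holds jar files:

```shell
#!/bin/sh
# Sketch only: resolve_local_classpath is a hypothetical helper.
# $1 = path to the hadoop launcher (may be empty)
# $2 = directory holding the bundled Hadoop dependencies
resolve_local_classpath() {
  launcher=$1
  deps_dir=$2
  if [ -n "$launcher" ] && [ -x "$launcher" ] \
     && hadoop_cp=$("$launcher" classpath 2>/dev/null) \
     && [ -n "$hadoop_cp" ]; then
    # A working Hadoop installation exists: use the classpath it reports.
    printf '%s\n' "$hadoop_cp"
  else
    # No usable launcher: fall back to the bundled dependencies.
    # Assumption: the directory contains jars, hence the Java wildcard.
    printf '%s\n' "$deps_dir/*"
  fi
}

# Only $MAHOUT_LOCAL decides whether local mode is in effect.
if [ -n "${MAHOUT_LOCAL:-}" ]; then
  CLASSPATH="${CLASSPATH:+$CLASSPATH:}$(resolve_local_classpath "${HADOOP_LAUNCHER:-}" "${HADOOP_DEPS_DIR:-}")"
  export CLASSPATH
fi
```

Keeping the decision in one function means the bundled jars never end up on the CLASSPATH next to the jars of a real Hadoop installation.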

2. MapReduce mode:

For the MapReduce mode, I would argue that the best option is to follow the
lead of Pig/Hive/HBase, etc. and rely on HADOOP_HOME *only* as a fallback value
if hadoop is not resolved from the PATH by default. E.g.:
{noformat}
HADOOP_LAUNCHER=$(PATH="${HADOOP_HOME:-${HADOOP_PREFIX}}/bin:$PATH" which hadoop 2>/dev/null)
{noformat}

Then you'll need to replace all the explicit calls to $HADOOP_HOME/bin/... with
$HADOOP_LAUNCHER. I would also strongly recommend against guessing the value
of HADOOP_CONF_DIR, since I don't think it buys you anything at all (the hadoop
launcher script, regardless of where it comes from, honors HADOOP_CONF_DIR just
as it does --config).

That will take care of the MapReduce case.
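
For illustration, here is one way the launcher lookup could be written as a function. This is a sketch under assumptions, not the proposed patch: find_hadoop_launcher is a hypothetical name, and it uses the POSIX `command -v` instead of `which`. Note that it consults the PATH first and only then falls back to HADOOP_HOME/HADOOP_PREFIX, whereas the one-liner above searches the HADOOP_HOME bin directory before the rest of the PATH:

```shell
#!/bin/sh
# Sketch only: find_hadoop_launcher is a hypothetical helper.
# Prints the path of the hadoop launcher, or returns 1 if none is found.
find_hadoop_launcher() {
  fallback_home=${HADOOP_HOME:-${HADOOP_PREFIX:-}}
  if command -v hadoop >/dev/null 2>&1; then
    # hadoop is already resolvable from the PATH: prefer it.
    command -v hadoop
  elif [ -n "$fallback_home" ] && [ -x "$fallback_home/bin/hadoop" ]; then
    # Otherwise fall back to HADOOP_HOME (or HADOOP_PREFIX).
    printf '%s\n' "$fallback_home/bin/hadoop"
  else
    return 1
  fi
}

# Intended use inside the launcher script (left commented out here):
#   HADOOP_LAUNCHER=$(find_hadoop_launcher) || { echo "hadoop not found" >&2; exit 1; }
#   exec "$HADOOP_LAUNCHER" jar "$MAHOUT_JOB" "$CLASS" "$@"
```

No --config argument is passed in the usage lines, on the reasoning given above that the launcher already honors HADOOP_CONF_DIR from the environment.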

If you agree with the proposal I can supply a patch shortly.
                
> mahout script shouldn't rely on HADOOP_HOME since that was deprecated in all 
> major Hadoop branches
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-994
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-994
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>    Affects Versions: 0.6
>            Reporter: Roman Shaposhnik
>
> Mahout should follow the Pig and Hive example and not rely explicitly on 
> HADOOP_HOME and HADOOP_CONF_DIR


        
