[
https://issues.apache.org/jira/browse/MAHOUT-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240031#comment-13240031
]
Roman Shaposhnik commented on MAHOUT-994:
-----------------------------------------
@Dmitriy
bq. I am not sure we use hadoop executable to launch Mahout stuff.
Seems like you do:
{noformat}
(*)
echo "MAHOUT-JOB: $MAHOUT_JOB"
export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar $MAHOUT_JOB $CLASS "$@"
{noformat}
bq. I mean, we are not bound by having to launch any MR stuff.
Understood. Pig also has a local mode in which it doesn't need any access to
Hadoop at all.
bq. I think it's a little bit disconcerting as we have a lot of utility classes
that may be using hadoop classes even locally (e.g. reading or writing local
sequence files). Can you suggest a strategy for Mahout local mode and still be
able to bind to hadoop I/O classes?
I believe there are two issues at play here:
# what to do for the local mode
# what to do for the MapReduce mode
For both cases you'll have to make sure that you segregate the bundled hadoop
dependencies into a separate location in your installation tree so that you can
easily add/remove them from your CLASSPATH. Let's call it $HADOOP_DEPS_DIR
(it must be different from your usual lib subdir).
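Just to illustrate the kind of layout I have in mind (directory names are purely
illustrative, not a prescription):
{noformat}
# hypothetical installation tree -- names are only an example
$MAHOUT_HOME/lib/           <- Mahout's own jars
$MAHOUT_HOME/lib/hadoop/    <- bundled Hadoop jars, kept out of the default CLASSPATH

# in bin/mahout
HADOOP_DEPS_DIR=${MAHOUT_HOME}/lib/hadoop
{noformat}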
1. Local mode:
Whether you're running in local mode will be governed by $MAHOUT_LOCAL
(regardless of whether HADOOP_HOME is empty or not).
When $MAHOUT_LOCAL is set, you'll try to see whether you can run 'hadoop classpath'
(see how HADOOP_HOME/HADOOP_PREFIX factors into this below). If you can, you'll add
the value it returns to the CLASSPATH. If it is not available, you'll add
$HADOOP_DEPS_DIR to the CLASSPATH.
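A minimal sketch of that logic (just a sketch, not the actual bin/mahout code;
$HADOOP_LAUNCHER is the resolved hadoop executable described under 2. below):
{noformat}
# sketch only
if [ -n "$MAHOUT_LOCAL" ]; then
  if [ -n "$HADOOP_LAUNCHER" ] && HADOOP_CP=$("$HADOOP_LAUNCHER" classpath 2>/dev/null); then
    # 'hadoop classpath' worked: use whatever it reports
    CLASSPATH="${CLASSPATH}:${HADOOP_CP}"
  else
    # no usable hadoop around: fall back to the Hadoop jars bundled with Mahout
    CLASSPATH="${CLASSPATH}:${HADOOP_DEPS_DIR}/*"
  fi
fi
{noformat}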
2. MapReduce mode
For the MapReduce mode, I would argue that the best option is to follow suit with
Pig/Hive/HBase, etc. and rely on HADOOP_HOME *only* as a fallback value if hadoop
is not resolved from the PATH by default. E.g.
{noformat}
HADOOP_LAUNCHER=$(PATH="${HADOOP_HOME:-${HADOOP_PREFIX}}/bin:$PATH" which hadoop 2>/dev/null)
{noformat}
Then you'll need to replace all the explicit calls to $HADOOP_HOME/bin/... with
$HADOOP_LAUNCHER. I would also strongly recommend against guessing the value of
HADOOP_CONF_DIR, since I don't think it buys you anything at all (the hadoop
launcher script, regardless of where it is coming from, honors HADOOP_CONF_DIR
just as much as it honors --config).
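For example, the launch line quoted at the top could then look something like
this (again, just a sketch, and --config is only passed when the user actually
set HADOOP_CONF_DIR):
{noformat}
# sketch -- not the final patch
if [ -z "$HADOOP_LAUNCHER" ]; then
  echo "hadoop executable not found; set HADOOP_HOME/HADOOP_PREFIX or adjust PATH" >&2
  exit 1
fi
export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
if [ -n "$HADOOP_CONF_DIR" ]; then
  exec "$HADOOP_LAUNCHER" --config "$HADOOP_CONF_DIR" jar "$MAHOUT_JOB" $CLASS "$@"
else
  exec "$HADOOP_LAUNCHER" jar "$MAHOUT_JOB" $CLASS "$@"
fi
{noformat}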
That will take care of the MapReduce case.
If you agree with the proposal I can supply a patch shortly.
> mahout script shouldn't rely on HADOOP_HOME since that was deprecated in all
> major Hadoop branches
> --------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-994
> URL: https://issues.apache.org/jira/browse/MAHOUT-994
> Project: Mahout
> Issue Type: Bug
> Components: Integration
> Affects Versions: 0.6
> Reporter: Roman Shaposhnik
>
> Mahout should follow the Pig and Hive example and not rely explicitly on
> HADOOP_HOME and HADOOP_CONF_DIR