hadoop job config parameter, e.g., -Dmapred.cache.archives, support in mahout wrapper
------------------------------------------------------------------------------------
Key: MAHOUT-573
URL: https://issues.apache.org/jira/browse/MAHOUT-573
Project: Mahout
Issue Type: Improvement
Components: Utils
Affects Versions: 0.4
Environment: fedora 14 running on VirtualBox for Windows
Windows Vespa
Reporter: Shige Takeda
Priority: Minor
To run seq2sparse with a custom analyzer that wraps the Japanese morphological
analyzer "Igo", which reads its dictionary files from HDFS, I needed to pass
the following job config:
mapred.cache.archives="hdfs://localhost:9000/user/stakeda/ipadic.zip#ipadic"
mapred.create.symlink=yes
This way, the IgoAnalyzer can read dictionaries from "./ipadic" as follows:
https://github.com/smtakeda/mahout/blob/project101210/examples/src/main/java/org/apache/mahout/analysis/IgoAnalyzer.java
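For illustration, the two settings above can be written as Hadoop-style -D options. The following is a minimal sketch that only formats and prints the options; the helper name hadoop_property is hypothetical, and nothing here is part of Mahout or Hadoop. The "#ipadic" fragment is the link name under which the unpacked archive appears in the task's working directory, which is why the analyzer can read "./ipadic".

```shell
# Hypothetical helper: format one Hadoop property as a -D option.
hadoop_property() {
  printf '%s\n' "-D$1=$2"
}

# Values are taken from the job config quoted above.
# Single quotes keep the '#' fragment from being treated as a comment.
hadoop_property mapred.cache.archives 'hdfs://localhost:9000/user/stakeda/ipadic.zip#ipadic'
hadoop_property mapred.create.symlink yes
```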
Another use case: I needed to set mapred.job.queue.name so that jobs run with
the appropriate priority in my work environment:
https://github.com/smtakeda/mahout/blob/yahoo/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java
...
conf.set("mapred.job.queue.name", "unfunded");
Based on these two use cases, I would like to propose adding support for
Hadoop job options, e.g., -Dmapred.cache.archives=..., to the mahout wrapper.
Changes are expected in roughly two places: "bin/mahout" and all main functions
that parse command lines. Here is a quick patch for "bin/mahout":
localhost ~/workspace/mahout_git/bin: git diff -r f13e517408f20f75009e05e6c72c5fbb836e3f66 mahout
diff --git a/bin/mahout b/bin/mahout
index 774fa11..9d78ceb 100755
--- a/bin/mahout
+++ b/bin/mahout
@@ -116,6 +116,14 @@ CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
# so that filenames w/ spaces are handled correctly in loops below
IFS=
+
+# JAVA_PROPERTIES
+JAVA_PROPERTIES=
+while [ $1 ] && [ ${1:0:2} == "-D" ] ; do
+ JAVA_PROPERTIES="$1 $JAVA_PROPERTIES"
+ shift
+done
+
if [ $IS_CORE == 0 ]
then
# add release dependencies to CLASSPATH
@@ -198,7 +206,7 @@ if [ "$HADOOP_HOME" = "" ] || [ "$MAHOUT_LOCAL" != "" ] ; then
elif [ "$MAHOUT_LOCAL" != "" ] ; then
echo "MAHOUT_LOCAL is set, running locally"
fi
- exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
+ exec "$JAVA" $JAVA_HEAP_MAX $JAVA_PROPERTIES $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
else
echo "Running on hadoop, using HADOOP_HOME=$HADOOP_HOME"
if [ "$HADOOP_CONF_DIR" = "" ] ; then
@@ -213,7 +221,7 @@ else
exit 1
else
export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
- exec "$HADOOP_HOME/bin/hadoop" jar $MAHOUT_JOB $CLASS "$@"
+ exec "$HADOOP_HOME/bin/hadoop" jar $MAHOUT_JOB $CLASS "$@" $JAVA_PROPERTIES
fi
fi
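
The argument handling in the patch can also be sketched as a standalone, POSIX-compatible script. This is an illustrative sketch under stated assumptions, not the submitted patch: the function names leading_java_properties and count_java_properties are hypothetical, and unlike the patch (which prepends each option and so reverses their order) it keeps the -D options in the order given and quotes every expansion.

```shell
# Print each leading -D option on its own line, in the order given.
# Stops at the first argument that does not start with -D.
leading_java_properties() {
  for arg in "$@"; do
    case "$arg" in
      -D*) printf '%s\n' "$arg" ;;
      *)   break ;;
    esac
  done
}

# Print how many leading -D options there are, so that a caller
# can shift past them before reading the program name.
count_java_properties() {
  n=0
  for arg in "$@"; do
    case "$arg" in
      -D*) n=$((n + 1)) ;;
      *)   break ;;
    esac
  done
  printf '%s\n' "$n"
}

# Example: two properties followed by a program name and its options.
set -- -Dmapred.job.queue.name=unfunded -Dmapred.create.symlink=yes seq2sparse -i in -o out
props=$(leading_java_properties "$@")
shift "$(count_java_properties "$@")"
printf 'properties:\n%s\n' "$props"
printf 'remaining: %s\n' "$*"
```

Run as-is, the example lists the two properties and then reports "seq2sparse -i in -o out" as the remaining arguments. On the Hadoop side, -D options are interpreted only by drivers that parse generic options, for example through ToolRunner and GenericOptionsParser, which fits the note above that the main functions need changes as well.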