hadoop job config parameter,e.g., -Dmapred.cache.archives, support in mahout 
wrapper
------------------------------------------------------------------------------------

                 Key: MAHOUT-573
                 URL: https://issues.apache.org/jira/browse/MAHOUT-573
             Project: Mahout
          Issue Type: Improvement
          Components: Utils
    Affects Versions: 0.4
         Environment: fedora 14 running on VirtualBox for Windows
Windows Vespa
            Reporter: Shige Takeda
            Priority: Minor


To run seq2sparse with a custom analyzer that wraps the Japanese morphological 
analyzer "Igo" and reads its dictionary files from HDFS, I needed to pass the 
following job configuration:

mapred.cache.archives=hdfs://localhost:9000/user/stakeda/ipadic.zip#ipadic
mapred.create.symlink=yes

This way, the IgoAnalyzer can read dictionaries from "./ipadic" as follows:
https://github.com/smtakeda/mahout/blob/project101210/examples/src/main/java/org/apache/mahout/analysis/IgoAnalyzer.java

Another use case: I needed to set mapred.job.queue.name so that jobs get the 
appropriate priority in my work environment:
https://github.com/smtakeda/mahout/blob/yahoo/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java
...
conf.set("mapred.job.queue.name", "unfunded"); 

Based on these two use cases, I would like to propose adding support for Hadoop 
job options, e.g., -Dmapred.cache.archives=..., to the mahout wrapper.

Changes are expected in roughly two places: "bin/mahout" and every main function 
that parses the command line. Here is a quick patch for "bin/mahout":

localhost ~/workspace/mahout_git/bin: git diff -r f13e517408f20f75009e05e6c72c5fbb836e3f66 mahout 
diff --git a/bin/mahout b/bin/mahout
index 774fa11..9d78ceb 100755
--- a/bin/mahout
+++ b/bin/mahout
@@ -116,6 +116,14 @@ CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
 # so that filenames w/ spaces are handled correctly in loops below
 IFS=
 
+
+# JAVA_PROPERTIES
+JAVA_PROPERTIES=
+while [ $1 ] && [ ${1:0:2} == "-D" ] ; do 
+    JAVA_PROPERTIES="$1 $JAVA_PROPERTIES"
+    shift
+done
+
 if [ $IS_CORE == 0 ] 
 then
   # add release dependencies to CLASSPATH
@@ -198,7 +206,7 @@ if [ "$HADOOP_HOME" = "" ] || [ "$MAHOUT_LOCAL" != "" ] ; then
   elif [ "$MAHOUT_LOCAL" != "" ] ; then 
     echo "MAHOUT_LOCAL is set, running locally"
   fi
-  exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
+  exec "$JAVA" $JAVA_HEAP_MAX $JAVA_PROPERTIES $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
 else
   echo "Running on hadoop, using HADOOP_HOME=$HADOOP_HOME"
   if [ "$HADOOP_CONF_DIR" = "" ] ; then
@@ -213,7 +221,7 @@ else
     exit 1
   else
   export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
-  exec "$HADOOP_HOME/bin/hadoop" jar $MAHOUT_JOB $CLASS "$@"
+  exec "$HADOOP_HOME/bin/hadoop" jar $MAHOUT_JOB $CLASS "$@" $JAVA_PROPERTIES
   fi 
 fi
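
The option-collecting loop in the patch can also be tried on its own, outside 
bin/mahout. The following is only a minimal sketch under a few assumptions: the 
function name collect_java_properties and the REMAINING_ARGS array are 
hypothetical and not part of the patch, the seq2sparse arguments are 
placeholders, and bash arrays are used instead of a single string so that the 
order of the options is kept and values containing spaces are not split.

```shell
#!/usr/bin/env bash
# Sketch only: separate leading -D options from the remaining arguments.
# collect_java_properties and REMAINING_ARGS are illustrative names, not
# part of the patch above.

JAVA_PROPERTIES=()
REMAINING_ARGS=()

collect_java_properties() {
  JAVA_PROPERTIES=()
  # Consume arguments only while they start with -D; stop at the first
  # argument that does not.
  while [ "$#" -gt 0 ] && [ "${1:0:2}" = "-D" ]; do
    JAVA_PROPERTIES+=("$1")   # append, so the original order is kept
    shift
  done
  # Everything left over is passed through untouched.
  REMAINING_ARGS=("$@")
}

# Placeholder invocation using the two properties from the description.
collect_java_properties \
  "-Dmapred.cache.archives=hdfs://localhost:9000/user/stakeda/ipadic.zip#ipadic" \
  "-Dmapred.create.symlink=yes" \
  seq2sparse -i in -o out

printf 'property: %s\n' "${JAVA_PROPERTIES[@]}"
printf 'argument: %s\n' "${REMAINING_ARGS[@]}"
```

Only -D options that appear before the first other argument are collected; a 
-D that appears later stays in the remaining arguments. Note also that Hadoop's 
generic option parsing normally expects -D options ahead of the application 
arguments, so where the collected options are placed on the "hadoop jar" 
command line may matter.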


