[
https://issues.apache.org/jira/browse/HIVE-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Howell updated HIVE-7288:
-------------------------------
Tags: hadoop streaming, WebHcat, libjars, archives, CSS (was: hadoop
streaming, WebHcat, libjars, archives)
> Enable support for -libjars and -archives in WebHcat for Streaming MapReduce
> jobs
> ---------------------------------------------------------------------------------
>
> Key: HIVE-7288
> URL: https://issues.apache.org/jira/browse/HIVE-7288
> Project: Hive
> Issue Type: New Feature
> Components: WebHCat
> Affects Versions: 0.11.0, 0.12.0, 0.13.0, 0.13.1
> Environment: HDInsight deploying HDP 2.1; Also HDP 2.1 on Windows
> Reporter: Azim Uddin
> Assignee: shanyu zhao
> Attachments: HIVE-7288.1.patch, hive-7288.patch
>
>
> Issue:
> ======
> The WebHCat REST API has no parameters equivalent to '-libjars' and
> '-archives', so a Streaming MapReduce job submitted via WebHCat/Templeton
> cannot use external Java JARs or archive files.
> Two use cases are described here, but there are many similar scenarios.
> #1 (for -archives):
> To use R with a Hadoop distribution such as HDInsight or HDP on Windows, we
> can package the R directory in a zip file, rename it to r.jar, and put it
> into HDFS or WASB. We can then run something like this from the hadoop
> command line (the example uses wasb paths; the same command works with
> hdfs):
> hadoop jar %HADOOP_HOME%\lib\hadoop-streaming.jar -archives
> wasb:///example/jars/r.jar -files
> "wasb:///example/apps/mapper.r,wasb:///example/apps/reducer.r" -mapper
> "./r.jar/bin/Rscript.exe mapper.r" -reducer "./r.jar/bin/Rscript.exe
> reducer.r" -input /example/data/gutenberg -output /probe/r/wordcount
> This works from the hadoop command line, but because WebHCat does not
> support the '-archives' parameter, we can't submit the same Streaming MR job
> via WebHCat.
> #2 (for -libjars):
> Consider a user who wants to use a custom input or output format with a
> Streaming MapReduce job and has packaged the class in their own JAR. From
> the hadoop command line we can do something like this:
> hadoop jar /path/to/hadoop-streaming.jar \
> -libjars /path/to/custom-formats.jar \
> -D map.output.key.field.separator=, \
> -D mapred.text.key.partitioner.options=-k1,1 \
> -input my_data/ \
> -output my_output/ \
> -outputformat test.example.outputformat.DateFieldMultipleOutputFormat \
> -mapper my_mapper.py \
> -reducer my_reducer.py
> But because WebHCat does not support the '-libjars' parameter for streaming
> MapReduce jobs, we can't submit the above streaming MR job (which uses a
> custom Java JAR) via WebHCat.
> Impact:
> ========
> We think that being able to submit jobs remotely is vital for Hadoop to be
> enterprise-ready, and WebHCat plays an important role there. Streaming
> MapReduce jobs are also very important for interoperability. It would
> therefore be very useful to keep WebHCat on par with the hadoop command line
> in terms of streaming MR job submission capability.
> Ask:
> ====
> Add 'libjars' and 'archives' parameter support to WebHCat for Hadoop
> streaming jobs.
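
For illustration, use case #1 above could be submitted through WebHCat's
existing mapreduce/streaming resource roughly as sketched below. This is a
sketch under assumptions, not a tested recipe: the base URL, the user name,
the use of curl, and passing user.name in the query string are all assumed
here. The field names (input, output, mapper, reducer, file) are the ones the
streaming resource is understood to accept. The point is that the .r scripts
can be shipped via 'file', but there is no field that carries r.jar the way
'-archives' does, so the mapper and reducer paths under ./r.jar cannot
resolve.

```shell
# Sketch only: host, port, user name, and the curl invocation are assumptions.

# Print one form field per line for the word-count job from use case #1.
# Note what is missing: no field corresponds to -archives (for r.jar) or
# to -libjars, so ./r.jar/bin/Rscript.exe will not exist on the task nodes.
streaming_form_fields() {
  cat <<'EOF'
input=/example/data/gutenberg
output=/probe/r/wordcount
mapper=./r.jar/bin/Rscript.exe mapper.r
reducer=./r.jar/bin/Rscript.exe reducer.r
file=wasb:///example/apps/mapper.r
file=wasb:///example/apps/reducer.r
EOF
}

# POST the fields to WebHCat.
#   $1: base URL of the WebHCat server (hypothetical, e.g. http://localhost:50111)
#   $2: user name to submit the job as (hypothetical)
submit_streaming_job() {
  base_url=$1
  user=$2
  set --                      # reuse the positional parameters as a curl argument list
  while IFS= read -r field; do
    set -- "$@" --data-urlencode "$field"
  done <<EOF
$(streaming_form_fields)
EOF
  curl -s "$@" "$base_url/templeton/v1/mapreduce/streaming?user.name=$user"
}
```

A call would look like `submit_streaming_job http://localhost:50111 someuser`,
with both arguments replaced by values for the actual cluster.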