Azim Uddin created HIVE-7288:
--------------------------------
Summary: Enable support for -libjars and -archives in WebHcat for
Hadoop Streaming jobs
Key: HIVE-7288
URL: https://issues.apache.org/jira/browse/HIVE-7288
Project: Hive
Issue Type: New Feature
Components: WebHCat
Affects Versions: 0.13.1, 0.13.0, 0.12.0, 0.11.0
Environment: HDInsight deploying HDP 2.1; Also HDP 2.1 on Windows
Reporter: Azim Uddin
Issue:
======
The WebHCat REST API has no parameters equivalent to '-libjars' and
'-archives', so we cannot use external Java JARs or archive files with a
Hadoop Streaming job when the job is submitted via WebHCat/Templeton.
Two use cases are described here, but there are many similar scenarios.
#1 (for -archives): In order to use R with a hadoop distribution like HDInsight
or HDP on Windows, we can package the R directory in a zip file, rename it to
r.jar, and put it into HDFS or WASB. We can then run something like this from
the hadoop command line (the same command works with HDFS paths in place of
the wasb ones):
hadoop jar %HADOOP_HOME%\lib\hadoop-streaming.jar ^
-archives wasb:///example/jars/r.jar ^
-files "wasb:///example/apps/mapper.r,wasb:///example/apps/reducer.r" ^
-mapper "./r.jar/bin/Rscript.exe mapper.r" ^
-reducer "./r.jar/bin/Rscript.exe reducer.r" ^
-input /example/data/gutenberg ^
-output /probe/r/wordcount
This works from the hadoop command line, but because WebHCat does not support
the '-archives' parameter, we can't do the same via WebHCat.
#2 (for -libjars):
Consider a scenario where a user would like to use a custom format class with
a Hadoop Streaming job (an InputFormat, or an OutputFormat as in this example)
and has packaged it in a custom JAR. From the hadoop command line we can do
something like this:
hadoop jar /path/to/hadoop-streaming.jar \
-libjars /path/to/custom-formats.jar \
-D map.output.key.field.separator=, \
-D mapred.text.key.partitioner.options=-k1,1 \
-input my_data/ \
-output my_output/ \
-outputformat test.example.outputformat.DateFieldMultipleOutputFormat \
-mapper my_mapper.py \
-reducer my_reducer.py
But because WebHCat does not support the '-libjars' parameter for Hadoop
Streaming jobs, we can't submit the above job (which uses a custom Java JAR)
via WebHCat.
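For comparison, here is a rough sketch of what use case #1 looks like when
submitted through the WebHCat mapreduce/streaming resource today. It is only a
dry run that prints the curl invocation, one argument per line. The host, the
port (50111 is the usual Templeton default), the user name and the paths are
placeholder assumptions, and the parameter names (input, output, mapper,
reducer, file) are the ones we understand the streaming resource to accept.
None of them corresponds to '-archives' or '-libjars', so the r.jar archive
cannot be shipped and the mapper path below would not resolve on the cluster.

```shell
#!/bin/sh
# Sketch only: URL, user and paths are assumptions, not values from a real cluster.
WEBHCAT_URL="${WEBHCAT_URL:-http://localhost:50111/templeton/v1}"
WEBHCAT_USER="${WEBHCAT_USER:-hadoopuser}"

# Print the curl command for a streaming job, one argument per line.
# Nothing is sent; to submit for real, run the same arguments directly
# instead of passing them to printf.
print_streaming_request() {
  printf '%s\n' \
    curl -s \
    --data-urlencode "user.name=${WEBHCAT_USER}" \
    --data-urlencode "input=/example/data/gutenberg" \
    --data-urlencode "output=/probe/r/wordcount" \
    --data-urlencode "mapper=./r.jar/bin/Rscript.exe mapper.r" \
    --data-urlencode "reducer=./r.jar/bin/Rscript.exe reducer.r" \
    --data-urlencode "file=wasb:///example/apps/mapper.r" \
    --data-urlencode "file=wasb:///example/apps/reducer.r" \
    "${WEBHCAT_URL}/mapreduce/streaming"
}

print_streaming_request
```

The 'file' parameter covers what '-files' does on the command line, but there
is no equivalent slot in this request for the archive or for an extra JAR.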
Impact:
========
We think that being able to submit jobs remotely is a vital feature for hadoop
to be enterprise-ready, and WebHCat plays an important role there. Hadoop
Streaming jobs are also very important for interoperability. So it would be
very useful to keep WebHCat on par with the hadoop command line in terms of
streaming job submission capability.
Ask:
====
Enable parameter support for 'libjars' and 'archives' in WebHCat for Hadoop
Streaming jobs.