[
https://issues.apache.org/jira/browse/OOZIE-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933753#comment-15933753
]
Attila Sasvari commented on OOZIE-2821:
---------------------------------------
[~rkanter] Thanks for sharing the design of MAPREDUCE-6415. Creating the HAR
file does in fact require executing an MR job (with {{hadoop archive}}). It
would be pretty easy to extend the sharelib upload to run that MR job after
uploading the jars
(following [this |
https://github.com/apache/hadoop/blob/b970446b2c59f8897bb2c3a562fa192ed3452db5/hadoop-tools/hadoop-archives/src/test/java/org/apache/hadoop/tools/TestHadoopArchives.java]
or the hadoop CLI command).
Or we can run {{hadoop archive}} with the input set to the local filesystem and
the output set to HDFS (without uploading the jars to HDFS first). I just
tested that this works:
{code}
$ bin/hadoop archive -archiveName sharelib.har -p file:///Users/asasvari/workspace/apache/oozie/distro/target/oozie-4.4.0-SNAPSHOT-distro/oozie-4.4.0-SNAPSHOT/ share -r 1 /user/asasvari/oozie/
...
bin/hadoop archive -archiveName sharelib.har -p share -r 1 4.68s user 0.27s system 24% cpu 20.164 total
$ bin/hadoop dfs -ls /user/asasvari/oozie/sharelib.har
-rw-r--r-- 1 asasvari supergroup 0 2017-03-20 23:35 /user/asasvari/oozie/sharelib.har/_SUCCESS
-rw-r--r-- 5 asasvari supergroup 63156 2017-03-20 23:35 /user/asasvari/oozie/sharelib.har/_index
-rw-r--r-- 5 asasvari supergroup 25 2017-03-20 23:35 /user/asasvari/oozie/sharelib.har/_masterindex
-rw-r--r-- 1 asasvari supergroup 330827053 2017-03-20 23:35 /user/asasvari/oozie/sharelib.har/part-0
{code}
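If the sharelib upload were extended to run the archive job itself, the arguments from the command above would have to be assembled in code. The following is only a minimal sketch: the class {{HarArgsSketch}}, the method {{buildArgs}} and the placeholder parent path are hypothetical names introduced for illustration, and the hand-off to Hadoop (as done in the linked {{TestHadoopArchives}}) is shown only as a comment because it needs Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: assembles the argument list for "hadoop archive". */
class HarArgsSketch {

    /**
     * Builds the arguments in the same order as the CLI call in the comment above:
     * -archiveName NAME -p PARENT SOURCE -r REPLICATION DEST
     */
    static List<String> buildArgs(String archiveName, String parent,
                                  String source, int replication, String dest) {
        List<String> args = new ArrayList<>();
        args.add("-archiveName");
        args.add(archiveName);
        args.add("-p");
        args.add(parent);      // parent path, may be a file:// URI
        args.add(source);      // path relative to the parent, e.g. "share"
        args.add("-r");
        args.add(String.valueOf(replication));
        args.add(dest);        // destination directory on HDFS
        return args;
    }

    public static void main(String[] argv) {
        // Placeholder parent path (assumption), not the real distro location.
        List<String> args = buildArgs("sharelib.har",
                "file:///path/to/oozie-distro/", "share", 1, "/user/asasvari/oozie/");

        // With Hadoop on the classpath, the list could presumably be handed to
        // the archive tool, along the lines of TestHadoopArchives:
        //   ToolRunner.run(new HadoopArchives(conf), args.toArray(new String[0]));
        System.out.println("hadoop archive " + String.join(" ", args));
    }
}
```

Keeping the argument assembly separate from the job submission would make it possible to unit test the sharelib upload without a running cluster.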
We could also use archives for the sharelib action types. Each action type
could be stored in a compressed file (say, a zip). This way, uploading would
also be faster (only one file per action type instead of hundreds). We could
add the archive with {{DistributedCache.addCacheArchive()}} (in
{{JavaActionExecutor}}).
Unfortunately, I do not know how we could easily add the jars inside the
archives to the classpath: {{DistributedCache.addArchiveToClassPath()}} will
add the uncompressed dir (say pig.zip) to the classpath, but not
{{pig.zip/*}}. Even if we find a solution, it would be harder to see the
actual content of the sharelib (we would first have to get and decompress the
archive, or add some metadata file next to the sharelib action type, or
something similar).
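To illustrate the classpath problem: a directory entry on the classpath does not pull in the jars inside it, so each jar would have to be listed individually. The sketch below shows that expansion with plain {{java.nio}}. It assumes the code runs where the archive has already been unpacked; {{ArchiveClasspathSketch}} and {{jarEntries}} are hypothetical names, and how this could be wired into {{JavaActionExecutor}} or the distributed cache remains an open question.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: expands an unpacked archive dir into per-jar classpath entries. */
class ArchiveClasspathSketch {

    /** Returns the jar files directly under the unpacked directory, sorted by path. */
    static List<String> jarEntries(Path unpackedDir) throws IOException {
        List<String> jars = new ArrayList<>();
        // The glob is matched against the file name of each directory entry.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(unpackedDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toString());
            }
        }
        Collections.sort(jars);
        return jars;
    }

    public static void main(String[] args) throws IOException {
        // Simulate an unpacked "pig.zip" directory with two jars and one other file.
        Path dir = Files.createTempDirectory("pig.zip");
        Files.createFile(dir.resolve("pig.jar"));
        Files.createFile(dir.resolve("antlr.jar"));
        Files.createFile(dir.resolve("README.txt"));

        // Only the jars end up in the classpath string; README.txt is skipped.
        System.out.println(String.join(File.pathSeparator, jarEntries(dir)));
    }
}
```

The same listing could also serve as the metadata file mentioned above, so the content of an archived action type stays visible without downloading and decompressing it.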
> Using Hadoop Archives for Oozie ShareLib
> ----------------------------------------
>
> Key: OOZIE-2821
> URL: https://issues.apache.org/jira/browse/OOZIE-2821
> Project: Oozie
> Issue Type: New Feature
> Reporter: Attila Sasvari
>
> Oozie ShareLib is a collection of many jar files that are required by
> Oozie actions. Right now, these jars are uploaded one by one during Oozie
> ShareLib installation. There can be hundreds of such jars, and many of them
> are pretty small, significantly smaller than an HDFS block. Storing a
> large number of small files in HDFS is inefficient (for example, because
> an object is maintained for each file in the NameNode's memory, and
> blocks containing the small files might be much bigger than the
> actual files). When an action is executed, these jar files are copied to the
> distributed cache.
> It would be worth investigating the possibility of using [Hadoop
> archives|http://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html]
> for handling Oozie ShareLib files, because it could result in better
> utilisation of HDFS.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)