[ 
https://issues.apache.org/jira/browse/OOZIE-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933753#comment-15933753
 ] 

Attila Sasvari commented on OOZIE-2821:
---------------------------------------

[~rkanter] Thanks for sharing the design of MAPREDUCE-6415. Creating the HAR 
file does in fact require running an MR job (with {{hadoop archive}}). It would 
be pretty easy to extend the sharelib upload to run that MR job after uploading 
the jars (following [this | 
https://github.com/apache/hadoop/blob/b970446b2c59f8897bb2c3a562fa192ed3452db5/hadoop-tools/hadoop-archives/src/test/java/org/apache/hadoop/tools/TestHadoopArchives.java]
 or the hadoop CLI command). 

Or we can run {{hadoop archive}} with the input set to the local filesystem and 
the output set to HDFS (without uploading the jars to HDFS first). I just 
tested that this works.

{code}
$ bin/hadoop archive -archiveName sharelib.har -p file:///Users/asasvari/workspace/apache/oozie/distro/target/oozie-4.4.0-SNAPSHOT-distro/oozie-4.4.0-SNAPSHOT/ share -r 1 /user/asasvari/oozie/
...
bin/hadoop archive -archiveName sharelib.har -p  share -r 1   4.68s user 0.27s system 24% cpu 20.164 total

$ bin/hadoop dfs -ls /user/asasvari/oozie/sharelib.har
-rw-r--r--   1 asasvari supergroup          0 2017-03-20 23:35 /user/asasvari/oozie/sharelib.har/_SUCCESS
-rw-r--r--   5 asasvari supergroup      63156 2017-03-20 23:35 /user/asasvari/oozie/sharelib.har/_index
-rw-r--r--   5 asasvari supergroup         25 2017-03-20 23:35 /user/asasvari/oozie/sharelib.har/_masterindex
-rw-r--r--   1 asasvari supergroup  330827053 2017-03-20 23:35 /user/asasvari/oozie/sharelib.har/part-0
{code}
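
The resulting archive can be read through the {{har://}} scheme, so its 
contents stay browsable with the usual filesystem commands. A minimal sketch, 
assuming the archive path from the listing above; the helper name {{har_uri}} 
is hypothetical, and the commands are only printed here rather than executed:

```shell
# Build a har:// URI for an archive stored on the default filesystem.
# $1: absolute HDFS path of the .har directory
# $2: optional relative path inside the archive
har_uri() {
  printf 'har://%s%s\n' "$1" "${2:+/$2}"
}

# Archive path taken from the listing above.
archive=/user/asasvari/oozie/sharelib.har

# Print the commands instead of running them; drop "echo" on a machine
# with a configured Hadoop client.
echo bin/hadoop fs -ls -R "$(har_uri "$archive")"
echo bin/hadoop fs -ls "$(har_uri "$archive" share)"
```

The {{share}} sub-path is the source directory that was passed to 
{{hadoop archive}} above, so it should be the top-level entry in the archive.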

We could also use archives for the sharelib action types. Each action type 
could be in a compressed file (say a zip). This way, uploading would also be 
faster (only one file per action type to upload instead of hundreds). We could 
add the archive with {{DistributedCache.addCacheArchive()}} (in 
{{JavaActionExecutor}}). Unfortunately, I do not know how we could easily add 
the jars inside the archives to the classpath: 
{{DistributedCache.addArchiveToClassPath()}} will add the uncompressed dir (say 
pig.zip) to the classpath, but not {{pig.zip/*}}. Even if we find a solution, 
it would be harder to see the actual content of the sharelib (one would first 
have to get and then decompress the archive, or we would have to add some 
metadata file next to each sharelib action type, or something similar).
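
To illustrate the classpath problem: putting the unpacked directory itself on 
the classpath is not enough, each jar inside it has to be listed. A minimal 
shell sketch of that expansion, assuming the archive has already been unpacked 
into a local directory. The function name {{build_classpath}} and the jar names 
are hypothetical, and this is not Oozie code:

```shell
# Print a ':'-separated classpath containing every jar directly under $1.
build_classpath() {
  dir=$1
  classpath=""
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue            # the glob matched nothing
    classpath="${classpath:+$classpath:}$jar"
  done
  printf '%s\n' "$classpath"
}

# Demo with a throwaway directory standing in for an unpacked pig.zip.
demo=$(mktemp -d)
touch "$demo/pig.jar" "$demo/antlr.jar" "$demo/README.txt"
build_classpath "$demo"                  # lists only the two jars
rm -rf "$demo"
```

Something equivalent would have to happen on the task side after the archive 
is localized, which is the part I do not see an easy hook for.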



> Using Hadoop Archives for Oozie ShareLib
> ----------------------------------------
>
>                 Key: OOZIE-2821
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2821
>             Project: Oozie
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>
> Oozie ShareLib is a collection of many jar files that are required by 
> Oozie actions. Right now, these jars are uploaded one by one during Oozie 
> ShareLib installation. There can be hundreds of such jars, and many of them 
> are pretty small, significantly smaller than an HDFS block. Storing a 
> large number of small files in HDFS is inefficient (for example, because 
> an object is maintained for each file in the NameNode's memory, and the 
> blocks containing the small files might be much bigger than the 
> actual files). When an action is executed, these jar files are copied to the 
> distributed cache.
> It would be worth investigating the possibility of using [Hadoop 
> archives|http://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html]
>  for handling Oozie ShareLib files, because it could result in better 
> utilisation of HDFS. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
