[
https://issues.apache.org/jira/browse/OOZIE-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934924#comment-15934924
]
Robert Kanter commented on OOZIE-2821:
--------------------------------------
That's a good idea using the local filesystem as the input. We could use this
instead of directly uploading.
I'm not sure I follow the part about a zip though. The thing about HAR
archives is that they're not _really_ archives: Hadoop can access their
contents natively through the {{har://}} scheme. So if we put the sharelib
jars into HAR file(s), we could easily add them to the launcher job the same
way we do today, just changing each path to the corresponding {{har://}}
path. Another thing to keep in mind if we use actual archives (i.e. zip
files) is that they have to be extracted when being localized, which may add
some overhead for larger sharelib dirs (e.g. Spark).
However, you are right that using a HAR file would make it harder to manually
add extra files to the sharelib. Given that, I think making a separate HAR
for each sharelib type makes the most sense. So we could have:
{noformat}
/oozie/share/lib/hive/oozie-hive-sharelib.har (all of the Oozie-supplied hive jars)
/oozie/share/lib/hive/custom.jar (user manually uploaded this file)
/oozie/share/lib/hive/hive-site.xml (user manually uploaded this file)
{noformat}
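As a rough sketch of what the launcher path change would look like (the
namenode address and jar name below are placeholder assumptions, not from
the issue):

```shell
# Hypothetical path mapping for a per-type sharelib HAR.
# Building the archive itself needs a cluster and would look something like
# this (shown as a comment only; staging path is an assumption):
#   hadoop archive -archiveName oozie-hive-sharelib.har \
#     -p /tmp/oozie-hive-staging /oozie/share/lib/hive

SHARELIB=/oozie/share/lib/hive
JAR=hive-exec.jar

# Today: every jar is added to the launcher job as an individual HDFS path
hdfs_path="hdfs://namenode:8020${SHARELIB}/${JAR}"

# With a per-type HAR: the same jar, addressed through the har:// scheme
# (har://underlying-scheme-host:port/path/to/archive.har/file-in-archive),
# which Hadoop reads in place without extracting the archive
har_path="har://hdfs-namenode:8020${SHARELIB}/oozie-hive-sharelib.har/${JAR}"

echo "$hdfs_path"
echo "$har_path"
```

The user-uploaded files ({{custom.jar}}, {{hive-site.xml}}) would keep their
plain HDFS paths, so only the Oozie-supplied jars change scheme.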
> Using Hadoop Archives for Oozie ShareLib
> ----------------------------------------
>
> Key: OOZIE-2821
> URL: https://issues.apache.org/jira/browse/OOZIE-2821
> Project: Oozie
> Issue Type: New Feature
> Reporter: Attila Sasvari
>
> Oozie ShareLib is a collection of many jar files that are required by
> Oozie actions. Right now, these jars are uploaded one by one during Oozie
> ShareLib installation. There can be hundreds of such jars, and many of them
> are pretty small, significantly smaller than an HDFS block. Storing a
> large number of small files in HDFS is inefficient (for example, the
> NameNode maintains an object in memory for each file, and the blocks
> containing the small files can be much bigger than the actual files). When
> an action is executed, these jar files are copied to the distributed cache.
> It would be worth investigating the possibility of using [Hadoop
> archives|http://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html]
> for handling Oozie ShareLib files, because it could result in better
> utilisation of HDFS.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)