[ https://issues.apache.org/jira/browse/PIG-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126817#comment-13126817 ]
Dmitriy V. Ryaboy commented on PIG-2318:
----------------------------------------

Good start, Julien.

public void setExtraJarsInDistributedCache -- seems like we'll need an additional addJarToDistributedCache method, to avoid forcing function users to rewrite the array themselves every time.

log.info("Adding jar to DistributedCache: " + jar); -- this should be debug level.

You special-case the file and hdfs protocols. What happens to other protocols? I believe the way we deal with hdfs jars is that we copy them over to the local fs in order to drop them onto the local classpath anyway. Presumably that'll work the same way for s3n://, for example. Could we at least treat those as local jars and pick up the local copy to ship to the cluster?

Not related to your change, but any ideas why skipJars is a Vector? Seems like that's not necessary...

Some of your comments, such as the one about the PigContext constructor, should probably be in javadoc format so they make it out to the world. Knowing what does and does not get serialized would be handy.

Did you notice an appreciable improvement in startup time when using this on our cluster?

> Push extra jars to distributed cache and use the classloader extension mechanism in PigContext to load them on the backend
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2318
>                 URL: https://issues.apache.org/jira/browse/PIG-2318
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Julien Le Dem
>            Assignee: Julien Le Dem
>         Attachments: PIG-2318.patch
>
>
> This is related to PIG-2010 with a slightly different approach
> https://issues.apache.org/jira/browse/PIG-2010
> Currently Pig bundles up all dependencies in a single jar which is a lot of
> overhead when there are a lot of dependencies and short lived jobs.
> This patch instead pushes the dependencies to distributed cache and uses the
> PigContext classloading mechanism to make the UDFs available.
> Possible improvements: push jars to HDFS/distributed cache only once per
> script. Have a cache on HDFS to avoid repeatedly pushing jars to HDFS.
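
To make the review points about an additive method, the Vector, and non-file protocols concrete, here is a minimal Java sketch. It is not code from PIG-2318.patch and not Pig's actual API: the class ExtraJarRegistry and the methods setExtraJars, addExtraJar, getExtraJars and isLocal are hypothetical names chosen for illustration, and it uses a List of URIs where the patch's setter takes an array. It only shows how an add-one-jar method can sit next to a replace-all setter, with a plain ArrayList instead of a Vector and a scheme check that separates local jars from everything else.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for the extra jars; NOT Pig's actual PigContext API.
class ExtraJarRegistry {
    // ArrayList rather than Vector: nothing here needs per-call synchronization.
    private final List<URI> extraJars = new ArrayList<>();

    // Replaces the whole list (the role a set-all method would play).
    void setExtraJars(List<URI> jars) {
        List<URI> copy = new ArrayList<>(jars); // copy first, in case jars aliases our list
        extraJars.clear();
        for (URI jar : copy) {
            addExtraJar(jar);
        }
    }

    // Appends a single jar so callers do not have to rebuild the collection.
    // Returns false if the jar was already registered.
    boolean addExtraJar(URI jar) {
        if (extraJars.contains(jar)) {
            return false;
        }
        extraJars.add(jar);
        return true;
    }

    // Read-only view of the registered jars.
    List<URI> getExtraJars() {
        return Collections.unmodifiableList(extraJars);
    }

    // True when the URI has no scheme or the "file" scheme. Any other scheme
    // (hdfs, s3n, ...) would need a local copy first under the approach
    // suggested in the comment; that copy step is not shown here.
    static boolean isLocal(URI jar) {
        String scheme = jar.getScheme();
        return scheme == null || "file".equalsIgnoreCase(scheme);
    }
}
```

With this shape, a function author would call addExtraJar(URI.create("file:/tmp/a.jar")) once per dependency, and the code that ships jars could branch on isLocal to decide whether a jar must first be fetched to the local filesystem.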