On Thursday 12 August 2010 07:56 PM, Kaluskar, Sanjay wrote:
Hi Mridul,
[BTW thanks, I am glad to see some help on this mailing list - I have
been burnt out by this problem!]
I am not sure I understand the short term solution. It seems like you
are still suggesting over-writing some of the files; wouldn't that break
some of the dependencies? Let me give you a very specific example.
Suppose I write a UDF with 2 dependencies - a.jar and b.jar. Further
suppose that both have a configuration file called types.xsd (stored as
META-INF/resources/types.xsd in each jar), which is accessed at runtime
(through this.getClass().getResource()) by specifying the location of the
file. Now, when I register both the jars, PIG will expand & re-package
everything into a single jar, which means that one of the types.xsd will
be overwritten. This means that either a.jar or b.jar won't function as
expected.
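To make the failure concrete: getResource() returns only the first classpath match, so once both jars are flattened into one, a single types.xsd wins silently. A minimal sketch (pure JDK; it probes a resource guaranteed to exist so it runs anywhere, but the same calls apply to META-INF/resources/types.xsd):

```java
import java.net.URL;
import java.util.Enumeration;

public class ResourceLookup {
    public static void main(String[] args) throws Exception {
        ClassLoader cl = ResourceLookup.class.getClassLoader();

        // getResource() yields only the FIRST match in classpath order;
        // this is the copy a library sees via getClass().getResource().
        URL first = cl.getResource("java/lang/Object.class");
        System.out.println("first match: " + first);

        // getResources() enumerates EVERY copy. Before merging, two jars
        // shipping the same path would show two entries here; after the
        // jars are re-packaged into one, only a single entry remains.
        Enumeration<URL> all = cl.getResources("java/lang/Object.class");
        int copies = 0;
        while (all.hasMoreElements()) {
            all.nextElement();
            copies++;
        }
        System.out.println("copies visible: " + copies);
    }
}
```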
I was not thinking of meta-inf dependencies, my mistake - you are right,
it will fail for that.
I was thinking only of class resolution: and typically, overwriting in
reverse order should be relatively fine (it is not a general solution;
there are corner cases where it will fail).
That is the reason I am aiming for a solution that lets me specify all
the dependencies on the classpath. These 150 dependencies are pretty
much like 3rd party software for me; I don't really understand them well
enough or control them (and really, I shouldn't have to do that else it
would get very hard to use any software).
In this case, the URLClassLoader and reflection based second
"solution" should probably work for you?
You should be careful to ensure that no references to the actual
business logic are made 'directly', only through classes you create via
reflection.
Regards,
Mridul
Right now, my workaround is fairly robust but ugly - I am adding the
top-level jar to HADOOP_CLASSPATH. That jar lists a.jar, b.jar, ... in
the list of files in Class-Path in META-INF/MANIFEST.MF.
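For reference, the relevant part of such a top-level manifest looks roughly like this (jar names hypothetical); per the JAR specification, Class-Path entries are resolved relative to the referencing jar's own location:

```
Manifest-Version: 1.0
Class-Path: lib/a.jar lib/b.jar lib/c.jar
```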
-sanjay
-----Original Message-----
From: Mridul Muralidharan [mailto:[email protected]]
Sent: Thursday, August 12, 2010 1:03 PM
To: [email protected]
Cc: Kaluskar, Sanjay
Subject: Re: Adding entries to classpath
A short term alternative would be to find out the order in which pig
expands the jars, and ensure that your jars are expanded in reverse
order.
As in, if you need your classpath to be "a.jar:b.jar:c.jar", and pig
un-jars the register'ed jars in the order they are specified in the
script, then simply register them in reverse order -
register c.jar;
register b.jar;
register a.jar;
(I am assuming an order of expansion here, and also that there IS an
order to begin with !).
This would be consistent with how java loads the classes for most part
(unless you have tricky jar-level dependencies: I am ignoring that
possibility for now).
Worth a shot anyway while we wait for pig/hadoop to fix for next
release.
Another alternative might be to add all dependencies into an archive,
'expand' this in an init block in your udf, use a URLClassLoader to load
this and use reflection to invoke your code: possibly I might be
missing something, but it looks workable ...
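A minimal, hedged sketch of that approach, assuming the archive has already been expanded locally. The jar paths and the business-logic class name are hypothetical stand-ins; main() demonstrates the mechanics with a JDK class so the sketch runs anywhere:

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class ReflectiveLoader {

    // Load className through a URLClassLoader over the given jars and
    // invoke a no-arg method on a fresh instance. The parent is the UDF's
    // own loader, so framework and JDK classes still resolve normally;
    // only the extra jars are searched in addition.
    public static Object invoke(URL[] jars, String className, String methodName)
            throws Exception {
        try (URLClassLoader loader =
                 new URLClassLoader(jars, ReflectiveLoader.class.getClassLoader())) {
            Class<?> cls = loader.loadClass(className);
            Object instance = cls.getDeclaredConstructor().newInstance();
            Method m = cls.getMethod(methodName);
            return m.invoke(instance);
        }
    }

    public static void main(String[] args) throws Exception {
        // In the UDF this would point at the expanded archive, e.g.
        //   new File("deps/a.jar").toURI().toURL()
        // and className would be the real business-logic class.
        Object size = invoke(new URL[0], "java.util.ArrayList", "size");
        System.out.println(size); // 0
    }
}
```

The caveat noted above applies: the UDF itself must not reference the business-logic classes directly, or they would be resolved by the task's own classloader instead of this one.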
Regards,
Mridul
On Thursday 12 August 2010 07:49 AM, Kaluskar, Sanjay wrote:
Thanks Ashutosh, I will try that out.
Arun,
I had already explained why I can't register the 150 jars (very
tedious, error-prone, and PIG then unpacks & re-packs, which ends up
over-writing some of the resource files that have the same names). I
also explained why dist cache doesn't work in this scenario (because
specifying the jars individually doesn't preserve the dir structure,
and specifying the zip file doesn't allow adding the jar to the
classpath). I have been trying this out for a few days using various
options suggested in the doc. Finally, I started reading the hadoop
source code and discovered why none of the solutions would work. I
would actually fix the mapred.child.java.opts to allow adding to the
classpath, if it were my choice because it is a generic solution, it
would be consistent with how java.library.path is handled. I would
also fix PIG to not try and mangle all the registered jars - I have been
burnt by that. I think PIG should instead put all the registered jars
on the classpath.
-sanjay
-----Original Message-----
From: Ashutosh Chauhan [mailto:[email protected]]
Sent: Wednesday, August 11, 2010 10:39 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Adding entries to classpath
Adding pig-user@
Sanjay,
You can do this in Pig by setting following -D switch at the command
line through which you invoke Pig.
-Dpig.streaming.ship.files=myTopLevel.jar
In 0.8 release you will be able to do this from within Pig script like
set pig.streaming.ship.files myTopLevel.jar;
Note that this is just to unblock you. It's an internal Pig property
that is not exposed to users and may break your script if you are
also using Streaming from within Pig. We need to find a long-term
solution for your particular use case.
Hope it helps,
Ashutosh
On Wed, Aug 11, 2010 at 09:30, Arun C Murthy<[email protected]>
wrote:
Moving to mapreduce-user@, bcc common-u...@.
Why do you need to create a single top-level jar? Just register each
of your jars and put each in the distributed cache... however, you
have 150 jars, which is a lot. Is there a way you can decrease that?
I'm not sure how you do this in pig, but in MR you have the ability
to add a jar in the DC to the classpath of the child
(DistributedCache.addFileToClassPath).
Hope that helps.
Arun
On Aug 11, 2010, at 12:48 AM, Kaluskar, Sanjay wrote:
I am using Hadoop indirectly through PIG, and some of the UDFs
(defined by me) need other jars at runtime (around 150), some of which
have conflicting resource names. Hence, unpacking all of them and
repacking into a single jar doesn't work. My solution is to create a
single top-level jar that names all the dependencies in Class-Path in
the MANIFEST.MF. This is also simpler from a user's point of view. Of
course, this requires the top-level jar and all the dependencies to be
created with a certain directory structure that I can control.
Currently, I have a root directory which contains the top-level jar
and a directory called lib; all the dependencies are in lib, and the
top-level jar names the dependencies as lib/x.jar, lib/y.jar etc. I
package all of this as a single zip file for easy installation.
Just to be clear, this is the dir structure:
root dir
|
|--- top-level.jar
|--- lib
|--- x.jar
|--- y.jar
I can't register top-level.jar in my PIG script (this is the
recommended approach) because PIG then unpacks & repackages everything
into a single jar, instead of including the jar on the classpath. I
can't use distributed cache because if I specify top-level.jar and lib
separately in mapred.cache.files, then the relative directory
locations aren't preserved. If I use the mapred.cache.archives option
and specify the zip file, I can't add the top-level jar to the
classpath (because the entries in mapred.job.classpath.files must be
something from mapred.cache.files).
If mapred.child.java.opts also allowed java.class.path to be
augmented (similar to java.library.path, which I am using for native
libs that I store in another dir parallel to lib), it would have
solved my problem.
I could have specified the zip in mapred.cache.archives, and added
the jar to the classpath. Right now I can't see any solution, other
than using a shared file system and adding top-level.jar to
HADOOP_CLASSPATH - this works because I am using a small cluster that
has a shared file system, but clearly it's not always feasible (and of
course, it's modifying Hadoop's environment).
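For comparison, the java.library.path mechanism mentioned above is configured roughly like this (a hedged sketch; the value shown is hypothetical). mapred.child.java.opts is a plain JVM-options string, while the child's -classpath is built separately by the framework, which appears to be why no similar hook exists for java.class.path:

```xml
<!-- Sketch: native library path for child JVMs; path is hypothetical. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Djava.library.path=/opt/myapp/native</value>
</property>
```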
Please suggest any alternatives you can think of.
Thanks,
-sanjay