Bumping this thread because I'm now more aware of what is actually
happening. If I understand correctly, when submitting jobs using RunJar
the classpath is extended using a new classloader. This classloader adds
the unpacked contents of the jar to the current thread's classpath
(the contextClassLoader). This brings two issues to mind:
1) In RunJar, when constructing the new URLClassLoader, would it not be
better to chain the *previous* contextClassLoader instead of using the
system classloader? (The latter is used when the classloader argument is
omitted in the URLClassLoader constructor, which is what RunJar does.)
This is truly a minor issue, since most of the time RunJar is used as a
result of invoking 'hadoop jar' from the command line, in which case the
previous thread contextClassLoader actually will be the system
classloader. I bring this up mainly to better understand the process.
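The chaining I have in mind would look roughly like this (just a sketch
with names of my own, not the actual RunJar source):

import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

class RunJarLoaderSketch {
  static void installJobClassLoader(List<URL> unpackedJarUrls) {
    // Roughly what RunJar does today: no parent argument, so the new
    // loader's parent defaults to the system classloader.
    //   ClassLoader loader = new URLClassLoader(unpackedJarUrls.toArray(new URL[0]));

    // Suggestion: chain the previous thread contextClassLoader instead.
    ClassLoader previous = Thread.currentThread().getContextClassLoader();
    ClassLoader loader =
        new URLClassLoader(unpackedJarUrls.toArray(new URL[0]), previous);

    Thread.currentThread().setContextClassLoader(loader);
  }
}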
2) To follow up on my previous findings on AbstractMapWritable: I think
the reason it cannot find classes is that it is loaded by a parent
classloader (the system classloader) instead of the new child
classloader set by RunJar. The classloader of AbstractMapWritable is not
this child classloader because it is already loaded (indirectly, through
Configuration) before the thread contextClassLoader is replaced in
RunJar, and it is therefore unable to find certain extracted classes. So
why does AbstractMapWritable use the classloader of its own class
[Class.forName(className)] instead of the current thread's
[Class.forName(className, true,
Thread.currentThread().getContextClassLoader())]? Is it not wiser to
always use the latter construction in general classloading code?
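In code, the difference I mean is roughly the following (again just a
sketch; the class and method names are mine, not the actual
AbstractMapWritable members):

class ForNameSketch {
  // Class.forName(className) resolves with the defining classloader of the
  // calling class; inside AbstractMapWritable that is the loader that loaded
  // AbstractMapWritable itself (the system loader in the RunJar scenario).
  static Class<?> resolveWithOwnLoader(String className)
      throws ClassNotFoundException {
    return Class.forName(className);
  }

  // The alternative: resolve with the thread's contextClassLoader, which
  // RunJar has pointed at the unpacked job jar.
  static Class<?> resolveWithContextLoader(String className)
      throws ClassNotFoundException {
    return Class.forName(className, true,
        Thread.currentThread().getContextClassLoader());
  }
}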
Ferdy.
On 09/09/2011 11:54 AM, Ferdy Galema wrote:
Sometimes when running hadoop jobs using the 'hadoop jar' command
there are issues with the classloader. I presume these are caused by
classes that are loaded BEFORE the command's main is invoked. For
example, when relying on MapWritable in the command, it is not
possible to use a class that is not in the default idToClassMap.
MapWritable.class is loaded before the user job is unpacked and
therefore its classloader will not be able to find custom classes.
(At least, classes that are only on the classpath of RunJar's new
classloader.)
I could not find any issues or discussion about this, so I assume it is
somewhat of an obscure issue (please correct me if I'm wrong). Anyway, I
would like to hear what you think of this and perhaps discuss a possible
solution, such as spawning the command in a new JVM.
MapWritable, or rather AbstractMapWritable, uses a
Class.forName(className) construction; maybe this can be changed so that
it uses the classloader of the current thread instead of that of its own
class. (Will this work?)
A workaround for now is to explicitly put the jar itself on the
classpath, i.e. 'env HADOOP_CLASSPATH=myJar hadoop jar myJar'.