This applies to hadoop-0.20-append and, I believe, any other hadoop-0.20.* release and later.
We run Hadoop tasks within an embedded runtime (z2-environment) that has its own class-loading hierarchy (not like OSGi, but the same problem should occur there). The actual Mapper (Reducer, etc.) is a generic implementation that delegates to a Mapper (Reducer, etc.) in a component loaded by a child loader of the system class loader. When using custom input format implementations, those are not found unless they are present on the system class path. The reason is that MapTask and ReduceTask do not use the right class loader when retrieving the input format class (see MapTask.runNewMapper()). Both use the class loader of taskContext.getConfiguration(), which is not set appropriately at that point in time.

We fixed that by

a) having the generic mappers/reducers implement Configurable,
b) calling Configuration.setClassLoader during setConf on those, and
c) inserting taskContext.getConfiguration().setClassLoader(job.getClassLoader()); in MapTask / ReduceTask before the input format class is retrieved.

I suggest including taskContext.getConfiguration().setClassLoader(job.getClassLoader()); at MapTask#574 (branch-0.20-append) and ReduceTask#555, respectively, so that mappers/reducers can use the configuration object to set the class loader that is used when classes are retrieved in the task context. That is, unless somebody has a better fix, of course, or there is some misunderstanding on my side.

Has this issue been identified before (I didn't find a match, but there are so many issues currently)?

Thanks,
Henning
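To illustrate why the loader carried by the configuration matters, here is a minimal, Hadoop-free sketch (all names hypothetical, plain JDK only): a child loader defines a class itself instead of delegating, the way a component loader holds classes the system loader resolves differently or not at all. Looking the class up through the wrong loader yields a different Class object, so casts and lookups against it fail.

```java
import java.io.IOException;
import java.io.InputStream;

public class LoaderIdentityDemo {
    // Child-first loader that defines Payload itself instead of delegating to
    // its parent, simulating a component loader in an embedded runtime.
    static class ChildLoader extends ClassLoader {
        ChildLoader(ClassLoader parent) { super(parent); }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (name.equals("LoaderIdentityDemo$Payload")) {
                Class<?> already = findLoadedClass(name);
                if (already != null) return already;
                // Read the class bytes from the classpath and define the class
                // in THIS loader rather than delegating upward.
                try (InputStream in = getResourceAsStream("LoaderIdentityDemo$Payload.class")) {
                    byte[] bytes = in.readAllBytes();
                    Class<?> c = defineClass(name, bytes, 0, bytes.length);
                    if (resolve) resolveClass(c);
                    return c;
                } catch (IOException e) {
                    throw new ClassNotFoundException(name, e);
                }
            }
            return super.loadClass(name, resolve);
        }
    }

    public static class Payload {}  // stand-in for a custom InputFormat

    public static void main(String[] args) throws Exception {
        ClassLoader child = new ChildLoader(LoaderIdentityDemo.class.getClassLoader());
        // The same class name, resolved through two different loaders:
        Class<?> viaChild  = Class.forName("LoaderIdentityDemo$Payload", true, child);
        Class<?> viaSystem = Class.forName("LoaderIdentityDemo$Payload");
        // Distinct Class objects: which loader the task's Configuration
        // carries decides which class the framework actually gets.
        System.out.println(viaChild == viaSystem);
        System.out.println(viaChild.getClassLoader() == child);
    }
}
```

In the Hadoop case the symptom is harsher: the custom input format exists only in the child loader, so retrieval through the configuration's default loader fails outright instead of returning a twin class.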

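For reference, steps a) and b) above can be sketched roughly as follows. Configuration and Configurable here are simplified, self-contained stand-ins for the org.apache.hadoop.conf types (real code would implement org.apache.hadoop.conf.Configurable), and GenericDelegatingMapper is a hypothetical name for our generic delegating mapper:

```java
// Simplified stand-in for org.apache.hadoop.conf.Configuration: it carries a
// class loader that later class-by-name lookups resolve against.
class Configuration {
    private ClassLoader classLoader = Thread.currentThread().getContextClassLoader();

    public void setClassLoader(ClassLoader classLoader) { this.classLoader = classLoader; }
    public ClassLoader getClassLoader() { return classLoader; }

    // Mirrors Configuration.getClassByName: resolve via the configured loader.
    public Class<?> getClassByName(String name) throws ClassNotFoundException {
        return Class.forName(name, true, classLoader);
    }
}

// Simplified stand-in for org.apache.hadoop.conf.Configurable.
interface Configurable {
    void setConf(Configuration conf);
    Configuration getConf();
}

// Generic delegating mapper: when the framework hands it the job
// configuration via setConf, it installs the component's class loader so
// subsequent lookups (e.g. the input format class) resolve against it.
public class GenericDelegatingMapper implements Configurable {
    private Configuration conf;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        conf.setClassLoader(GenericDelegatingMapper.class.getClassLoader());
    }

    @Override
    public Configuration getConf() { return conf; }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        new GenericDelegatingMapper().setConf(conf);
        // The configuration now resolves classes via the component's loader.
        System.out.println(conf.getClassLoader() == GenericDelegatingMapper.class.getClassLoader());
        System.out.println(conf.getClassByName("GenericDelegatingMapper") == GenericDelegatingMapper.class);
    }
}
```

Step c) remains necessary because, without it, MapTask/ReduceTask read the input format class before any mapper/reducer setConf has had a chance to run.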