Hi Antonio,
Check out MapTask.java. When a map task starts on the cluster, an
InputSplit object is created for it using reflection. An InputSplit
is a Writable and, like all Writables, it gets created with an empty
constructor and then initialized with readFields(). The split you build in
getSplits() is serialized with write() on the client side, so the mapper
never sees that object directly; it sees a fresh copy rebuilt from the bytes.
If you implement write() and readFields() correctly (think of these as
serialization and deserialization functions), it should all work. See
FileSplit for an example of how FileInputFormat does it.
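As a rough sketch of what that pair of methods could look like for a split that holds a long[]: the class name RowsSplit and the roundTrip() helper are made up for illustration, and the class deliberately uses only java.io so it runs on its own. In a real job it would implement org.apache.hadoop.io.Writable and be your InputSplit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a split class; a real one would implement
// org.apache.hadoop.io.Writable and Hadoop's InputSplit.
class RowsSplit {
    private long[] rows = new long[0];

    // No-arg constructor: used when the split is rebuilt from bytes.
    RowsSplit() { }

    // Constructor used when the splits are first computed.
    RowsSplit(long[] rows) {
        this.rows = rows;
    }

    long[] getRows() {
        return rows;
    }

    // Serialize: the array length first, then every element.
    public void write(DataOutput out) throws IOException {
        out.writeInt(rows.length);
        for (long row : rows) {
            out.writeLong(row);
        }
    }

    // Deserialize: read the fields back in exactly the order write() used.
    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        rows = new long[n];
        for (int i = 0; i < n; i++) {
            rows[i] = in.readLong();
        }
    }

    // Illustration only: write a split to bytes and rebuild it through the
    // empty constructor plus readFields().
    static RowsSplit roundTrip(RowsSplit original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            RowsSplit copy = new RowsSplit();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If readFields() is missing or reads the fields in a different order than write() wrote them, the copy ends up with an empty or garbled rows array, which matches the "broken" splits you are seeing.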
Cheers,
-- Philip
Here's the relevant code excerpt from MapTask.java:
  void runOldMapper(final JobConf job,
                    final BytesWritable rawSplit,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, InterruptedException,
                             ClassNotFoundException {
    InputSplit inputSplit = null;
    // reinstantiate the split
    try {
      inputSplit = (InputSplit)
        ReflectionUtils.newInstance(job.getClassByName(splitClass), job);
    } catch (ClassNotFoundException exp) {
      IOException wrap = new IOException("Split class " + splitClass +
                                         " not found");
      wrap.initCause(exp);
      throw wrap;
    }
    DataInputBuffer splitBuffer = new DataInputBuffer();
    splitBuffer.reset(split.getBytes(), 0, split.getLength());
    inputSplit.readFields(splitBuffer);

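To make those two steps concrete, here is a minimal plain-Java sketch of the same pattern: look the class up by name, call its no-arg constructor reflectively, then fill it in from raw bytes. Restorable, RangeSplit and SplitLoader are hypothetical names invented for this example, and it uses only java.io and java.lang.reflect rather than Hadoop's ReflectionUtils and DataInputBuffer.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.lang.reflect.Constructor;

// Hypothetical stand-in for the readFields() half of Hadoop's Writable.
interface Restorable {
    void readFields(DataInput in) throws IOException;
}

// Hypothetical split described by a start offset and a row count.
class RangeSplit implements Restorable {
    long start;
    int count;

    // Fields stay at their defaults until readFields() runs.
    RangeSplit() { }

    @Override
    public void readFields(DataInput in) throws IOException {
        start = in.readLong();
        count = in.readInt();
    }
}

class SplitLoader {
    // Mirrors the excerpt: reflective no-arg construction, then
    // readFields() on the serialized bytes.
    static Restorable reinstantiate(String className, byte[] raw) {
        try {
            Constructor<?> ctor = Class.forName(className).getDeclaredConstructor();
            ctor.setAccessible(true); // allow non-public constructors
            Restorable split = (Restorable) ctor.newInstance();
            split.readFields(new DataInputStream(new ByteArrayInputStream(raw)));
            return split;
        } catch (ReflectiveOperationException | IOException e) {
            throw new IllegalStateException("Could not rebuild split " + className, e);
        }
    }
}
```

The loader only knows the class name and the bytes, so the empty constructor is the only one it can call; everything else has to come from readFields().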
On Fri, Dec 11, 2009 at 11:03 AM, Antonio D'Ettole <[email protected]> wrote:
> Hi,
>
> I've been trying to code a pretty simple InputFormat. The idea is this: I
> have an array of numbers (say, the range [0-5000]) and I want each mapper
> to receive a split of size 500, i.e. 500 LongWritables.
>
> This is an excerpt from the class extending InputSplit:
>
> public class myInputSplit extends InputSplit implements Writable {
>
> long[] rows;
> myInputSplit(){ }
>
> public myInputSplit(long[] rows) {
> this.rows=rows;
> }
>
> .....
>
> }
>
> I also wrote the classes myInputFormat and myRecordReader (omitted).
>
> Now, the default constructor in the class above doesn't do much, but I had
> to put it there anyway because Hadoop was throwing an exception at runtime
> when it couldn't find said constructor. Obviously myInputFormat uses the
> right constructor with the long[] argument, but Hadoop somehow seems to give
> the mapper input splits which have been built using the default
> constructor, which is used nowhere in my code. I can tell because I put a
> breakpoint in the default constructor and yes, it is being called. As a
> result, all the input splits that are processed by the mappers are "broken",
> as the "rows" variable was never set.
> Interestingly, I also put a breakpoint in the _right_ constructor and it is
> also being called, by the getSplits() method in myInputFormat (which is
> what one would expect).
>
> Does anybody have an idea why the default constructor is being called?
>
> I hope I was clear enough, thanks for your time.
> Antonio
>