FYI, building on Antonio's great work, I finally got around to writing this
InputFormat tonight; see http://github.com/kevinweil/IntegerListInputFormat.

If people are interested, I'm happy to format it, license it appropriately,
and commit it to core Hadoop.  Let me know; otherwise I'll just leave it
as is.

Kevin

On Sun, Dec 13, 2009 at 7:03 PM, Kevin Weil <[email protected]> wrote:

> Antonio,
>
> If you're interested in open sourcing this, I'd be interested in
> using/helping.  We do something internally that's similar to this, and I've
> been meaning to write a general-purpose version for a while.  Would love to
> see what you've done and contribute back any changes we make.  Github?
>
> Thanks,
> Kevin
>
>
> On Fri, Dec 11, 2009 at 11:23 AM, Antonio D'Ettole <[email protected]> wrote:
>
>> Philip,
>>
>> that was quick and precise. I learned something today. Thank you!
>> Antonio
>>
>> On Fri, Dec 11, 2009 at 8:20 PM, Philip Zeyliger <[email protected]> wrote:
>>
>> > Hi Antonio,
>> >
>> > Check out MapTask.java.  When your job gets instantiated on the cluster,
>> > an InputSplit object is created for the task, using reflection.  An
>> > InputSplit is a Writable, and, like all Writables, it gets created with
>> > an empty constructor and initialized with readFields().
>> >
>> > If you implement write() and readFields() correctly (think of these as
>> > serialization and de-serialization functions), it should all work.  See
>> > FileSplit for an example of how FileInputFormat does it.
>> >
>> > Cheers,
>> >
>> > -- Philip
>> >
>> > Here's the relevant code excerpt from MapTask.java:
>> >
>> >
>> >   void runOldMapper(final JobConf job,
>> >                     final BytesWritable rawSplit,
>> >                     final TaskUmbilicalProtocol umbilical,
>> >                     TaskReporter reporter
>> >                     ) throws IOException, InterruptedException,
>> >                              ClassNotFoundException {
>> >     InputSplit inputSplit = null;
>> >     // reinstantiate the split
>> >     try {
>> >       inputSplit = (InputSplit)
>> >         ReflectionUtils.newInstance(job.getClassByName(splitClass), job);
>> >     } catch (ClassNotFoundException exp) {
>> >       IOException wrap = new IOException("Split class " + splitClass +
>> >                                          " not found");
>> >       wrap.initCause(exp);
>> >       throw wrap;
>> >     }
>> >     DataInputBuffer splitBuffer = new DataInputBuffer();
>> >     splitBuffer.reset(split.getBytes(), 0, split.getLength());
>> >     inputSplit.readFields(splitBuffer);
>> >
>> >
>> >
>> >
>> > On Fri, Dec 11, 2009 at 11:03 AM, Antonio D'Ettole <[email protected]> wrote:
>> >
>> > > Hi,
>> > >
>> > > I've been trying to code a pretty simple InputFormat. The idea is this:
>> > > I have an array of numbers (say, the range [0-5000]) and I want each
>> > > mapper to receive a split of size 500, i.e. 500 LongWritables.
>> > >
>> > > This is an excerpt from the class extending InputSplit:
>> > >
>> > > public class myInputSplit extends InputSplit implements Writable {
>> > >
>> > >     long[] rows;
>> > >
>> > >     myInputSplit() { }
>> > >
>> > >     public myInputSplit(long[] rows) {
>> > >         this.rows = rows;
>> > >     }
>> > >
>> > >     .....
>> > >
>> > > }
>> > >
>> > > I also wrote the classes myInputFormat and myRecordReader (omitted).
>> > >
>> > > Now, the default constructor in the class above doesn't do much, but I
>> > > had to put it there anyway because Hadoop was throwing an exception at
>> > > runtime when it couldn't find said constructor. Obviously myInputFormat
>> > > uses the right constructor with the long[] argument, but Hadoop somehow
>> > > seems to give the mapper input splits that have been built using the
>> > > default constructor, which is used nowhere in my code. I can tell
>> > > because I put a breakpoint in the default constructor and yes, it is
>> > > being called. As a result, all the input splits processed by the
>> > > mappers are "broken", as the "rows" variable was never set.
>> > > Interestingly, I also put a breakpoint in the _right_ constructor and it
>> > > is also being called, by the getSplits() method in myInputFormat (which
>> > > is what one would expect).
>> > >
>> > > Does anybody have an idea why the default constructor is being called?
>> > >
>> > > I hope I was clear enough, thanks for your time.
>> > > Antonio
>> > >
>> >
>>
>
>
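
To make Philip's point about write() and readFields() concrete, here is a
minimal stand-alone sketch of the round trip a split goes through between
getSplits() and the map task. It is an illustration under stated
assumptions, not code from this thread or from Hadoop: the class name
RowsSplit and the helper roundTrip are hypothetical, and plain java.io
streams stand in for the framework so the example runs without Hadoop on
the classpath. In a real job the class would extend InputSplit and
implement org.apache.hadoop.io.Writable, whose write(DataOutput) and
readFields(DataInput) methods have the same signatures as the ones below.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a split that carries a long[].
// Assumption: a real version would extend InputSplit and implement Writable.
class RowsSplit {
    long[] rows = new long[0];

    // No-arg constructor: the framework creates the object through this
    // and only afterwards fills it in via readFields().
    RowsSplit() { }

    // Constructor used on the client side, e.g. from getSplits().
    RowsSplit(long[] rows) { this.rows = rows; }

    // Serialize: write the length first, then every element.
    public void write(DataOutput out) throws IOException {
        out.writeInt(rows.length);
        for (long r : rows) {
            out.writeLong(r);
        }
    }

    // Deserialize: read the fields back in exactly the order they were written.
    public void readFields(DataInput in) throws IOException {
        rows = new long[in.readInt()];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = in.readLong();
        }
    }

    // Simulates the hand-off: serialize the original, create a fresh instance
    // reflectively through the no-arg constructor, then call readFields().
    static RowsSplit roundTrip(RowsSplit original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            RowsSplit copy = RowsSplit.class.getDeclaredConstructor().newInstance();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With both methods in place, a split built as
new RowsSplit(new long[] {0, 1, 499}) comes back from roundTrip() holding
the same three values. If readFields() were left empty, the copy would keep
whatever the no-arg constructor left behind, which is the kind of "broken"
split described in the original question.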
