Antonio,

If you're interested in open-sourcing this, I'd be interested in using it and helping. We do something similar internally, and I've been meaning to write a general-purpose version for a while. Would love to see what you've done and contribute back any changes we make. GitHub?

Thanks,
Kevin

On Fri, Dec 11, 2009 at 11:23 AM, Antonio D'Ettole <[email protected]> wrote:

> Philip,
>
> that was quick and precise. I learned something today. Thank you!
> Antonio
>
> On Fri, Dec 11, 2009 at 8:20 PM, Philip Zeyliger <[email protected]> wrote:
>
> > Hi Antonio,
> >
> > Check out MapTask.java. When your job gets instantiated on the cluster,
> > an InputSplit object is created for the task, using reflection. An
> > InputSplit is a Writable, and, like all writables, it gets created with
> > an empty constructor and initialized with readFields().
> >
> > If you implement write() and readFields() correctly (think of these as
> > serialization and de-serialization functions), it should all work. See
> > FileSplit for an example of how FileInputFormat does it.
> >
> > Cheers,
> >
> > -- Philip
> >
> > Here's a relevant code excerpt from MapTask.java:
> >
> > >   void runOldMapper(final JobConf job,
> > >                     final BytesWritable rawSplit,
> > >                     final TaskUmbilicalProtocol umbilical,
> > >                     TaskReporter reporter
> > >                     ) throws IOException, InterruptedException,
> > >                              ClassNotFoundException {
> > >     InputSplit inputSplit = null;
> > >     // reinstantiate the split
> > >     try {
> > >       inputSplit = (InputSplit)
> > >         ReflectionUtils.newInstance(job.getClassByName(splitClass), job);
> > >     } catch (ClassNotFoundException exp) {
> > >       IOException wrap = new IOException("Split class " + splitClass +
> > >                                          " not found");
> > >       wrap.initCause(exp);
> > >       throw wrap;
> > >     }
> > >     DataInputBuffer splitBuffer = new DataInputBuffer();
> > >     splitBuffer.reset(split.getBytes(), 0, split.getLength());
> > >     inputSplit.readFields(splitBuffer);
> >
> > On Fri, Dec 11, 2009 at 11:03 AM, Antonio D'Ettole <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I've been trying to code a pretty simple InputFormat. The idea is
> > > this: I have an array of numbers (say, the range [0-5000]) and I want
> > > each mapper to receive a split of size 500, i.e. 500 LongWritables.
> > >
> > > This is an excerpt from the class extending InputSplit:
> > >
> > > public class myInputSplit extends InputSplit implements Writable {
> > >
> > >     long[] rows;
> > >     myInputSplit(){ }
> > >
> > >     public myInputSplit(long[] rows) {
> > >         this.rows=rows;
> > >     }
> > >
> > >     .....
> > >
> > > }
> > >
> > > I also wrote the classes myInputFormat and myRecordReader (omitted).
> > >
> > > Now, the default constructor in the class above doesn't do much, but I
> > > had to put it there anyway because Hadoop was throwing an exception at
> > > runtime because it couldn't find said constructor. Obviously
> > > myInputFormat uses the right constructor with the long[] argument, but
> > > Hadoop somehow seems to give the mapper input splits which have been
> > > built using the default constructor, which is used nowhere in my code.
> > > I can tell because I put a breakpoint in the default constructor and
> > > yes, it is being called. As a result, all the input splits that are
> > > processed by the mappers are "broken", as the "rows" variable was
> > > never set.
> > > Interestingly, I also put a breakpoint in the _right_ constructor and
> > > it is also being called, by the getSplits() method in myInputFormat
> > > (which is what one would expect).
> > >
> > > Does anybody have an idea why the default constructor is being called?
> > >
> > > I hope I was clear enough, thanks for your time.
> > > Antonio
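
For anyone finding this thread later, here is a minimal sketch of the pattern Philip describes: an empty constructor for reflection, plus write() and readFields() that mirror each other. The names RowsSplit and RowsSplitDemo are hypothetical and not from Antonio's code. To keep the sketch runnable without Hadoop on the classpath, the class does not extend InputSplit or implement Writable; it only provides methods with the same signatures as Writable's write(DataOutput) and readFields(DataInput). The demo imitates, in a simplified way, what the MapTask excerpt does: create an empty instance reflectively, then restore its state from bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-in for the split class discussed in the thread.
// Assumption: in a real job this would extend InputSplit and implement
// Writable; those types are left out so the sketch runs without Hadoop.
class RowsSplit {
    private long[] rows = new long[0];

    // Empty constructor: the object is created through reflection first
    // and only afterwards filled in by readFields().
    public RowsSplit() { }

    // Constructor used on the client side, e.g. from getSplits().
    public RowsSplit(long[] rows) {
        this.rows = rows.clone();
    }

    public long[] getRows() {
        return rows.clone();
    }

    // Serialization: write the array length, then each element.
    public void write(DataOutput out) throws IOException {
        out.writeInt(rows.length);
        for (long row : rows) {
            out.writeLong(row);
        }
    }

    // De-serialization: read the fields back in the order write() emitted them.
    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        rows = new long[n];
        for (int i = 0; i < n; i++) {
            rows[i] = in.readLong();
        }
    }
}

class RowsSplitDemo {
    // Serializes a split, then rebuilds it from an empty, reflectively
    // created instance, similar in spirit to the MapTask excerpt.
    static RowsSplit roundTrip(RowsSplit original) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        RowsSplit copy = RowsSplit.class.getDeclaredConstructor().newInstance();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws Exception {
        RowsSplit copy = roundTrip(new RowsSplit(new long[] {0L, 1L, 2L}));
        System.out.println(Arrays.toString(copy.getRows())); // prints [0, 1, 2]
    }
}
```

If readFields() were left empty here, the copy would come back with no rows, which matches the "broken" splits Antonio saw in his mappers.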
