To the best of my knowledge, the only way to do this is if you have fixed-width records.
Think about it this way: as Alexandra mentioned, you only get byte offsets. If you split one file among 50 mappers, each mapper has its offset, but it has no idea what that offset means relative to the other splits, because it does not know how many lines came before it. Finding lines inherently involves a full scan, unless a) the record width is fixed, or b) you run a job beforehand that explicitly writes the line number into the file. I would think about what you want to do and whether you can avoid making it line dependent, or whether you can make each row a fixed number of bytes... (a rough, untested sketch of the fixed-width idea is at the bottom of this mail).

2011/5/18 Alexandra Anghelescu <[email protected]>

> Hi,
>
> It is hard to pick up certain lines of a text file - globally, I mean.
> Remember that the file is split according to its size (byte boundaries),
> not lines, so it is possible to keep track of the lines inside a split,
> but globally for the whole file, assuming it is split among map tasks... I
> don't think it is possible. I am new to Hadoop, but that is my take on it.
>
> Alexandra
>
> On Wed, May 18, 2011 at 2:41 PM, bnonymous <[email protected]> wrote:
>
> >
> > Hello,
> >
> > I'm trying to pick up certain lines of a text file (say the 1st and
> > 110th lines of a file with 10^10 lines). I need an InputFormat which
> > gives the Mapper the line number as the key.
> >
> > I tried to implement RecordReader, but I can't get line information from
> > InputSplit.
> >
> > Any solution to this?
> >
> > Thanks in advance!
> > --
> > View this message in context:
> > http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
>
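For what it's worth, here is a rough, untested sketch of the fixed-width idea using the new (org.apache.hadoop.mapreduce) API. The class name and the RECORD_WIDTH constant are just placeholders: it assumes every line (newline included) occupies exactly RECORD_WIDTH bytes and that the split size is a multiple of RECORD_WIDTH, so splits start on record boundaries. You would return it from a custom FileInputFormat's createRecordReader().

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Emits (global line number, line) pairs, assuming every record in the file
 * is exactly RECORD_WIDTH bytes and every split starts on a record boundary.
 */
public class FixedWidthLineRecordReader extends RecordReader<LongWritable, Text> {

  private static final int RECORD_WIDTH = 80; // hypothetical fixed record size in bytes

  private FSDataInputStream in;
  private long start;   // first byte of this split
  private long pos;     // current byte offset in the file
  private long end;     // first byte after this split
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();
  private final byte[] buf = new byte[RECORD_WIDTH];

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    Path file = split.getPath();
    start = split.getStart();
    end = start + split.getLength();
    pos = start;
    in = file.getFileSystem(conf).open(file);
    in.seek(start);
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (pos >= end) {
      return false;
    }
    in.readFully(buf);                 // read one fixed-width record
    key.set(pos / RECORD_WIDTH);       // byte offset / record width = global line number
    value.set(buf, 0, RECORD_WIDTH);
    pos += RECORD_WIDTH;
    return true;
  }

  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }

  @Override
  public float getProgress() {
    return (end == start) ? 0.0f
        : Math.min(1.0f, (pos - start) / (float) (end - start));
  }

  @Override
  public void close() throws IOException {
    if (in != null) {
      in.close();
    }
  }
}

The point is that the key is derived purely from the byte offset, so no mapper ever needs to know what the other splits contain - which is exactly what makes the fixed-width case tractable.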
