Hi,
If your records are structured, i.e. of equal size, then getting the line number is straightforward: divide the byte offset by the record length. If not, you'll need to construct your own sequence of numbers; someone has been kind enough to publish an approach on his blog:
http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html

Amogh

On 4/5/10 7:59 PM, "Michael Segel" <[email protected]> wrote:

> Date: Mon, 5 Apr 2010 14:57:09 +0100
> From: [email protected]
> To: [email protected]
> Subject: Get Line Number from InputFormat
>
> Dear all,
> TextInputFormat sends <Offset, Line> into the Mapper; however, the
> offset is sometimes meaningless and confusing. Is it possible to have an
> InputFormat which outputs <Line NO., Line> into the mapper?
>
> Thanks a lot.
>
> Song

Song,
I'm not sure what you want is realistic or even worthwhile.

You have a file and it's split into chunks of 64MB (default) or something larger based on your cloud settings. You have a map job that starts from a specific point in the file, but that does not mean that it's starting at a specific line, or that Hadoop will know which line in the file. (Your records are not always going to be based on the end of a line, or one line per record.)

Does that make sense?

Offset has more meaning than an arbitrary Line NO.

-Mike
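
To make the two cases in Amogh's reply concrete, here is a minimal sketch in plain Python (no Hadoop involved). The first function covers fixed-size records, where the line number follows directly from the byte offset. The rest shows one common two-pass way to build line numbers when records vary in length: count the lines in each split, take a running sum of those counts to get each split's starting line number, then number lines locally. This is not necessarily the approach in the linked post. All function names are hypothetical, and the split model is a simplification: it assumes a line belongs to the split that contains its first byte, while the real record reader treats some boundary cases differently.

```python
from itertools import accumulate


def line_number_fixed(offset: int, record_len: int) -> int:
    """0-based line number for fixed-size records.

    record_len is the full record size in bytes, newline included.
    """
    if offset % record_len != 0:
        raise ValueError("offset is not on a record boundary")
    return offset // record_len


def split_lines(data: bytes, split_size: int):
    """Assign each line to a byte-range split (simplified model).

    Assumption: a line belongs to the split containing its first byte.
    Returns one list of (offset, line) pairs per split.
    """
    n_splits = (len(data) + split_size - 1) // split_size
    splits = [[] for _ in range(n_splits)]
    offset = 0
    for line in data.splitlines(keepends=True):
        splits[offset // split_size].append((offset, line))
        offset += len(line)
    return splits


def number_lines(splits):
    """Two-pass numbering of lines spread over several splits."""
    # Pass 1: each split only reports how many lines it holds.
    counts = [len(s) for s in splits]
    # Exclusive running sum: the global number of each split's first line.
    starts = [0] + list(accumulate(counts))[:-1]
    # Pass 2: each split numbers its own lines from its start value.
    numbered = []
    for start, split in zip(starts, splits):
        for i, (offset, line) in enumerate(split):
            numbered.append((start + i, offset, line))
    return numbered


if __name__ == "__main__":
    data = b"a\nbbbb\ncc\nd\n"
    for line_no, offset, line in number_lines(split_lines(data, split_size=4)):
        print(line_no, offset, line)
```

Note that a single mapper only sees its own split, which is Mike's point: without the first pass and the running sum it cannot know how many lines came before it.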
