Hi all,
My apologies to anyone on the common-dev list; I mistakenly posted this
question there first.
I am a GSoC student working on the Mahout project, but at the moment I am
having difficulty using the Hadoop map/reduce API simply to read my data
into the program. Specifically, I am wondering how to generate
SequenceFiles from CSV files. The CSV files I am interested in are matrix
representations: each line corresponds to a row, and each comma-separated
value within a line corresponds to a column. I know that TextInputFormat
splits its input on newlines, but the key it provides is the line's byte
offset rather than its line number. Ideally, I'd like to turn each CSV
row into a Vector of its elements and use the line number as its key.
However, the byte offset could still be useful if, at the end of the M/R
job, I could sort all the Vectors by their keys and use that ordering to
assemble the matrix. The documentation states that no sorting occurs
after the Reduce task, or after the Map task if no Reduce is used, so
this approach seems unlikely to work. Would I instead need to define a
new InputFormat, or a new RecordReader, to create meaningful keys and
corresponding values? (I've sketched below what I imagine such a
RecordReader might look like, in case that clarifies the question.) Or is
there another strategy (counters?) that I could use to map the line
numbers of the CSV files to rows in the resulting matrix?
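
Here is the sketch, against the org.apache.hadoop.mapred API: a reader
that delegates to LineRecordReader but substitutes a running line count
for the byte-offset key. The LineNumberInputFormat/LineNumberRecordReader
names are just mine, and I'm assuming each file has to be read as a
single split for the count to be a global row index:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class LineNumberInputFormat
    extends FileInputFormat<LongWritable, Text> {

  // One split per file, so the line counter below is a global row index.
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new LineNumberRecordReader(job, (FileSplit) split);
  }

  static class LineNumberRecordReader
      implements RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate;
    private final LongWritable offset = new LongWritable();
    private long lineNumber = 0;

    LineNumberRecordReader(JobConf job, FileSplit split) throws IOException {
      delegate = new LineRecordReader(job, split);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      // The delegate reads the next line keyed by byte offset; discard
      // the offset and substitute a running line count instead.
      if (!delegate.next(offset, value)) {
        return false;
      }
      key.set(lineNumber++);
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
    public void close() throws IOException { delegate.close(); }
  }
}

If that single-split assumption is right, this gives up parallelism
within each file, which is part of why I'm wondering whether there is a
better-supported strategy.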
Thanks in advance!
Regards,
Shannon