Hi all,

I apologize to anyone on the common-dev list, as I mistakenly posted this question there first.

I am a GSoC student working on the Mahout project, but I am currently having difficulty using the Hadoop map/reduce API to read my data into the program in the first place. Specifically, I am wondering how to generate SequenceFiles from CSV files. The CSV files I am interested in are matrix representations: each line corresponds to a row, and each comma-separated value corresponds to a column. I know that TextInputFormat splits the input on newlines, but the key it provides is the byte offset of the line, not the line number. Ideally, I'd like to generate a Vector from each CSV row's elements and use the line number as its key.
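To make the intent concrete, here is a minimal plain-Java sketch of the parsing step I have in mind: each CSV line becomes a row of doubles, and its position in the file becomes its implicit line-number key. (This deliberately leaves out the Hadoop/Mahout types such as VectorWritable and the SequenceFile plumbing; the class and method names are just placeholders I made up for illustration.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CsvMatrixReader {

    /** Parse one CSV line into a row of doubles (one entry per column). */
    static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] row = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            row[i] = Double.parseDouble(fields[i].trim());
        }
        return row;
    }

    /** Read all lines; the index of each row in the list is its 0-based line number. */
    static List<double[]> readMatrix(BufferedReader in) throws IOException {
        List<double[]> rows = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            rows.add(parseRow(line));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("1,2,3\n4,5,6\n"));
        List<double[]> m = readMatrix(in);
        System.out.println(m.size());    // 2 rows
        System.out.println(m.get(1)[2]); // 6.0
    }
}
```

In a single sequential pass like this the line number is trivially available; the difficulty described above only arises because Hadoop splits the input and hands each mapper byte offsets instead.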

However, the byte offset could still be useful if, at the end of the M/R job, I could sort all the Vectors by their keys and use that ordering as the matrix. The documentation states that no sorting occurs after the Reduce task (or at the end of the Map task if no Reduce is used), so this approach seems unlikely to work. Would I instead need to define a new InputFormat, or a new RecordReader, to create meaningful keys and corresponding values? Or is there another strategy (counters?) I could use to map the line numbers of the CSV files to rows in the resulting matrix?
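For what it's worth, the offset-sorting idea rests on one observation: within a single file, TextInputFormat's byte-offset keys increase monotonically with line number, so sorting rows by offset recovers the original row order even if the rows arrive out of order. A small stand-alone sketch of that post-processing step (plain Java, no Hadoop types, names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OffsetSort {

    /** Return the byte-offset keys in ascending order, i.e. original row order. */
    static List<Long> orderedOffsets(Map<Long, double[]> rowsByOffset) {
        // TreeMap iterates its keys in ascending order.
        return new ArrayList<>(new TreeMap<>(rowsByOffset).keySet());
    }

    public static void main(String[] args) {
        // Pretend these (offset -> row) pairs arrived out of order from mappers.
        Map<Long, double[]> byOffset = new TreeMap<>();
        byOffset.put(12L, new double[] {4, 5, 6}); // second line of the file
        byOffset.put(0L,  new double[] {1, 2, 3}); // first line of the file
        byOffset.put(24L, new double[] {7, 8, 9}); // third line

        int rowIndex = 0;
        for (Long offset : orderedOffsets(byOffset)) {
            // The i-th smallest offset becomes matrix row i.
            System.out.println("row " + rowIndex++ + " <- offset " + offset);
        }
    }
}
```

Whether this mapping can be done inside the framework, or only in a separate pass like this one, is exactly the question above.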

Thanks in advance!

Regards,
Shannon