Hi all,
My apologies to anyone on the common-dev list; I mistakenly posted this
question there first.
I am a GSoC student working on the Mahout project, but at the moment I am
having difficulty using the Hadoop map/reduce API simply to read my data
into the program. Specifically, I am wondering how to generate
SequenceFiles from CSV files. The CSV files I am interested in are matrix
representations: each line corresponds to a row, and each comma-separated
value within a line corresponds to a column. I know that TextInputFormat
splits its input on newlines, but the key it provides is the line's byte
offset rather than its line number. Ideally, I'd like to turn each CSV
row into a Vector of its elements and use the line number as its key.
However, the byte offset could still be useful if, at the end of the M/R
job, I could sort all the Vectors by their keys and use that ordering to
assemble the matrix. The documentation states that no sorting occurs
after the Reduce task, or after the Map task if no Reduce is used, so
this approach seems unlikely to work. Would I instead need to define a
new InputFormat, or a new RecordReader, to create meaningful keys and
corresponding values? (I've sketched below what I imagine such a
RecordReader might look like, in case that clarifies the question.) Or is
there another strategy (counters?) that I could use to map the line
numbers of the CSV files to rows in the resulting matrix?
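
Here is the sketch, against the org.apache.hadoop.mapred API: a reader
that delegates to LineRecordReader but substitutes a running line count
for the byte-offset key. The LineNumberInputFormat/LineNumberRecordReader
names are just mine, and I'm assuming each file has to be read as a
single split for the count to be a global row index:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class LineNumberInputFormat
    extends FileInputFormat<LongWritable, Text> {

  // One split per file, so the line counter below is a global row index.
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new LineNumberRecordReader(job, (FileSplit) split);
  }

  static class LineNumberRecordReader
      implements RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate;
    private final LongWritable offset = new LongWritable();
    private long lineNumber = 0;

    LineNumberRecordReader(JobConf job, FileSplit split) throws IOException {
      delegate = new LineRecordReader(job, split);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      // The delegate reads the next line keyed by byte offset; discard
      // the offset and substitute a running line count instead.
      if (!delegate.next(offset, value)) {
        return false;
      }
      key.set(lineNumber++);
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
    public void close() throws IOException { delegate.close(); }
  }
}

If that single-split assumption is right, this gives up parallelism
within each file, which is part of why I'm wondering whether there is a
better-supported strategy.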
Thanks in advance!
Regards,
Shannon