[
https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081617#comment-13081617
]
XiaoboGu commented on MAHOUT-781:
---------------------------------
Here are the purposes of Csv2Seq:
1. Almost all raw data is in CSV format, which many Mahout algorithms cannot
consume directly. A command-line tool saves users from writing, compiling, and
debugging Java code every time they deal with a new CSV file schema.
2. To take advantage of Map-Reduce to do the data transformation quickly. This
transformation is a bottleneck for many algorithms, such as
AdaptiveLogisticRegression.
For your spec-related questions, here are the answers:
1. This is just a utility tool. The core functionality of dealing with the CSV
schema comes from CsvRecordFactory, which is in Mahout-core. It is
CsvRecordFactory's responsibility to deal with the many problems you have
asked about, and I have found that CsvRecordFactory does a great job.
2. In many practical use cases, the CSV header, and even the predictor columns
and their types, are too long to write on the command line. Storing them in a
separate local file and letting the tool read them is more convenient.
3. Here is the specification of the command-line parameters:
input : the HDFS path of the CSV files to convert
output : the target HDFS directory in which to write the sequence file
target : the name of the target/label variable/column
header : the path of a local file containing the CSV header content
key : the name of the key variable/column
predictors : a list of predictor variables
types : a list of predictor variable types (numeric, word, or text)
categories : the number of target categories to be considered
features : the number of internal hashed features to use
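To make the parameter list above concrete, here is a minimal sketch of how
such options might be collected. The "--name value" syntax, the class name
Csv2SeqOptions, and the example paths are assumptions for illustration only;
the attached patch may parse its options differently. The only behaviour taken
from this comment is the set of option names and the fallback of key to target.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the patch.
final class Csv2SeqOptions {
    // Option names come from the list above; the "--name value" syntax is assumed.
    static final List<String> KNOWN = Arrays.asList(
        "input", "output", "target", "header", "key",
        "predictors", "types", "categories", "features");

    // Collects "--name value" pairs into a map, rejecting unknown names.
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            String flag = args[i];
            if (!flag.startsWith("--") || !KNOWN.contains(flag.substring(2))) {
                throw new IllegalArgumentException("Unknown option: " + flag);
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + flag);
            }
            opts.put(flag.substring(2), args[i + 1]);
        }
        return opts;
    }

    // Per the description, the key column falls back to the target column.
    static String keyColumn(Map<String, String> opts) {
        return opts.getOrDefault("key", opts.get("target"));
    }

    public static void main(String[] args) {
        // Example values are made up.
        Map<String, String> opts = parse(new String[] {
            "--input", "/data/in", "--output", "/data/out",
            "--target", "label", "--features", "1000"});
        System.out.println(opts + " key column: " + keyColumn(opts));
    }
}
```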
target, key, predictors, types, and categories are used to construct a
CsvRecordFactory object inside each Csv2SeqMapper object. Before processing
records, the mapper calls CsvRecordFactory's firstLine method with the CSV
header content as the parameter. Every record is then processed by
CsvRecordFactory's processLine method: a RandomAccessSparseVector(numFeatures)
object is created for each record, and the resulting Vector is written to the
SequenceFile. new Text(key) is the SequenceFile key; if the user does not
specify key, target is used as the key. This is the exact input SequenceFile
format required by Naïve Bayes.
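The per-record flow described above can be sketched without Mahout or Hadoop
on the classpath. This is a simplified stand-in, not the code from the patch:
the class Csv2SeqSketch is hypothetical, a TreeMap plays the role of the
sparse vector, fields are split on plain commas with no quoting support, and
every predictor is hashed as if its type were "word".

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the mapper's use of a record factory.
final class Csv2SeqSketch {
    private final List<String> predictors;
    private final String keyColumn;
    private final int numFeatures;
    private List<String> header;

    Csv2SeqSketch(List<String> predictors, String target, String key, int numFeatures) {
        this.predictors = predictors;
        // If no key column is given, the target column is used as the key.
        this.keyColumn = key != null ? key : target;
        this.numFeatures = numFeatures;
    }

    // Called once with the CSV header before any record is processed.
    void firstLine(String headerLine) {
        header = Arrays.asList(headerLine.split(","));
    }

    private int column(String name) {
        int i = header.indexOf(name);
        if (i < 0) {
            throw new IllegalArgumentException("No such column: " + name);
        }
        return i;
    }

    // Turns one CSV record into a (key, sparse vector) pair.
    Map.Entry<String, Map<Integer, Double>> processLine(String line) {
        String[] fields = line.split(",", -1);
        Map<Integer, Double> vector = new TreeMap<>();
        for (String p : predictors) {
            // Hash "name=value" into one of numFeatures slots.
            int slot = Math.floorMod((p + "=" + fields[column(p)]).hashCode(), numFeatures);
            vector.merge(slot, 1.0, Double::sum);
        }
        return new AbstractMap.SimpleEntry<>(fields[column(keyColumn)], vector);
    }

    public static void main(String[] args) {
        // Column names and values are made up for the example.
        Csv2SeqSketch sketch =
            new Csv2SeqSketch(Arrays.asList("color", "size"), "label", null, 16);
        sketch.firstLine("id,color,size,label");
        System.out.println(sketch.processLine("7,red,large,yes"));
    }
}
```

In the real job, the pair would be written out as a Text key and a
VectorWritable value rather than printed.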
> universal map-reduce job to convert csv file to vectorwritable sequencefile
> ---------------------------------------------------------------------------
>
> Key: MAHOUT-781
> URL: https://issues.apache.org/jira/browse/MAHOUT-781
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Affects Versions: 0.6
> Reporter: XiaoboGu
> Priority: Minor
> Attachments: csv2seq.patch, csv2seq.patch, test-data.zip
>
>