[ 
https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081617#comment-13081617
 ] 

XiaoboGu commented on MAHOUT-781:
---------------------------------

Here are the purposes of Csv2Seq:
1. Almost all raw data is in CSV format, which many Mahout algorithms can't 
consume directly. A command-line tool saves users from writing, compiling, and 
debugging Java code every time they deal with a new CSV file schema.
2. To take advantage of Map-Reduce to do the data transformation quickly; this 
step is a bottleneck for many algorithms such as AdaptiveLogisticRegression.

For your spec-related questions, here are the answers:
1. This is just a utility tool. The core functionality for dealing with the CSV 
schema comes from CsvRecordFactory, which is in Mahout-core. It is 
CsvRecordFactory's responsibility to deal with the many problems you asked 
about, and I found that CsvRecordFactory does a great job.
2. In many practical use cases, the CSV header, and even the predictor columns 
and their types, are too long to write on the command line. Storing them in a 
separate local file and letting the tool read them is more convenient.
3. Here is the command-line parameter specification:
input : the HDFS path of the CSV files to convert
output : the target HDFS directory to write the resulting sequence file to
target : the name of the target/label variable/column
header : a local file path containing the CSV header content
key : the name of the key variable/column
predictors : a list of predictor variables
types : a list of predictor variable types (numeric, word, or text)
categories : the number of target categories to be considered
features : the number of internal hashed features to use
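
To make the parameter list concrete, here is a minimal sketch of how these 
values could be collected into one object. It is not code from the attached 
patch: the class name Csv2SeqOptions, the "--name value" flag syntax, the 
comma-separated list format, and the choice of which parameters are mandatory 
are all assumptions made for illustration. Only the parameter names and the 
fallback of key to target come from the description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for the parameters listed above; not code from the patch. */
final class Csv2SeqOptions {
    final String input, output, target, header, key;
    final List<String> predictors, types;
    final int categories, features;

    private Csv2SeqOptions(Map<String, String> m) {
        input = require(m, "input");
        output = require(m, "output");
        target = require(m, "target");
        header = require(m, "header");
        // From the description: when key is not given, target is used instead.
        key = m.getOrDefault("key", target);
        // Assumption: list-valued parameters are comma-separated.
        predictors = Arrays.asList(require(m, "predictors").split(","));
        types = Arrays.asList(require(m, "types").split(","));
        if (predictors.size() != types.size()) {
            throw new IllegalArgumentException("predictors and types must have the same length");
        }
        categories = Integer.parseInt(require(m, "categories"));
        features = Integer.parseInt(require(m, "features"));
    }

    private static String require(Map<String, String> m, String name) {
        String value = m.get(name);
        if (value == null) {
            throw new IllegalArgumentException("missing --" + name);
        }
        return value;
    }

    /** Parses "--name value" pairs (assumed syntax) into an options object. */
    static Csv2SeqOptions parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected --name value pairs");
        }
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("unexpected argument: " + args[i]);
            }
            m.put(args[i].substring(2), args[i + 1]);
        }
        return new Csv2SeqOptions(m);
    }
}
```

Validating that predictors and types have the same length up front means a 
mismatch is reported before any job is submitted rather than inside a mapper.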

target, key, predictors, types, and categories are used to create a 
CsvRecordFactory object inside each Csv2SeqMapper object. Before processing 
records, the mapper calls CsvRecordFactory's firstLine method with the CSV 
header content as the parameter; every record is then processed by 
CsvRecordFactory's processLine method. A RandomAccessSparseVector(numFeatures) 
object is created for each record, and the resulting Vector is written to the 
SequenceFile with new Text(key) as the SequenceFile key. If the user does not 
specify key, target is used as the key. This is the exact input SequenceFile 
format required by Naïve Bayes.
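
The per-record flow described above can be sketched in plain Java. This is a 
stand-in, not Mahout code and not the patch: Csv2SeqFlowSketch is a 
hypothetical class, a Map<Integer, Double> plays the role of the sparse 
vector, the hashing scheme is a simple hashCode modulo numFeatures, CSV 
parsing is a naive split on commas without quoting support, and processLine 
here returns the key value as a simplification. It only mirrors the sequence 
of steps: read the header once, then encode each record and derive its key.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the mapper's per-record flow; not Mahout code. */
final class Csv2SeqFlowSketch {
    private final String keyColumn;
    private final Map<String, String> predictorTypes = new LinkedHashMap<>();
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final int numFeatures;

    Csv2SeqFlowSketch(String target, String key, List<String> predictors,
                      List<String> types, int numFeatures) {
        if (predictors.size() != types.size() || numFeatures <= 0) {
            throw new IllegalArgumentException("bad predictor/type/feature configuration");
        }
        // From the description: target is used as the key when key is not specified.
        this.keyColumn = (key == null) ? target : key;
        for (int i = 0; i < predictors.size(); i++) {
            predictorTypes.put(predictors.get(i), types.get(i));
        }
        this.numFeatures = numFeatures;
    }

    /** Called once with the CSV header before any record is processed. */
    void firstLine(String header) {
        String[] names = header.split(",");
        for (int i = 0; i < names.length; i++) {
            columnIndex.put(names[i].trim(), i);
        }
        if (!columnIndex.containsKey(keyColumn)
                || !columnIndex.keySet().containsAll(predictorTypes.keySet())) {
            throw new IllegalArgumentException("header is missing a configured column");
        }
    }

    /** Encodes one record into the sparse features map and returns its key value. */
    String processLine(String line, Map<Integer, Double> features) {
        String[] cells = line.split(",", -1); // naive: no quoted fields
        for (Map.Entry<String, String> p : predictorTypes.entrySet()) {
            String value = cells[columnIndex.get(p.getKey())].trim();
            switch (p.getValue()) {
                case "numeric": // one slot per column, weighted by the number
                    features.merge(slot(p.getKey()), Double.parseDouble(value), Double::sum);
                    break;
                case "word": // one slot per (column, value) pair
                    features.merge(slot(p.getKey() + ":" + value), 1.0, Double::sum);
                    break;
                case "text": // one slot per (column, token) pair
                    for (String token : value.split("\\s+")) {
                        if (!token.isEmpty()) {
                            features.merge(slot(p.getKey() + ":" + token), 1.0, Double::sum);
                        }
                    }
                    break;
                default:
                    throw new IllegalArgumentException("unknown type: " + p.getValue());
            }
        }
        return cells[columnIndex.get(keyColumn)].trim();
    }

    private int slot(String name) {
        return Math.floorMod(name.hashCode(), numFeatures);
    }
}
```

In the real job, the returned key would be wrapped in a Text and the feature 
vector in a VectorWritable before being written to the SequenceFile; those 
steps are omitted here to keep the sketch free of Hadoop and Mahout 
dependencies.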



> universal map-reduce job to convert csv file to vectorwritable sequencefile
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-781
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-781
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: XiaoboGu
>            Priority: Minor
>         Attachments: csv2seq.patch, csv2seq.patch, test-data.zip
>
>


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

