[ 
https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080769#comment-13080769
 ] 

Dan Brickley edited comment on MAHOUT-781 at 8/8/11 6:54 AM:
-------------------------------------------------------------

A utility does sound useful. Good idea, Xiaobo.

I was happy to find Danny Bickson's post here - 
http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html - 
which offers a simple CSV importer. It takes sparse from/to/value affinity tuples 
and converts them into (and out of) a Mahout binary representation. Would 
MAHOUT-781 include this functionality?

It would be good to have a spec. There are lots of subtle variations on the CSV 
theme.

Can lines containing #-prefixed comments be included? Are extra blank lines 
acceptable or do they cause an error? Are header fields represented somehow 
inline, or only in a separate --header document? Is whitespace between field 
values discarded, included in the values we pass on, or considered invalid?
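
To make these questions concrete, here is a rough Python sketch of one 
possible set of answers. It is not Mahout code and not a proposal for the 
patch; the function name parse_triples and every policy choice in it are 
assumptions for illustration only. The assumed policy: whole-line # comments 
and blank lines are skipped, whitespace around field values is discarded, 
there is no inline header, and anything else is an error.

```python
import csv
from typing import Dict, Iterable, Tuple


def parse_triples(lines: Iterable[str]) -> Dict[Tuple[int, int], float]:
    """Parse sparse from,to,value lines under one hypothetical policy."""
    matrix: Dict[Tuple[int, int], float] = {}
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Assumed policy: blank lines and whole-line #-comments are skipped.
        if not line or line.startswith("#"):
            continue
        # Assumed policy: whitespace around each field value is discarded.
        fields = [field.strip() for field in next(csv.reader([line]))]
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        # Assumed policy: no inline header, so a non-numeric field
        # (for example a header row) raises ValueError here.
        row, col, value = int(fields[0]), int(fields[1]), float(fields[2])
        matrix[(row, col)] = value
    return matrix


if __name__ == "__main__":
    sample = ["# from,to,value", "", "1, 2, 0.5", "3,4,1.0"]
    print(parse_triples(sample))
```

A spec would need to say which of these choices is actually intended, since 
each one changes which input files are accepted.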

If the utility also covers conversion back to CSV, it should be possible to 
test round-tripping...
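
A round-trip test could look roughly like the Python sketch below. It uses an 
in-memory dict as a stand-in for the binary representation, and the names 
to_csv_lines and from_csv_lines are hypothetical, not part of Mahout or of 
the attached patch.

```python
from typing import Dict, List, Tuple

Matrix = Dict[Tuple[int, int], float]


def to_csv_lines(matrix: Matrix) -> List[str]:
    """Serialize sparse entries as from,to,value lines in sorted order."""
    # repr() of a Python float parses back to the same float value.
    return [f"{r},{c},{v!r}" for (r, c), v in sorted(matrix.items())]


def from_csv_lines(lines: List[str]) -> Matrix:
    """Parse from,to,value lines back into a sparse dict."""
    matrix: Matrix = {}
    for line in lines:
        r, c, v = line.split(",")
        matrix[(int(r), int(c))] = float(v)
    return matrix


if __name__ == "__main__":
    original = {(0, 1): 0.5, (2, 0): 1.25, (2, 3): 0.1}
    # Round trip: data -> CSV -> data must give back the same entries.
    assert from_csv_lines(to_csv_lines(original)) == original
    print("round trip ok")
```

Note that this checks the values only. Going the other way (CSV to binary 
and back to CSV), comments, blank lines and whitespace would presumably not 
survive, so a spec should say whether textual round-tripping is expected.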

> universal map-reduce job to convert csv file to vectorwritable sequencefile
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-781
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-781
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: XiaoboGu
>            Priority: Minor
>         Attachments: csv2seq.patch, csv2seq.patch, test-data.zip
>
>


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
