[ 
https://issues.apache.org/jira/browse/MAHOUT-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022152#comment-13022152
 ] 

Stanley Xu commented on MAHOUT-677:
-----------------------------------

Hi Sean,

First, most of the time is spent not on IO or even on parsing, but on copying a 
String over and over, which is also mentioned in your book <Mahout In Action>, 
and I reproduced the same result with a test. The code that costs most of the 
time is:
encoder[i].addToVector(x.get(i), v);

This copies a String once and, I guess, has to compute the hashCode once 
again. So for the SGD algorithm, what should actually be avoided is the 
conversion to a String while parsing the data.
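
To illustrate what avoiding the String conversion can look like, here is a 
minimal sketch that reads a double straight from the bytes of a line. It is 
not the code in the patch or in SimpleCsvExamples; the class and method names 
are hypothetical, and it assumes ASCII input with plain decimals only (no 
exponent, NaN, or Infinity).

```java
import java.nio.charset.StandardCharsets;

public class ByteDoubleParser {

    // Hypothetical helper, not part of Mahout: parses a plain decimal such as
    // "-12.5" from buf[start, end) without creating a String.
    // Assumes few enough digits that the long accumulators do not overflow.
    static double parseDouble(byte[] buf, int start, int end) {
        int i = start;
        boolean negative = false;
        if (i < end && (buf[i] == '-' || buf[i] == '+')) {
            negative = buf[i] == '-';
            i++;
        }
        long mantissa = 0;   // all digits, with the decimal point ignored
        long divisor = 1;    // 10^(number of digits after the decimal point)
        boolean seenPoint = false;
        for (; i < end; i++) {
            byte c = buf[i];
            if (c == '.' && !seenPoint) {
                seenPoint = true;
            } else if (c >= '0' && c <= '9') {
                mantissa = mantissa * 10 + (c - '0');
                if (seenPoint) {
                    divisor *= 10;
                }
            } else {
                throw new NumberFormatException("unexpected byte at offset " + i);
            }
        }
        double value = (double) mantissa / divisor;
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        byte[] line = "3.25,-0.5,42".getBytes(StandardCharsets.US_ASCII);
        int fieldStart = 0;
        // Walk the bytes once, parsing each comma-separated field in place.
        for (int i = 0; i <= line.length; i++) {
            if (i == line.length || line[i] == ',') {
                System.out.println(parseDouble(line, fieldStart, i));
                fieldStart = i + 1;
            }
        }
    }
}
```

The fields are located and parsed in a single pass over the byte array, so no 
intermediate String is allocated per field.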

I agree that the performance could be optimized further with a customized 
binary input format. But I think the example here is good enough, since it 
proves the idea and is easy to read. Using a customized binary format might 
make the code or data hard to read, and in my experience a binary protocol 
like Thrift is even slower at parsing the data than a customized parser for 
plain text.

Anyway, it is your call. Why don't you ask the author of Chapter 16.3.4 of 
<Mahout In Action> to decide whether you guys need a better example or should 
just use the patch here?

> The SimpleCsvExamples didn't really parsed the double correctly with the 
> FastLine and FastLineReader
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-677
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-677
>             Project: Mahout
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 0.5
>            Reporter: Stanley Xu
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: simplecsvexamplebugfix.diff
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The FastLineReader in SimpleCsvExamples.java tries to parse the line quickly 
> by parsing the bytes directly from the stream, without the cost of copying 
> Strings. But it does not parse the line correctly, and all double values 
> come out as zero in fast parsing mode.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
