[ 
https://issues.apache.org/jira/browse/MAHOUT-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012482#comment-13012482
 ] 

Sean Owen commented on MAHOUT-633:
----------------------------------

On new Writables: I agree, I don't think it can be faster to allocate many 
Writables. This is really the big possible problem with what this patch does. 

I found that in maybe 60% of cases the key is discarded, so, wrote special 
support for that which would not make a new key object on each iteration.

And, in about half the cases, the caller is cloning the key and/or value 
because it wants to save a copy. So in some cases it's already making new 
objects.

I had in mind that this factor is probably dwarfed by I/O and the actual 
deserialization... right? I had the impression these Hadoop jobs were most 
certainly I/O bound, not memory/GC/CPU bound.

There's a subtle argument for correctness too... it's a bit all-to-easy to 
forget to clone a key or value and get bitten with a subtle bug. But this is 
more of a minor practical argument.

And of course the whole point of the exercise is to standardize the code, since 
from touring the code it was clear there were many approaches (e.g. 
makeQualified() or not) and occasional correctness issues (e.g. no call to 
close()).

That's my reasoning. Thoughts on this thinking? I suppose my gut is that this 
isn't a big deal, but I don't have evidence.

> Add SequenceFileIterable; put Iterable stuff in one place
> ---------------------------------------------------------
>
>                 Key: MAHOUT-633
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-633
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: iterable, iterator, sequence-file
>             Fix For: 0.5
>
>         Attachments: MAHOUT-633.patch, MAHOUT-633.patch, MAHOUT-633.patch
>
>
> In another project I have a useful little class, SequenceFileIterable, which 
> simplifies iterating over a sequence file. It's like FileLineIterable. I'd 
> like to add it, then use it throughout the code. See patch, which for now 
> merely has the proposed new classes. 
> Well it also moves some other iterator-related classes that seemed to be 
> outside their rightful home in common.iterator.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to