[ 
https://issues.apache.org/jira/browse/MAHOUT-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012634#comment-13012634
 ] 

Dmitriy Lyubimov commented on MAHOUT-633:
-----------------------------------------

bq. And, in about half the cases, the caller is cloning the key and/or value 
because it wants to save a copy. So in some cases it's already making new 
objects.

Yes, that's true. We can't prevent people from screwing it up. We can only give 
them a chance not to. And I want that chance for myself.
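
To make the trade-off concrete, here is a minimal sketch in plain Java with no 
Hadoop dependency. Record and reusingIterator are hypothetical names standing in 
for a Writable and a sequence-file iterator; this is not the API from the 
MAHOUT-633 patch. It only shows that an iterator which refills one object lets a 
streaming caller avoid per-record allocation, while a caller that keeps records 
has to copy them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReuseSketch {

  /** Hypothetical mutable record, standing in for a Hadoop Writable. */
  public static final class Record {
    public int value;

    public Record copy() {
      Record r = new Record();
      r.value = value;
      return r;
    }
  }

  /** Returns an iterator that refills and hands back the SAME Record each time. */
  public static Iterator<Record> reusingIterator(final int[] data) {
    return new Iterator<Record>() {
      private final Record shared = new Record();
      private int pos = 0;

      @Override
      public boolean hasNext() {
        return pos < data.length;
      }

      @Override
      public Record next() {
        if (pos >= data.length) {
          throw new NoSuchElementException();
        }
        shared.value = data[pos++];
        return shared;
      }

      @Override
      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }

  /** Streaming caller: reads each record once, so nothing is allocated per record. */
  public static long sum(int[] data) {
    long total = 0;
    for (Iterator<Record> it = reusingIterator(data); it.hasNext();) {
      total += it.next().value;
    }
    return total;
  }

  /** Retaining caller: must copy, otherwise every list entry is the same object. */
  public static List<Record> collect(int[] data, boolean copy) {
    List<Record> kept = new ArrayList<Record>();
    for (Iterator<Record> it = reusingIterator(data); it.hasNext();) {
      Record r = it.next();
      kept.add(copy ? r.copy() : r);
    }
    return kept;
  }

  public static void main(String[] args) {
    int[] data = {1, 2, 3};
    System.out.println(sum(data));                       // 6
    System.out.println(collect(data, true).get(0).value);  // 1 (copied)
    System.out.println(collect(data, false).get(0).value); // 3 (aliased)
  }
}
```

The caller that forgets to copy gets the last value in every slot, which is the 
mistake we can't prevent. The caller that only streams pays for one Record in 
total, which is the chance I'm asking for.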

bq. I had in mind that this factor is probably dwarfed by I/O and the actual 
deserialization... right? I had the impression these Hadoop jobs were most 
certainly I/O bound, not memory/GC/CPU bound.

Not in SSVD. It packs parts of a massive-scale QR and the stochastic projection 
into one map step, and I measured 98.8% average CPU saturation. That basically 
told me I wasn't wasteful on I/O, which I tried pretty hard not to be. QR 
algorithms are quadratic, even though we reduce the scale of the problem. I am 
still a little bit wasteful on flops here, but it's not dramatic and it's got 
to be good enough for open source. So this near-limit-memory GC stuff will 
affect me very much. (I built a series of jobs with similar dynamics before in 
Java 1.5; it was pretty bad, up to 50 times slower, until I employed the 
strategies I described above.)


> Add SequenceFileIterable; put Iterable stuff in one place
> ---------------------------------------------------------
>
>                 Key: MAHOUT-633
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-633
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: iterable, iterator, sequence-file
>             Fix For: 0.5
>
>         Attachments: MAHOUT-633.patch, MAHOUT-633.patch, MAHOUT-633.patch
>
>
> In another project I have a useful little class, SequenceFileIterable, which 
> simplifies iterating over a sequence file. It's like FileLineIterable. I'd 
> like to add it, then use it throughout the code. See patch, which for now 
> merely has the proposed new classes. 
> Well it also moves some other iterator-related classes that seemed to be 
> outside their rightful home in common.iterator.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
