[ 
https://issues.apache.org/jira/browse/MAHOUT-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012724#comment-13012724
 ] 

Dmitriy Lyubimov commented on MAHOUT-633:
-----------------------------------------

bq.How much of that time was due to GC?

In the tests I ran, not significant. But I was using a preprocessing handler on 
VectorWritable (the patch you guys were reluctant to accept), which did not 
create intermediate vector storage at all. All matrix elements were passed in 
on the stack. Mahout's patch doesn't have this code, but I will be happy to put 
it in JIRA for discussion. Also, I did not run close to memory limits on the 
tasks I ran with SSVD. I just don't have datasets that big. 
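
To make the idea concrete, here is a minimal sketch of a handler-style reader 
that passes each element to a callback as primitives instead of building a 
vector object first. Everything in it is an assumption made for illustration: 
the names ElementHandler and StreamingVectorReader are hypothetical, and the 
toy wire format (a count followed by index/value pairs) is not VectorWritable's 
actual serialization, nor the code from the patch mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical callback: receives each element as primitives, no vector object. */
interface ElementHandler {
    void onElement(int index, double value);
}

final class StreamingVectorReader {

    /**
     * Assumed toy wire format (NOT VectorWritable's real one):
     * an int count, then count pairs of (int index, double value).
     * Elements go straight to the handler; nothing is stored in between.
     */
    static void read(DataInput in, ElementHandler handler) throws IOException {
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            int index = in.readInt();
            double value = in.readDouble();
            handler.onElement(index, value);
        }
    }

    /** Test helper: encodes a sparse vector in the toy format. */
    static byte[] encode(int[] indices, double[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(indices.length);
            for (int i = 0; i < indices.length; i++) {
                out.writeInt(indices[i]);
                out.writeDouble(values[i]);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Example consumer: accumulates a dot product with a dense vector as elements stream in. */
    static double dot(byte[] encoded, double[] dense) {
        double[] acc = new double[1]; // single accumulator, allocated once per record
        try {
            read(new DataInputStream(new ByteArrayInputStream(encoded)),
                 (index, value) -> acc[0] += value * dense[index]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return acc[0];
    }
}
```

With this shape, the per-element work allocates nothing; the only allocations 
are per record (the stream wrappers and the accumulator), which is the property 
the handler approach is after.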

But I ran other code that did, and like I said, running-time losses were 
significant, up to an order of magnitude (in JVM 1.5).

bq.Do you have any evidence that simpler techniques that cause ephemeral 
garbage are increasing your memory pressure?

I am not sure I understand this. If you are asking whether memory use per se 
increases because you are using tons of short-lived references instead of 
one 'old' gen reference, no, I don't believe that effect would be significant 
enough to be construed as detrimental. It's just that in near-limit memory use 
you'd burn significantly more CPU on this; actually, a surprising amount of CPU 
on 64-bit systems with big heaps and a lot of references in them (>40 second 
global pauses, i.e. 4 times that on a per-CPU basis, per full GC run on 12 GB). 
That's all. It's only apparent in jobs that need a lot of side info to run.
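
One rough way to check how much of a task's run time goes to GC is to read the 
JVM's collector counters before and after the work. The sketch below uses the 
standard java.lang.management API (available since Java 5); the GcTimer class 
and its method names are hypothetical. Note that the reported collection time 
is the collector's accumulated elapsed time, which for concurrent collectors is 
not the same thing as stop-the-world pause time, so treat the number as an 
approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimer {

    /** Sums accumulated collection time (ms) over all collectors; undefined (-1) values are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the task and returns {wall-clock millis, GC millis accumulated during the run}. */
    static long[] measure(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long wallMillis = (System.nanoTime() - start) / 1_000_000L;
        long gcMillis = totalGcMillis() - gcBefore;
        return new long[] {wallMillis, gcMillis};
    }
}
```

The counters are JVM-wide, so GC triggered by other threads during the run is 
attributed to the task as well; for a single-purpose task JVM that is usually 
an acceptable simplification.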


> Add SequenceFileIterable; put Iterable stuff in one place
> ---------------------------------------------------------
>
>                 Key: MAHOUT-633
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-633
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: iterable, iterator, sequence-file
>             Fix For: 0.5
>
>         Attachments: MAHOUT-633.patch, MAHOUT-633.patch, MAHOUT-633.patch
>
>
> In another project I have a useful little class, SequenceFileIterable, which 
> simplifies iterating over a sequence file. It's like FileLineIterable. I'd 
> like to add it, then use it throughout the code. See patch, which for now 
> merely has the proposed new classes. 
> Well it also moves some other iterator-related classes that seemed to be 
> outside their rightful home in common.iterator.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
