[ 
https://issues.apache.org/jira/browse/MAHOUT-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011904#comment-13011904
 ] 

Dmitriy Lyubimov commented on MAHOUT-633:
-----------------------------------------

{quote}
I think that creating a new Writable each time is probably a good thing on the 
whole. I doubt seriously that it will be any slower as long as it avoids 
unnecessary copying of a large structure. If you avoid large copies then new 
allocation can actually be better than re-use since new allocation keeps the 
garbage reclamation work in the newspace collector which is considerably better 
than letting stuff get copied several times and possibly even tenured.{quote}

So we are clear, we are talking about
{code:title=A}
Writable w = ReflectionUtils.newInstance(...);
for (int i = 0; i < 100000000; i++) {
  // ... do something with w ...
}
{code}
vs.
{code:title=B}
for (int i = 0; i < 100000000; i++) {
  Writable w = ReflectionUtils.newInstance(...);
  // ... do something with w ...
}
{code}

... and you are essentially saying that B is faster than A. I am not a GC expert, so I 
reserve the right to be wrong, but I am dubious about it because of the benchmarks and 
simulations I have run on the JVM.
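
For concreteness, here is a rough sketch of the kind of micro-benchmark I have in mind 
(the Text type, the iteration count and the per-iteration work are just placeholders, not 
what Mahout actually does; a real run would of course need warm-up and -verbose:gc to 
mean anything):

{code:title=Micro-benchmark sketch}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class AllocBench {

  private static final int N = 100000000;

  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Pattern A: create one instance up front and reuse it in every iteration.
    long start = System.nanoTime();
    Writable w = ReflectionUtils.newInstance(Text.class, conf);
    for (int i = 0; i < N; i++) {
      ((Text) w).set("payload");                       // placeholder work
    }
    System.out.printf("A (reuse): %.1f ms%n", (System.nanoTime() - start) / 1e6);

    // Pattern B: create a fresh instance inside the loop on every iteration.
    start = System.nanoTime();
    for (int i = 0; i < N; i++) {
      Writable v = ReflectionUtils.newInstance(Text.class, conf);
      ((Text) v).set("payload");                       // placeholder work
    }
    System.out.printf("B (fresh): %.1f ms%n", (System.nanoTime() - start) / 1e6);
  }
}
{code}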

1) It's one Writable (a handful of references at most) vs. a fresh young-generation 
allocation in every iteration. Sure, YGC and allocation are fast, but I am dubious that 
they are faster than not doing anything at all. In fact, we want that single object to 
tenure and even reach the permanent pool so the YGC 'forgets' about it and stops checking 
it; but even if that doesn't happen we don't care much, since it is only one object per 
collection. The overhead, as far as I understand, then only shows up during full GCs, 
which are rarer than YGCs.
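
If it helps, the generational behavior is straightforward to observe from inside the 
process: the standard management API reports per-collector counts, so you can see how 
many young vs. full collections each of the two loops actually triggered. A small helper 
along these lines (collector names vary by JVM and collector; "PS Scavenge"/"PS MarkSweep" 
are just what the parallel collector reports):

{code:title=Observing young vs. full collections}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounts {

  /** Print how many times each collector has run so far and how long it spent. */
  public static void dump(String label) {
    System.out.println("=== " + label + " ===");
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      // e.g. "PS Scavenge" (young) and "PS MarkSweep" (full) with the parallel collector
      System.out.printf("%-15s count=%d time=%dms%n",
          gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
    }
  }
}
{code}

Calling dump() before and after each loop shows which generation is doing the work in 
pattern A vs. pattern B.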

2) What I usually witness is that the dynamics change quite a bit when you approach the 
memory limit. Full GCs happen more frequently in that situation (perhaps you can fight 
that by shrinking the young generation, but I couldn't, and it's still a hassle to tune). 
Full GCs are also quite long-running. In fact, this problem is so bad that HBase folks 
told me they had cases where a full GC, even with a 12G heap, caused pauses long enough 
to break a 40-second ZooKeeper session to a region server (causing node die-offs). So 
they in fact recommend longer ZK sessions for larger heaps just because of GC!

So either you don't approach the limit (i.e. you don't use all the RAM you paid for), or 
you preallocate stuff and let it tenure without adding much new into the mix.

In practice, what I found in my benchmarks is that allocating new objects that never 
leave the young generation is indeed better than yanking them from an object pool, 
whether pessimistic (lock-based) or optimistic (i.e. built on compare-and-swap 
operations; the optimistic ones were, surprisingly, only marginally better). But things 
start to change quite dramatically as soon as you try to fill, say, 5G out of 6 and 
factor in all the tenuring and full GC overhead.
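
To be clear about the terminology, by an 'optimistic' pool I mean something along the 
lines of the sketch below: a lock-free free-list built on a CAS-based queue (an 
illustrative sketch only, not the exact pool I benchmarked):

{code:title=Optimistic (CAS-based) pool sketch}
import java.util.concurrent.ConcurrentLinkedQueue;

/** Lock-free object pool: borrow() and release() use only CAS, never a lock. */
public class OptimisticPool<T> {

  public interface Factory<T> {
    T create();
  }

  private final ConcurrentLinkedQueue<T> free = new ConcurrentLinkedQueue<T>();
  private final Factory<T> factory;

  public OptimisticPool(Factory<T> factory) {
    this.factory = factory;
  }

  /** Reuse a pooled instance if one is available, otherwise allocate a new one. */
  public T borrow() {
    T t = free.poll();          // CAS-based dequeue
    return t != null ? t : factory.create();
  }

  /** Hand an instance back for later reuse. */
  public void release(T t) {
    free.offer(t);              // CAS-based enqueue
  }
}
{code}

The benchmark observation was simply that even this lock-free borrow/release round-trip 
did not beat a plain new while the heap still had headroom.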

Actually, in practice, for real-time applications the best approach I have found is to 
run Java processes at ~300M heap, with the rest of the system's memory dedicated to the 
I/O cache, which holds my memory-mapped structures (B-trees and such). The Java heap is 
only used to allocate a handful of long-lived, reused object trees that walk the 
memory-mapped structures. That combination seems to be unbeatable so far: the kernel 
manages the 'bulk' memory and you manage the TWAs. And even if the rest of the processing 
causes full GCs, they are unlikely to be catastrophic on a heap of that size. But that, 
of course, is not applicable to batch jobs.
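
For what it's worth, the 'small heap plus mapped files' pattern I mean is just plain 
java.nio: the data stays in the OS page cache rather than on the Java heap, and the only 
heap allocations are a few reusable cursors over the mapping. A minimal sketch (the file 
name and the fixed 16-byte record layout are made up for illustration):

{code:title=Walking a memory-mapped structure with a tiny heap footprint}
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWalk {

  public static void main(String[] args) throws Exception {
    RandomAccessFile raf = new RandomAccessFile("data.bin", "r");   // hypothetical file
    FileChannel channel = raf.getChannel();
    try {
      // The mapped region lives in the OS page cache, not on the Java heap.
      MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

      // Assume fixed 16-byte records: a long key followed by a long value (made-up layout).
      int records = buf.capacity() / 16;
      long sum = 0;
      for (int i = 0; i < records; i++) {
        long key = buf.getLong(i * 16);
        long value = buf.getLong(i * 16 + 8);
        sum += key ^ value;                                         // placeholder 'walk'
      }
      System.out.println("records=" + records + ", checksum=" + sum);
    } finally {
      channel.close();
      raf.close();
    }
  }
}
{code}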

In general, I would be very grateful if somebody could give me a hint on how to fight GC 
thrashing in near-full-RAM situations, but in any event I doubt that the answer would be 
to prefer code B over A.



> Add SequenceFileIterable; put Iterable stuff in one place
> ---------------------------------------------------------
>
>                 Key: MAHOUT-633
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-633
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering, Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: iterable, iterator, sequence-file
>             Fix For: 0.5
>
>         Attachments: MAHOUT-633.patch, MAHOUT-633.patch
>
>
> In another project I have a useful little class, SequenceFileIterable, which 
> simplifies iterating over a sequence file. It's like FileLineIterable. I'd 
> like to add it, then use it throughout the code. See patch, which for now 
> merely has the proposed new classes. 
> Well it also moves some other iterator-related classes that seemed to be 
> outside their rightful home in common.iterator.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
