[
https://issues.apache.org/jira/browse/MAHOUT-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011904#comment-13011904
]
Dmitriy Lyubimov commented on MAHOUT-633:
-----------------------------------------
{quote}
I think that creating a new Writable each time is probably a good thing on the
whole. I doubt seriously that it will be any slower as long as it avoids
unnecessary copying of a large structure. If you avoid large copies then new
allocation can actually be better than re-use since new allocation keeps the
garbage reclamation work in the newspace collector which is considerably better
than letting stuff get copied several times and possibly even tenured.{quote}
So we are clear, we are talking about:
{code:title=A}
Writable w = ReflectionUtils.newInstance(...);
for (int i = 0; i < 100000000; i++) {
  ... do something with w ...
}
{code}
vs.
{code:title=B}
for (int i = 0; i < 100000000; i++) {
  Writable w = ReflectionUtils.newInstance(...);
  ... do something with w ...
}
{code}
... and you are essentially saying that B is faster than A. I am not a GC expert, so I reserve the right to be wrong, but I am dubious about that because of the benchmarks and simulations I have run on Java.
1) It's one Writable (a handful of references at most) vs. a young allocation in every iteration. Sure, YGC and allocation are fast, but I am dubious they are faster than not doing anything at all. In fact, we want the reused instance to tenure and even go to the permanent pool so the YGC 'forgets' about it and does not check it any more; but even if that doesn't happen, we don't care much, since it is only one object for GC to track. The overhead, as far as I understand, in this case happens only during full GCs, which are rarer than YGCs.
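For what it's worth, the kind of A-vs-B micro-comparison I have in mind can be sketched as below. This is an illustrative sketch only, not one of the benchmarks referenced above: a plain mutable holder stands in for a Writable, and a naive nanoTime loop like this is easily distorted by JIT warm-up, escape analysis (which may eliminate B's allocations entirely), and GC scheduling. A harness like JMH would be needed for trustworthy numbers.

```java
/**
 * Rough timing sketch of pattern A (reuse one instance) vs. pattern B
 * (allocate a fresh instance per iteration). Illustrative only: the
 * Holder class stands in for a Writable, and escape analysis may
 * scalar-replace B's allocations, which is itself a reason such
 * microbenchmarks can mislead.
 */
public class AllocBench {
  static final class Holder { long v; }

  static long patternA(int n) {
    Holder w = new Holder();            // one instance, reused
    long acc = 0;
    for (int i = 0; i < n; i++) { w.v = i; acc += w.v; }
    return acc;
  }

  static long patternB(int n) {
    long acc = 0;
    for (int i = 0; i < n; i++) {
      Holder w = new Holder();          // fresh instance every iteration
      w.v = i;
      acc += w.v;
    }
    return acc;
  }

  public static void main(String[] args) {
    int n = 10_000_000;
    patternA(n); patternB(n);           // crude warm-up pass
    long t0 = System.nanoTime(); long a = patternA(n);
    long t1 = System.nanoTime(); long b = patternB(n);
    long t2 = System.nanoTime();
    System.out.printf("A=%dms B=%dms (checksums %d %d)%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
  }
}
```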
2) What I usually witness is that the dynamics change quite a bit when you approach the memory limit. Full GCs happen more frequently in that situation (perhaps you can fight that by shrinking the young generation, but I couldn't, and it is still a hassle to tune), and they are also quite long-running. In fact, this problem is so bad that the HBase folks told me they had cases where a full GC, even with 12G, caused pauses long enough to break a 40-second ZooKeeper session to a region node (causing node die-offs). So they in fact recommend longer ZK sessions for larger heaps just because of GC!
So either you don't approach the limit (i.e. don't use all the RAM you paid for), or you preallocate stuff and let it tenure without adding much new into the mix.
In practice, what I found is that allocating new objects that never leave the young generation is indeed better in my benchmarks than yanking them out of either a pessimistic or an optimistic object pool (with the optimistic pools, i.e. those built on compare-and-swap operations, surprisingly being only marginally better). But things start to change quite dramatically as soon as you try to fill, say, 5G out of 6 and factor in all the tenuring and full-GC overhead.
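For reference, by an "optimistic" pool I mean something along the lines of the following sketch (illustrative names, not from any Mahout or Hadoop API): a Treiber-stack-style free list where acquire and release race on compareAndSet instead of taking a lock.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Minimal sketch of a lock-free ("optimistic") object pool. Acquire pops
 * from, and release pushes onto, an atomic stack; contention is resolved
 * by retrying the CAS rather than by blocking. Illustrative only.
 */
final class CasPool<T> {
  private static final class Node<T> {
    final T value;
    Node<T> next;
    Node(T value) { this.value = value; }
  }

  private final AtomicReference<Node<T>> top = new AtomicReference<>();

  /** Pop a pooled instance, or return null so the caller allocates fresh. */
  T acquire() {
    Node<T> head;
    do {
      head = top.get();
      if (head == null) {
        return null;                         // pool miss
      }
    } while (!top.compareAndSet(head, head.next));
    return head.value;
  }

  /** Push an instance back for later reuse. */
  void release(T value) {
    Node<T> node = new Node<>(value);
    Node<T> head;
    do {
      head = top.get();
      node.next = head;
    } while (!top.compareAndSet(head, node));
  }
}
```

Note that even this "cheap" path allocates a Node per release, which is part of why, in my measurements, such pools were only marginally better than a pessimistic (locked) one.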
Actually, for real-time applications, the best practice I have found is to run Java processes at ~300M, with the rest of the system dedicated to the I/O cache, which holds my memory-mapped structures (B-trees and such). The Java heap is only used to allocate a handful of long-lived, reused object trees that walk the memory-mapped structures. That combination seems to be unbeatable so far: the kernel manages the 'bulk' memory and you manage the TWAs. And even if the rest of the processing causes full GCs, they are unlikely to be catastrophic on a heap of that size. But that, of course, is not applicable to batch jobs.
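A minimal sketch of that pattern is below. The record layout (two longs per record) and the method name are hypothetical; the point is that the read-only mapping is served by the kernel page cache rather than the Java heap, and the scan loop performs no per-record allocation.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Sketch: walk a memory-mapped file of fixed-size records with a tiny
 * Java heap. The data lives in the kernel page cache; the loop reads
 * primitives straight off the mapping, allocating nothing per record.
 * Record layout (key:long, value:long) is hypothetical.
 */
public class MmapWalk {
  static final int RECORD_SIZE = 16;   // one key long + one value long

  static long sumValues(Path path) throws IOException {
    try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      long sum = 0;
      for (int off = 0; off + RECORD_SIZE <= buf.limit(); off += RECORD_SIZE) {
        sum += buf.getLong(off + 8);   // value field of each record
      }
      return sum;
    }
  }
}
```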
In general, I would be very grateful if somebody could give me a hint on how to fight GC thrashing in near-full-RAM situations, but in any event I doubt the answer would be preferring code B over A.
> Add SequenceFileIterable; put Iterable stuff in one place
> ---------------------------------------------------------
>
> Key: MAHOUT-633
> URL: https://issues.apache.org/jira/browse/MAHOUT-633
> Project: Mahout
> Issue Type: Improvement
> Components: Classification, Clustering, Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Sean Owen
> Assignee: Sean Owen
> Priority: Minor
> Labels: iterable, iterator, sequence-file
> Fix For: 0.5
>
> Attachments: MAHOUT-633.patch, MAHOUT-633.patch
>
>
> In another project I have a useful little class, SequenceFileIterable, which
> simplifies iterating over a sequence file. It's like FileLineIterable. I'd
> like to add it, then use it throughout the code. See patch, which for now
> merely has the proposed new classes.
> Well it also moves some other iterator-related classes that seemed to be
> outside their rightful home in common.iterator.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira