In the thread below, gil@ makes the following statement:

> For most GC algorithms, or at least for the more costly parts of such
> algorithms, GC efficiency is roughly linear to EmptyHeap/LiveSet. ...
> This should [hopefully] make it obvious why using SoftReferences is a
> generally terrible idea.

I'm not following that conclusion. Is that because SoftReferenced objects
are still considered to be part of the LiveSet in the calculation above,
and that leads to increased GC cost?

As a follow up, what is the recommended way to cache objects? Currently
many places in our codebase use Guava's CacheBuilder with softValues
<https://google.github.io/guava/releases/snapshot/api/docs/com/google/common/cache/CacheBuilder.html#softValues-->.
Does it even make sense to use weakValues, or is the suggestion to just use
strong values, but try to restrict the number of entries?
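For concreteness, the two variants in question would look roughly like this (a sketch against Guava's CacheBuilder; the key/value types and the 10_000 bound are arbitrary placeholders):

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    class CacheSketch {
        // What we do today: values are soft-referenced and only reclaimed
        // when the collector is already under memory pressure.
        static final Cache<String, byte[]> SOFT_CACHE = CacheBuilder.newBuilder()
                .softValues()
                .build();

        // The alternative being asked about: strongly referenced values with
        // an explicit entry bound (10_000 is a placeholder, not a recommendation).
        static final Cache<String, byte[]> BOUNDED_CACHE = CacheBuilder.newBuilder()
                .maximumSize(10_000)
                .build();
    }

(If entry sizes vary a lot, maximumWeight plus a weigher would be the size-bounded analogue.)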

---------- Forwarded message ---------
From: Gil Tene <[email protected]>
Date: Sat, Nov 10, 2018 at 8:51 AM
Subject: Re: Sorting a very large number of objects
To: mechanical-sympathy <[email protected]>




On Friday, November 9, 2018 at 7:08:23 AM UTC-8, Shevek wrote:
>
> Hi,
>
> I'm trying to sort/merge a very large number of objects in Java, and
> failing more spectacularly than normal. The way I'm doing it is this:
>
> * Read a bunch of objects into an array.
> * Sort the array, then merge neighbouring objects as appropriate.
> * Re-fill the array, re-sort, re-merge until compaction is "not very
> successful".
> * Dump the array to file, repeat for next array.
> * Then stream all files through a final merge/combine phase.
>
> This is failing largely because I have no idea how large to make the
> array. Estimating the ongoing size using something like JAMM is too
> slow, and my hand-rolled memory estimator is too unreliable.
>
> The thing that seems to be working best is messing around with the array
> size in order to keep some concept of runtime.maxMemory() -
> runtime.totalMemory() + runtime.freeMemory() within a useful bound.
>
> But there must be a better solution. I can't quite think of a way around
> this with SoftReference, because I need to dump the data to disk when the
> reference gets broken, and that's defeating me right now.
>
> Other alternatives would include keeping all my in-memory data
> structures in serialized form, and paying the ser/deser cost to compare,
> but that's expensive - my main overhead right now is gc. Serialization
> is protobuf, although that's changeable, since it's annoying the hell
> out of me (please don't say thrift - but protobuf appears to have no way
> to read from a stream into a reusable object - it has to allocate the
> world every single time).
>

In general, whenever I see "my overhead is gc" and "unknown memory size"
together, I see it as a sign of someone pushing heap utilization high and
getting into the inefficient GC state. Simplistically, you should be able
to drop the GC cost to an arbitrary % of overall computation cost by
increasing the amount (or relative portion) of empty heap in your setup. So
GC should never be "a bottleneck" from a throughput point of view unless
you have constraints (such as a minimum required live set and a maximum
possible heap size) that force you towards a high utilization of the heap
(in terms of LiveSet/HeapSize). The answer to such a situation is generally
"get some more RAM for this problem" rather than put in tons of work to fit
this in".

For most GC algorithms, or at least for the more costly parts of such
algorithms, GC efficiency is roughly linear to EmptyHeap/LiveSet. Stated
otherwise, GC cost grows with LiveSet/EmptyHeap or
LiveSet/(HeapSize-LiveSet). As you grow the amount you try to cram into a
heap of a given size, you increase the GC cost to the square of your
cramming efforts. And for every doubling of the empty heap [for a given
live set] you will generally halve the GC cost.

This should [hopefully] make it obvious why using SoftReferences is a
generally terrible idea.
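
As a rough back-of-the-envelope illustration of that relationship (the numbers here are made up purely for illustration):

    class GcCostSketch {
        // Relative GC cost modeled as LiveSet / (HeapSize - LiveSet), per the above.
        static double relativeGcCost(double liveSetGb, double heapSizeGb) {
            return liveSetGb / (heapSizeGb - liveSetGb);
        }

        public static void main(String[] args) {
            System.out.println(relativeGcCost(10, 12)); // 10 GB live in a 12 GB heap: 2 GB empty -> 5.0
            System.out.println(relativeGcCost(10, 14)); // empty heap doubled to 4 GB -> 2.5 (halved)
            System.out.println(relativeGcCost(11, 12)); // live set grown to 11 GB: 1 GB empty -> 11.0
        }
    }

Soft-referenced objects stay in the live set until the collector is already under memory pressure, so a soft-value cache tends to expand LiveSet until EmptyHeap is nearly gone, which is exactly the expensive regime described above.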


>
> Issues:
> * This routine is not the sole tenant of the JVM. Other things use RAM.
>

You can try to establish what an "efficient enough" heap utilization level
is for your use case (a level that keeps overall GC work, as a % of CPU
spend, below e.g. 10%), and keep your heap use to a related fraction of
whatever heap size you get to have on the system you land on.
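
A minimal sketch of what that could look like for the sorting use case (the fraction and the per-entry byte estimate are placeholders that would have to be measured for the real workload):

    class BatchSizer {
        // Placeholder tuning knobs, not recommendations.
        static final double TARGET_HEAP_FRACTION = 0.5;
        static final long ESTIMATED_BYTES_PER_ENTRY = 512;

        // Size the in-memory batch from a fixed fraction of the maximum heap,
        // so the live set stays well below HeapSize regardless of the machine.
        static int batchSize() {
            long heapBudget = (long) (Runtime.getRuntime().maxMemory() * TARGET_HEAP_FRACTION);
            return (int) Math.min(Integer.MAX_VALUE, heapBudget / ESTIMATED_BYTES_PER_ENTRY);
        }
    }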


> * This has to be deployed and work on systems whose memory config is
> unknown to me.
>
> Can anybody please give me pointers?
>
> S.
>
