I'm using the serialized approach, combined with an ergonomic approach to
heap management. protobuf really sucks for a lot of reasons, which would be
easily remediable if one were a protobuf maintainer, but I'm not and I don't
care to be, and I don't have the time or inclination to write a new serializer.
However, and to Gil's point: following the ergonomics approach, when I
resize an array, I drop a garbage-collectable array which I no longer
"own", so from the PoV of the memory-management code it forms part of
the "other" part of the heap, i.e. I have to subtract it from the total
memory before computing the proportion of the heap I'm allowed to use.
Over time, given that GC isn't timely, I can lose all my heap to these
collectable arrays, and if I follow the rule about "never call
System.gc()", this actually isn't a very efficient approach either.
The next best approach is to allocate all my memory in 1 MB blocks and
make each memory "region" be backed by a list of these blocks, so I never
actually hand them to the GC; I just reduce my overall memory usage by
dropping blocks, rather than wholly resizing byte arrays.
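That block-backed region idea could be sketched roughly as below (all names
here are hypothetical, not from any actual implementation): a logical byte
region backed by a list of fixed-size blocks, where shrinking releases whole
tail blocks, so the only garbage ever created is a handful of 1 MiB arrays
rather than one huge resized array.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names): a byte "region" backed by fixed-size blocks.
// Shrinking drops whole blocks from the tail instead of reallocating and
// copying one large array, so no large garbage array is handed to the GC.
public class BlockRegion {
    static final int BLOCK_SIZE = 1 << 20; // 1 MiB blocks

    private final List<byte[]> blocks = new ArrayList<>();

    // Grow by appending blocks until the region covers `bytes`.
    public void ensureCapacity(long bytes) {
        while ((long) blocks.size() * BLOCK_SIZE < bytes)
            blocks.add(new byte[BLOCK_SIZE]);
    }

    public byte get(long index) {
        return blocks.get((int) (index / BLOCK_SIZE))[(int) (index % BLOCK_SIZE)];
    }

    public void put(long index, byte value) {
        blocks.get((int) (index / BLOCK_SIZE))[(int) (index % BLOCK_SIZE)] = value;
    }

    // Shrink by releasing whole tail blocks; only 1 MiB arrays become garbage.
    public void truncate(long bytes) {
        int needed = (int) ((bytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
        while (blocks.size() > needed)
            blocks.remove(blocks.size() - 1);
    }

    public long capacity() {
        return (long) blocks.size() * BLOCK_SIZE;
    }
}
```

The point of the design is that the blocks themselves are never resized, so
the GC only ever sees small, uniformly-sized garbage when the region shrinks.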
My main disbelief is that I'm inventing all of this from scratch. It HAS
to have been done before, right? All I want to do is sort a load of
objects...!
S.
On 11/10/18 8:51 AM, Gil Tene wrote:
On Friday, November 9, 2018 at 7:08:23 AM UTC-8, Shevek wrote:
Hi,
I'm trying to sort/merge a very large number of objects in Java, and
failing more spectacularly than normal. The way I'm doing it is this:
* Read a bunch of objects into an array.
* Sort the array, then merge neighbouring objects as appropriate.
* Re-fill the array, re-sort, re-merge until compaction is "not very
successful".
* Dump the array to file, repeat for next array.
* Then stream all files through a final merge/combine phase.
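The pipeline described in the bullets above, minus the adaptive array sizing,
could be sketched like this (hypothetical names; records here are (key, count)
pairs, the in-memory runs stand in for the on-disk files, and "merging
neighbours" means summing counts of equal keys):

```java
import java.util.*;

// Sketch: chunked sort with neighbour-merging, then a k-way merge of the
// sorted runs (the runs stand in for the dumped files in the scheme above).
// Records are (key, count) pairs stored as long[2].
public class ChunkSortMerge {

    // Sort one chunk and combine adjacent records with equal keys.
    static List<long[]> sortAndCombine(List<long[]> chunk) {
        chunk.sort(Comparator.comparingLong(r -> r[0]));
        List<long[]> out = new ArrayList<>();
        for (long[] r : chunk) {
            long[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last[0] == r[0]) last[1] += r[1];
            else out.add(new long[]{r[0], r[1]});
        }
        return out;
    }

    // Final phase: k-way merge of sorted runs via a min-heap of (run, index)
    // cursors, combining equal keys on the fly.
    static List<long[]> mergeRuns(List<List<long[]>> runs) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingLong(e -> runs.get(e[0]).get(e[1])[0]));
        for (int i = 0; i < runs.size(); i++)
            if (!runs.get(i).isEmpty()) heap.add(new int[]{i, 0});
        List<long[]> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            long[] r = runs.get(e[0]).get(e[1]);
            long[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last[0] == r[0]) last[1] += r[1];
            else out.add(new long[]{r[0], r[1]});
            if (e[1] + 1 < runs.get(e[0]).size())
                heap.add(new int[]{e[0], e[1] + 1});
        }
        return out;
    }
}
```

The "not very successful" compaction test and the disk I/O are the parts
deliberately left out; the streaming merge phase is the same k-way merge with
file-backed cursors instead of list indices.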
This is failing largely because I have no idea how large to make the
array. Estimating the ongoing size using something like JAMM is too
slow, and my hand-rolled memory estimator is too unreliable.
The thing that seems to be working best is messing around with the array
size in order to keep some concept of runtime.maxMemory() -
runtime.totalMemory() + runtime.freeMemory() within a useful bound.
But there must be a better solution. I can't quite think of a way around
this with SoftReference, because I need to dump the data to disk when the
reference gets broken, and that's defeating me right now.
Other alternatives would include keeping all my in-memory data
structures in serialized form and paying the ser/deser cost to compare,
but that's expensive - my main overhead right now is GC. Serialization
is protobuf, although that's changeable, since it's annoying the hell
out of me (please don't say Thrift - but protobuf appears to have no way
to read from a stream into a reusable object; it has to allocate the
world every single time).
In general, whenever I see "my overhead is gc" and "unknown memory size"
together, I see it as a sign of someone pushing heap utilization high
and getting into the inefficient GC state. Simplistically, you should be
able to drop the GC cost to an arbitrary % of overall computation cost
by increasing the amount (or relative portion) of empty heap in your
setup. So GC should never be "a bottleneck" from a throughput point of
view unless you have constraints (such as a minimum required live set
and a maximum possible heap size) that force you towards a high
utilization of the heap (in terms of LiveSet/HeapSize). The answer to
such a situation is generally "get some more RAM for this problem"
rather than "put in tons of work to fit this in".
For most GC algorithms, or at least for the more costly parts of such
algorithms, GC efficiency is roughly proportional to EmptyHeap/LiveSet.
Stated otherwise, GC cost grows with LiveSet/EmptyHeap, i.e.
LiveSet/(HeapSize - LiveSet). As you grow the amount you try to cram into
a heap of a given size, you increase the GC cost with the square of your
cramming efforts. And for every doubling of the empty heap [for a given
live set] you will generally halve the GC cost.
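That rule of thumb can be written down as a one-line model (the class name
and units are mine; the constant factor is unspecified, so only ratios
between configurations mean anything):

```java
// Sketch: Gil's rule of thumb as arithmetic. Relative GC cost is modelled
// as liveSet / (heapSize - liveSet); absolute values are meaningless, only
// ratios between two configurations are.
public class GcCostModel {
    public static double relativeCost(double liveSetGb, double heapGb) {
        return liveSetGb / (heapGb - liveSetGb);
    }
}
```

For example, with a 4 GB live set, going from an 8 GB heap (4 GB empty) to a
12 GB heap (8 GB empty) doubles the empty heap and, under this model, halves
the relative GC cost.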
This should [hopefully] make it obvious why using SoftReferences is a
generally terrible idea.
Issues:
* This routine is not the sole tenant of the JVM. Other things use RAM.
You can try to establish what an "efficient enough" heap utilization
level is for your use case (a level that keeps overall GC work as a % of
CPU spend to e.g. below 10%), and keep your heap use to a related
fraction of whatever heap size you get to have on the system you land on.
* This has to be deployed and work on systems whose memory config is
unknown to me.
Can anybody please give me pointers?
S.
--
You received this message because you are subscribed to the Google
Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
For more options, visit https://groups.google.com/d/optout.