Hi

That's a nice problem :)

I would probably use a serialised form that allow accessing the data
without deserialising. Like SBE. This way there is no extra allocation,
just the byte stream.

Then I would maybe try to divide and conquer: shard objecs in buckets.
Example In case of strings, a bucket for starting with A, a bucket for
starting with B, etc...
Then you can sort inside the buckets (in parallel?) And finish by
concatenate the buckets.

My 2 cents
Good luck
Georges



On Fri, Nov 9, 2018, 16:08 Shevek <[email protected] wrote:

> Hi,
>
> I'm trying to sort/merge a very large number of objects in Java, and
> failing more spectacularly than normal. The way I'm doing it is this:
>
> * Read a bunch of objects into an array.
> * Sort the array, then merge neighbouring objects as appropriate.
> * Re-fill the array, re-sort, re-merge until compaction is "not very
> successful".
> * Dump the array to file, repeat for next array.
> * Then stream all files through a final merge/combine phase.
>
> This is failing largely because I have no idea how large to make the
> array. Estimating the ongoing size using something like JAMM is too
> slow, and my hand-rolled memory estimator is too unreliable.
>
> The thing that seems to be working best is messing around with the array
> size in order to keep some concept of runtime.maxMemory() -
> runtime.totalMemory() + runtime.freeMemory() within a useful bound.
>
> But there must be a better solution. I can't quite think a way around
> this with SoftReference because I need to dump the data to disk when the
> reference gets broken, and defeating me right now.
>
> Other alternatives would include keeping all my in-memory data
> structures in serialized form, and paying the ser/deser cost to compare,
> but that's expensive - my main overhead right now is gc. Serialization
> is protobuf, although that's changeable, since it's annoying the hell
> out of me (please don't say thrift - but protobuf appears to have no way
> to read from a stream into a reusable object - it has to allocate the
> world every single time).
>
> Issues:
> * This routine is not the sole tenant of the JVM. Other things use RAM.
> * This has to be deployed and work on systems whose memory config is
> unknown to me.
>
> Can anybody please give me pointers?
>
> S.
>
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to