Hi That's a nice problem :)
I would probably use a serialised form that allow accessing the data without deserialising. Like SBE. This way there is no extra allocation, just the byte stream. Then I would maybe try to divide and conquer: shard objecs in buckets. Example In case of strings, a bucket for starting with A, a bucket for starting with B, etc... Then you can sort inside the buckets (in parallel?) And finish by concatenate the buckets. My 2 cents Good luck Georges On Fri, Nov 9, 2018, 16:08 Shevek <[email protected] wrote: > Hi, > > I'm trying to sort/merge a very large number of objects in Java, and > failing more spectacularly than normal. The way I'm doing it is this: > > * Read a bunch of objects into an array. > * Sort the array, then merge neighbouring objects as appropriate. > * Re-fill the array, re-sort, re-merge until compaction is "not very > successful". > * Dump the array to file, repeat for next array. > * Then stream all files through a final merge/combine phase. > > This is failing largely because I have no idea how large to make the > array. Estimating the ongoing size using something like JAMM is too > slow, and my hand-rolled memory estimator is too unreliable. > > The thing that seems to be working best is messing around with the array > size in order to keep some concept of runtime.maxMemory() - > runtime.totalMemory() + runtime.freeMemory() within a useful bound. > > But there must be a better solution. I can't quite think a way around > this with SoftReference because I need to dump the data to disk when the > reference gets broken, and defeating me right now. > > Other alternatives would include keeping all my in-memory data > structures in serialized form, and paying the ser/deser cost to compare, > but that's expensive - my main overhead right now is gc. Serialization > is protobuf, although that's changeable, since it's annoying the hell > out of me (please don't say thrift - but protobuf appears to have no way > to read from a stream into a reusable object - it has to allocate the > world every single time). > > Issues: > * This routine is not the sole tenant of the JVM. Other things use RAM. > * This has to be deployed and work on systems whose memory config is > unknown to me. > > Can anybody please give me pointers? > > S. > > -- > You received this message because you are subscribed to the Google Groups > "mechanical-sympathy" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
