We are doing sorting by proxy. Right now I have a byte[] serialized as:
[sort-key0, data0, sort-key1, data1, ...]
and a comparator which can compare either key-bytes or values-by-key. We
then sort a separate integer array which points to each sort-key's
offset in the underlying array, then we emit by walking the integer array.
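
The proxy sort described above could be sketched roughly as follows. The record framing here ([int keyLen][key][int dataLen][data]) is an assumption made for the example, since the actual serialized layout is not given, and the class and method names are hypothetical. It uses Integer[] so that Arrays.sort can take a comparator; a primitive int[] with a hand-written sort would avoid the boxing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ProxySort {
    // Assumed framing (not from the original post):
    // [int keyLen][key bytes][int dataLen][data bytes], repeated.
    static byte[] encode(String[][] records) {
        int size = 0;
        for (String[] r : records) {
            size += 8 + r[0].getBytes(StandardCharsets.UTF_8).length
                      + r[1].getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (String[] r : records) {
            byte[] k = r[0].getBytes(StandardCharsets.UTF_8);
            byte[] d = r[1].getBytes(StandardCharsets.UTF_8);
            buf.putInt(k.length).put(k).putInt(d.length).put(d);
        }
        return buf.array();
    }

    /** Scans the block once and records where each record (its key-length field) starts. */
    static Integer[] offsets(byte[] block) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        while (pos < block.length) {
            out.add(pos);
            int keyLen = buf.getInt(pos);
            int dataLen = buf.getInt(pos + 4 + keyLen);
            pos += 4 + keyLen + 4 + dataLen;
        }
        return out.toArray(new Integer[0]);
    }

    /** Sorts the offsets by comparing key bytes in place; the block itself is never moved. */
    static void sortOffsets(byte[] block, Integer[] offsets) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        Arrays.sort(offsets, (a, b) -> {
            int aLen = buf.getInt(a);
            int bLen = buf.getInt(b);
            // Unsigned lexicographic comparison of the two key ranges (Java 9+).
            return Arrays.compareUnsigned(block, a + 4, a + 4 + aLen,
                                          block, b + 4, b + 4 + bLen);
        });
    }

    /** Decodes the key at a record offset, for inspection only. */
    static String keyAt(byte[] block, int offset) {
        int len = ByteBuffer.wrap(block).getInt(offset);
        return new String(block, offset + 4, len, StandardCharsets.UTF_8);
    }
}
```

Emitting in order is then a walk over the sorted offsets, reading each record straight out of the block.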
The challenge is to merge duplicate neighbouring objects by key, then
emit in order, so any proxy method which requires me to do an
index-lookup for an object on disk will die horribly in seek. So we are
sorting each block in a (now fairly appropriately-sized) byte array,
dumping that to file, then stream-merging files.
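
As an illustration of the stream-merge phase, here is a minimal k-way merge sketch. In-memory iterators stand in for the sorted files, String keys and Long values stand in for the real records, and summing values is a placeholder for whatever the real merge-by-key does; all of those are assumptions for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class StreamMerge {
    /** One sorted run plus the record currently at its head. */
    private static final class Cursor {
        final Iterator<Map.Entry<String, Long>> run;
        Map.Entry<String, Long> head;

        Cursor(Iterator<Map.Entry<String, Long>> run) {
            this.run = run;
            this.head = run.next();
        }

        boolean advance() {
            if (!run.hasNext()) return false;
            head = run.next();
            return true;
        }
    }

    /** Merges sorted runs, combining records that share a key (here: summing values). */
    static List<Map.Entry<String, Long>> merge(List<Iterator<Map.Entry<String, Long>>> runs) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head.getKey().compareTo(b.head.getKey()));
        for (Iterator<Map.Entry<String, Long>> run : runs) {
            if (run.hasNext()) heap.add(new Cursor(run));
        }
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        long currentValue = 0;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            if (currentKey != null && currentKey.equals(c.head.getKey())) {
                currentValue += c.head.getValue();          // duplicate key: combine
            } else {
                if (currentKey != null) out.add(new SimpleEntry<>(currentKey, currentValue));
                currentKey = c.head.getKey();
                currentValue = c.head.getValue();
            }
            if (c.advance()) heap.add(c);                   // re-queue the run if it has more
        }
        if (currentKey != null) out.add(new SimpleEntry<>(currentKey, currentValue));
        return out;
    }
}
```

Each run is only ever read forwards, so on disk this is sequential I/O per file with no seeks. A real version would write each combined record to the output stream instead of collecting a list.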
We still need another two orders of magnitude of performance, but at
least we're back to the profiler for a new hypothesis now.
I think to improve from here, we will need to (a) use EWMA to compute
appropriate buffer sizes per buffer, since throughput is not uniform,
and (b) use an interface to front a set of 4MB-sized byte[] arrays
rather than using a single 1GB array, so that (b1) we can exceed 2GB,
and (b2) we can allocate and free at a finer granularity.
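
Both ideas can be sketched briefly. The names Ewma and PagedBytes are hypothetical, and the smoothing factor and page size are left as parameters rather than tuned values; this is only one possible shape for such an interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Exponentially weighted moving average: value = alpha * sample + (1 - alpha) * value. */
final class Ewma {
    private final double alpha;
    private double value;
    private boolean seeded;

    Ewma(double alpha) {
        this.alpha = alpha;
    }

    double update(double sample) {
        // The first sample seeds the average; later samples are blended in.
        value = seeded ? alpha * sample + (1 - alpha) * value : sample;
        seeded = true;
        return value;
    }
}

/** Append-only byte store backed by fixed-size pages, addressed with a long index. */
final class PagedBytes {
    private final int pageSize;
    private final List<byte[]> pages = new ArrayList<>();
    private long length;

    PagedBytes(int pageSize) {
        this.pageSize = pageSize;
    }

    void append(byte[] src, int off, int len) {
        while (len > 0) {
            int page = (int) (length / pageSize);
            int inPage = (int) (length % pageSize);
            if (page == pages.size()) pages.add(new byte[pageSize]); // allocate lazily
            int n = Math.min(len, pageSize - inPage);
            System.arraycopy(src, off, pages.get(page), inPage, n);
            off += n;
            len -= n;
            length += n;
        }
    }

    byte get(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        return pages.get((int) (index / pageSize))[(int) (index % pageSize)];
    }

    long length() {
        return length;
    }

    /** Drops all pages so they can be collected individually. */
    void clear() {
        pages.clear();
        length = 0;
    }
}
```

With 4MB pages this would be constructed as new PagedBytes(4 << 20). Because the index is a long, the total size is not capped by the maximum length of a single Java array, and the EWMA of recent throughput could feed the decision about how many pages each buffer is allowed to hold.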
The garbage collector is no longer a significant participant in our
computation, but protobuf still has disappointingly bad mechanical
sympathy for stream processing, and the cost of <init> of protobuf
objects is currently an unavoidably large percentage of runtime.
S.
On 11/12/18 6:24 AM, Mindaugas Žakšauskas wrote:
Hi,
Do you require the entire object to be loaded into memory in order to
compare it with another object? Do these objects have IDs, and can they
be accessed by ID quickly after sorting? If so, you could derive a
lightweight proxy containing only a few attributes of each object and
work with those, reducing the amount of heap needed. After the
lightweights are sorted, you would know the order number of each one,
and in turn, its parent.
If you can't extract a lightweight attribute subset, perhaps you can
come up with some sort of universal object score for each object and
work with that?
m.
On Friday, 9 November 2018 15:08:23 UTC, Shevek wrote:
Hi,
I'm trying to sort/merge a very large number of objects in Java, and
failing more spectacularly than normal. The way I'm doing it is this:
* Read a bunch of objects into an array.
* Sort the array, then merge neighbouring objects as appropriate.
* Re-fill the array, re-sort, re-merge until compaction is "not very
successful".
* Dump the array to file, repeat for next array.
* Then stream all files through a final merge/combine phase.
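
The sort-and-merge-neighbours step in the list above might look something like this sketch. Rec is a hypothetical stand-in for the real objects, and adding counts is a placeholder for the real merge rule.

```java
import java.util.Arrays;
import java.util.Comparator;

final class SortAndCompact {
    /** Hypothetical record: a key plus a mergeable payload. */
    static final class Rec {
        final String key;
        long count;

        Rec(String key, long count) {
            this.key = key;
            this.count = count;
        }
    }

    /**
     * Sorts buf[0..n) by key, folds equal-key neighbours into one record,
     * nulls out the freed slots, and returns the number of live records.
     */
    static int sortAndCompact(Rec[] buf, int n) {
        if (n == 0) return 0;
        Arrays.sort(buf, 0, n, Comparator.comparing((Rec r) -> r.key));
        int w = 0;                                  // index of the last kept record
        for (int r = 1; r < n; r++) {
            if (buf[r].key.equals(buf[w].key)) {
                buf[w].count += buf[r].count;       // merge into the kept neighbour
            } else {
                buf[++w] = buf[r];
            }
        }
        Arrays.fill(buf, w + 1, n, null);           // free slots for the next refill
        return w + 1;
    }
}
```

The ratio of the returned count to n gives a measure of how successful the compaction was, which is the signal for deciding whether to refill or dump to file.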
This is failing largely because I have no idea how large to make the
array. Estimating the ongoing size using something like JAMM is too
slow, and my hand-rolled memory estimator is too unreliable.
The thing that seems to be working best is messing around with the array
size in order to keep some concept of runtime.maxMemory() -
runtime.totalMemory() + runtime.freeMemory() within a useful bound.
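
For reference, that headroom expression in code, with a hypothetical flush policy on top of it. The threshold fraction is an assumption, and freeMemory() is only approximate, since uncollected garbage still counts as used.

```java
final class HeapHeadroom {
    /** Bytes the heap can still supply: unreserved (max - total) plus free within total. */
    static long availableBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.maxMemory() - rt.totalMemory() + rt.freeMemory();
    }

    /** Hypothetical policy: flush once headroom drops below a fraction of the max heap. */
    static boolean shouldFlush(long available, long max, double minFreeFraction) {
        return available < (long) (max * minFreeFraction);
    }
}
```

A caller would check shouldFlush(HeapHeadroom.availableBytes(), Runtime.getRuntime().maxMemory(), 0.25) between batches, where 0.25 is an arbitrary example value.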
But there must be a better solution. I can't quite think of a way around
this with SoftReference because I need to dump the data to disk when the
reference gets broken, and that is defeating me right now.
Other alternatives would include keeping all my in-memory data
structures in serialized form, and paying the ser/deser cost to compare,
but that's expensive - my main overhead right now is gc. Serialization
is protobuf, although that's changeable, since it's annoying the hell
out of me (please don't say thrift - but protobuf appears to have no way
to read from a stream into a reusable object - it has to allocate the
world every single time).
Issues:
* This routine is not the sole tenant of the JVM. Other things use RAM.
* This has to be deployed and work on systems whose memory config is
unknown to me.
Can anybody please give me pointers?
S.
--
You received this message because you are subscribed to the Google
Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected]
<mailto:[email protected]>.
For more options, visit https://groups.google.com/d/optout.