Re: Sorting a very large number of objects

Shevek Wed, 14 Nov 2018 23:19:23 -0800

It's a batch job. We are taking log files from N days, merging recordsby name, and building a sqlite database out of them. So the time tofirst record isn't important at all, it's entirely about time tocompletion for the whole dataset. What we have is effectively an N-waymap-reduce in Java.

We need another two orders of magnitude in performance, but I can see aset of paths which ought to get us at least one.

There is also the option to swap out sqlite for C* or scylla orsomething, if we think that the distributed update will be better than alocal sort-merge, but this isn't currently clear, and it isn't going tohappen this cycle.


S.

On 11/14/18 3:51 PM, Michael Meehan wrote:

Is the thing that's consuming your list a stream-consumer? Does theconsumer for the next operation need the whole sorted list, or just thefirst element of the sorted list (the largest/smallest)? If both ofthose are true, it's most important that you deliver the first elementquickly, and achieve reasonable throughput (not slower than the nextconsumer in the stream can process) after. In that case, you don't wantto take a lot of time to generate a fully sorted list, you want to takea very small time to find the first element to pass on!

On Mon, Nov 12, 2018 at 4:57 PM Shevek <[email protected]<mailto:[email protected]>> wrote:


    We are doing sorting by proxy. Right now I have a byte[] serialized as:

    [sort-key0, data0, sort-key1, data1, ...]

    and a comparator which can compare either key-bytes or
    values-by-key. We
    then sort a separate integer array which points to each sort-key's
    offset in the underlying array, then we emit by walking the integer
    array.

    The challenge is to merge duplicate neighbouring objects by key, then
    emit in order, so any proxy method which requires me to do an
    index-lookup for an object on disk will die horribly in seek. So we are
    sorting each block in a (now fairly appropriately-sized) byte array,
    dumping that to file, then stream-merging files.

    We still need another two orders of magnitude of performance, but at
    least we're back to the profiler for a new hypothesis now.

    I think to improve from here, we will need to (a) use EWMA to compute
    appropriate buffer sizes per buffer, since throughput is not uniform,
    and (b) use an interface to front a set of 4Mb-sized byte[] arrays
    rather than using a single 1Gb array, so that (b1) we can exceed 2Gb,
    and (b2) we can allocate and free at a finer granularity.

    The garbage collector is no longer a significant participant in our
    computation, but protobuf still has disappointingly bad mechanical
    sympathy for stream processing, and the cost of <init> of protobuf
    objects is currently an unavoidably large percentage of runtime.

    S.

    On 11/12/18 6:24 AM, Mindaugas Žakšauskas wrote:
     > Hi,
     >
     > Do you require the entire object to be loaded into memory in
    order to
     > compare it with another object? Do these objects have IDs and
    could be
     > accessed by IDs quickly after sorting? If so, you could derive a
     > lightweight proxy only containing few attributes of such object
    and work
     > with those, reducing the amount of heap needed. After the
    lightweights
     > are sorted, you would know the order number of each one, and in
    turn,
     > its parent.
     >
     > If you can't extract a lightweight attribute subset, perhaps you can
     > come up with some sort of universal object score for each object and
     > work with that?
     >
     > m.
     >
     >
     > On Friday, 9 November 2018 15:08:23 UTC, Shevek wrote:
     >
     >     Hi,
     >
     >     I'm trying to sort/merge a very large number of objects in
    Java, and
     >     failing more spectacularly than normal. The way I'm doing it
    is this:
     >
     >     * Read a bunch of objects into an array.
     >     * Sort the array, then merge neighbouring objects as appropriate.
     >     * Re-fill the array, re-sort, re-merge until compaction is
    "not very
     >     successful".
     >     * Dump the array to file, repeat for next array.
     >     * Then stream all files through a final merge/combine phase.
     >
     >     This is failing largely because I have no idea how large to
    make the
     >     array. Estimating the ongoing size using something like JAMM
    is too
     >     slow, and my hand-rolled memory estimator is too unreliable.
     >
     >     The thing that seems to be working best is messing around
    with the
     >     array
     >     size in order to keep some concept of runtime.maxMemory() -
     >     runtime.totalMemory() + runtime.freeMemory() within a useful
    bound.
     >
     >     But there must be a better solution. I can't quite think a
    way around
     >     this with SoftReference because I need to dump the data to
    disk when
     >     the
     >     reference gets broken, and defeating me right now.
     >
     >     Other alternatives would include keeping all my in-memory data
     >     structures in serialized form, and paying the ser/deser cost to
     >     compare,
     >     but that's expensive - my main overhead right now is gc.
    Serialization
     >     is protobuf, although that's changeable, since it's annoying
    the hell
     >     out of me (please don't say thrift - but protobuf appears to
    have no
     >     way
     >     to read from a stream into a reusable object - it has to
    allocate the
     >     world every single time).
     >
     >     Issues:
     >     * This routine is not the sole tenant of the JVM. Other
    things use RAM.
     >     * This has to be deployed and work on systems whose memory
    config is
     >     unknown to me.
     >
     >     Can anybody please give me pointers?
     >
     >     S.
     >
     > --
     > You received this message because you are subscribed to the Google
     > Groups "mechanical-sympathy" group.
     > To unsubscribe from this group and stop receiving emails from it,
    send
     > an email to [email protected]
    <mailto:mechanical-sympathy%[email protected]>
     > <mailto:[email protected]
    <mailto:mechanical-sympathy%[email protected]>>.
     > For more options, visit https://groups.google.com/d/optout.

--You received this message because you are subscribed to the Google

    Groups "mechanical-sympathy" group.
    To unsubscribe from this group and stop receiving emails from it,
    send an email to [email protected]
    <mailto:mechanical-sympathy%[email protected]>.
    For more options, visit https://groups.google.com/d/optout.

--

You received this message because you are subscribed to the GoogleGroups "mechanical-sympathy" group.To unsubscribe from this group and stop receiving emails from it, sendan email to [email protected]<mailto:[email protected]>.

For more options, visit https://groups.google.com/d/optout.


--
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: Sorting a very large number of objects

Reply via email to