This project is very much in progress.

We need to sort about 1e13 records (several terabytes when compressed) with a sort-merge, and end up with about 1e10 records in sqlite. Right now we are running sqlite with 1e9 objects, and it isn't an issue. sqlite is much better than one would naively believe it to be, if used appropriately. Oddly enough, its VM is also several times faster than PostgreSQL's for IO-free raw mathematical computation.
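For anyone wondering what "appropriately" means here: roughly, explicit transactions, prepared statements, and WAL mode rather than autocommitted row-at-a-time inserts. A minimal sketch of the bulk-load shape, assuming the xerial sqlite-jdbc driver and a toy rec(k, v) table (N and payloadFor() are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:store.db")) {
        try (Statement s = conn.createStatement()) {
            s.execute("PRAGMA journal_mode=WAL");    // write-ahead log beats the rollback journal for bulk loads
            s.execute("PRAGMA synchronous=NORMAL");  // fewer fsyncs; fine for a rebuildable store
        }
        conn.setAutoCommit(false);                   // commit per batch, not per row
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO rec(k, v) VALUES(?, ?)")) {
            for (long i = 0; i < N; i++) {
                ps.setLong(1, i);
                ps.setBytes(2, payloadFor(i));       // placeholder payload source
                ps.addBatch();
                if (i % 10_000 == 0) { ps.executeBatch(); conn.commit(); }
            }
            ps.executeBatch();
            conn.commit();
        }
    }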

Our current bottleneck is the serialization and allocation overhead of protobuf. Many of the serializers recommended on this list can only handle fixed-size structures, but we're working on a flatbuf implementation right now. Thank you, Georges. flatbuf will also let us avoid keeping a separate serialized copy of the sort key.
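To make the sort-key point concrete: with flatbuf the comparator can read the key straight out of the serialized bytes, so the sort never materializes a parsed record or a second copy of the key. A sketch, assuming a hypothetical flatc-generated class Rec for "table Rec { key: long; payload: [ubyte]; }":

    import java.nio.ByteBuffer;
    import java.util.Comparator;

    final class RecComparator implements Comparator<ByteBuffer> {
        private final Rec left = new Rec();   // reused table accessors: no allocation per compare
        private final Rec right = new Rec();

        @Override
        public int compare(ByteBuffer a, ByteBuffer b) {
            // getRootAsRec just attaches the accessor to the buffer; nothing is parsed or copied
            long ka = Rec.getRootAsRec(a, left).key();
            long kb = Rec.getRootAsRec(b, right).key();
            return Long.compare(ka, kb);
        }
    }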

We are going to experiment with reading files via mmap rather than I/O, but we have not yet done so. It's tempting to find some way to call madvise(SEQUENTIAL) on the mmap. We're not sure what the other effects would be, but it may help us keep most or all of the data effectively off-heap during the merge phase.
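The shape we have in mind, for reference: mapping is straightforward from Java, but madvise is not, since the JDK doesn't expose it, so that part would need a native binding. A sketch (the run file name is made up):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    static MappedByteBuffer mapRun(String file) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(file), StandardOpenOption.READ)) {
            // Pages live in the OS page cache, not on the Java heap, and the mapping
            // stays valid after the channel is closed.
            MappedByteBuffer run = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // No madvise in the JDK; MADV_SEQUENTIAL would need a JNI/JNA binding along
            // the lines of: libc.madvise(addr, len, MADV_SEQUENTIAL); (hypothetical)
            return run;
        }
    }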

We have mastered all the (currently known) GC issues, thank you, Gil.

Fast access to the objects by id is not currently possible, although it's definitely an angle we could pursue. A major purpose of this sort is to merge identical objects, or data under the same key, so even if we did store by id, it would have to be a mutable store, which would bring its own issues.

We started with https://github.com/cowtowncoder/java-merge-sort and assumed that, given the simplicity of that implementation, it would be easy to do better. It turns out that its simplicity is not actually a significant limiting factor. Once the serialization is taken care of, though, a custom version of Guava's Iterators.mergeSorted() is somewhat better.
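For the archives, the merge phase ends up shaped roughly like this (a sketch: openSortedRuns(), keyOf(), mergeInto(), and emit() stand in for our record handling):

    import com.google.common.collect.Iterators;
    import com.google.common.collect.PeekingIterator;
    import java.nio.ByteBuffer;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    List<Iterator<ByteBuffer>> runs = openSortedRuns();  // placeholder: one iterator per sorted run
    Comparator<ByteBuffer> byKey = (a, b) -> Long.compare(keyOf(a), keyOf(b));

    PeekingIterator<ByteBuffer> merged =
        Iterators.peekingIterator(Iterators.mergeSorted(runs, byKey));

    while (merged.hasNext()) {
        ByteBuffer current = merged.next();
        // Collapse every record sharing the current key (the "merge" in sort-merge)
        while (merged.hasNext() && byKey.compare(merged.peek(), current) == 0) {
            current = mergeInto(current, merged.next());
        }
        emit(current);
    }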

S.


On 1/19/19 3:28 PM, Steven Stewart-Gallus wrote:
I'm really confused.

You're talking about putting the data into sqlite, which suggests there really isn't that much log data and it could be filtered with a hacky shell script. But then you're talking about a lot of heavy optimisation, which suggests you really may need to put in custom effort. Precisely how much log data actually needs to be filtered? You're unlikely to filter the data much faster than the system utilities, which are often very old and well-optimised C code. I'm reminded of the old story of the McIlroy and Knuth word-count programs.

Anyway, while this is a very enlightening discussion, it is probably worthwhile to reuse as many existing system utilities and as much existing code as you can instead of writing your own.
