shevek, Can you Please tell us the final solution that you implemented to fix the issue?
On Sunday, January 20, 2019 at 2:10:51 AM UTC-7, Shevek wrote: > > This project is very much in-progress. > > We need to sort about 1e13 records, several terabytes when compressed, > sort-merge, and end up with about 1e10 in sqlite. Right now, we are > running sqlite with 1e9 objects, and it isn't an issue. sqlite is much > better than one would naively believe it to be, if used appropriately. > Oddly enough, its VM is several times faster than pg, for IO-free raw > mathematical computation, too. > > Our current bottleneck is the serialization and allocation overhead of > protobuf. Many of the serializers recommended on this list can only > serialize fixed-size structures, but we're working on an implementation > with flatbuf right now. Thank you, Georges. flatbuf will also permit us > to avoid having a separate serialized copy of the sort-key. > > We are going to experiment with reading files via mmap rather than I/O, > but we have not yet done so. It's tempting to find some way to call > madvise(SEQUENTIAL) on the mmap. Not sure what the other effects are > likely to be, however, but it may help us keep most/all of the data > effectively off-heap during the merge phase. > > We have mastered all the (currently known) GC issues, thank you, Gil. > > Accessing the objects fast by id is not currently possible, although > it's definitely an angle we could pursue. A major purpose of this sort > is to merge identical objects, or data under the same key, so even if we > did store by id, it would have to be a mutable store, which would have > its own issues. > > We started with https://github.com/cowtowncoder/java-merge-sort and > assumed that due to the simplicity of that implementation, it would be > easy to do better, but it turns out that the simplicity of that > particular implementation is not actually a significant limiting factor. > However, it turns out that once one has done the serialization, a custom > version of Guava's Iterators.mergeSorted() is somewhat better. > > S. > > > On 1/19/19 3:28 PM, Steven Stewart-Gallus wrote: > > I'm really confused. > > > > You're talking about putting the data into sqlite which suggests there > > really isn't so much log data and it could be filtered with a hacky > > shell script. But then you're talking about a lot of heavy optimisation > > which suggests you really may need to put in custom effort. Precisely > > how much log data really needs to be filtered? You're unlikely to be > > able to filter much of the data faster than the system utilities which > > are often very old and well-optimised C code. I'm reminded about the old > > story of the McIlroy and Knuth word count programs. > > > > Anyway while this is a very enlightening discussion it is probably > > worthwhile to reuse as much existing system utilities and code as you > > can instead of writing your own. > > > > -- > > You received this message because you are subscribed to the Google > > Groups "mechanical-sympathy" group. > > To unsubscribe from this group and stop receiving emails from it, send > > an email to [email protected] > <javascript:> > > <mailto:[email protected] <javascript:>>. > > > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
