Re: Sorting a very large number of objects

Muruga Prasath Ganesan Tue, 12 Feb 2019 13:32:05 -0800

Shevek,
Can you please tell us the final solution that you implemented to fix the 
issue?


On Sunday, January 20, 2019 at 2:10:51 AM UTC-7, Shevek wrote:
>
> This project is very much in-progress. 
>
> We need to sort and merge about 1e13 records (several terabytes when 
> compressed) and end up with about 1e10 in sqlite. Right now, we are 
> running sqlite with 1e9 objects, and it isn't an issue. sqlite is much 
> better than one would naively believe it to be, if used appropriately. 
> Oddly enough, its VM is several times faster than pg, for IO-free raw 
> mathematical computation, too. 
>
> Our current bottleneck is the serialization and allocation overhead of 
> protobuf. Many of the serializers recommended on this list can only 
> serialize fixed-size structures, but we're working on an implementation 
> with flatbuf right now. Thank you, Georges. flatbuf will also permit us 
> to avoid having a separate serialized copy of the sort-key. 
>
> We are going to experiment with reading files via mmap rather than 
> explicit read() calls, but we have not yet done so. It's tempting to 
> find some way to call madvise(MADV_SEQUENTIAL) on the mmap. We are not 
> sure what the other effects are likely to be, but it may help us keep 
> most/all of the data effectively off-heap during the merge phase. 
>
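
For anyone following along, here is a minimal sketch of what reading a sorted run through mmap could look like in plain Java. It is not Shevek's code. The length-prefixed record format and the names MmapRunReader, writeRun and forEachRecord are assumptions made up for illustration. The standard library does not expose madvise as far as I know, so the sketch does not attempt it, and a single FileChannel.map() call cannot cover more than Integer.MAX_VALUE bytes, so a real run of this size would need several windows.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: records are stored as [int length][payload bytes].
final class MmapRunReader {

    /** Writes each string as a 4-byte big-endian length followed by its UTF-8 bytes. */
    static void writeRun(Path file, List<String> records) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (String record : records) {
                byte[] payload = record.getBytes(StandardCharsets.UTF_8);
                ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length);
                buf.putInt(payload.length);
                buf.put(payload);
                buf.flip();
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
            }
        }
    }

    /**
     * Maps the whole file read-only and hands each record to the consumer as a
     * view over the mapping, so the payload is never copied onto the Java heap.
     * Assumes the file is smaller than 2 GiB (one mapping).
     */
    static void forEachRecord(Path file, Consumer<ByteBuffer> consumer) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) {
                return; // nothing to map
            }
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            while (map.remaining() >= Integer.BYTES) {
                int length = map.getInt();
                ByteBuffer record = map.slice(); // view starting at the payload
                record.limit(length);            // restrict the view to this record
                consumer.accept(record);
                map.position(map.position() + length); // skip to the next record
            }
        }
    }
}
```

The consumer receives a view into the mapped region, so a flatbuf-style reader could in principle compare sort keys in place without materializing the record.
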
> We have mastered all the (currently known) GC issues, thank you, Gil. 
>
> Accessing the objects fast by id is not currently possible, although 
> it's definitely an angle we could pursue. A major purpose of this sort 
> is to merge identical objects, or data under the same key, so even if we 
> did store by id, it would have to be a mutable store, which would have 
> its own issues. 
>
> We started with https://github.com/cowtowncoder/java-merge-sort and 
> assumed that, because that implementation is so simple, it would be 
> easy to do better. It turns out that its simplicity is not actually a 
> significant limiting factor. Once the serialization is done, however, 
> a custom version of Guava's Iterators.mergeSorted() is somewhat better. 
>
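
To make the merge step concrete, here is a rough sketch of a k-way merge that also combines records sharing a key, which is the stated purpose of the sort. It is not Shevek's custom version of Guava's Iterators.mergeSorted(); it uses only java.util.PriorityQueue, and the names MergingKWayMerge, Head, combine and sink are invented for this example. It assumes every input run is already sorted by key.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;

// Hypothetical k-way merge over sorted runs that folds equal keys together.
final class MergingKWayMerge {

    /** The current head entry of one run, plus the iterator it came from. */
    private static final class Head<K, V> {
        final Map.Entry<K, V> entry;
        final Iterator<Map.Entry<K, V>> source;

        Head(Map.Entry<K, V> entry, Iterator<Map.Entry<K, V>> source) {
            this.entry = entry;
            this.source = source;
        }
    }

    static <K, V> void merge(List<? extends Iterator<Map.Entry<K, V>>> runs,
                             Comparator<? super K> order,
                             BinaryOperator<V> combine,
                             BiConsumer<? super K, ? super V> sink) {
        // Min-heap keyed on the head entry of each run.
        PriorityQueue<Head<K, V>> heap = new PriorityQueue<Head<K, V>>(
                Math.max(1, runs.size()),
                (a, b) -> order.compare(a.entry.getKey(), b.entry.getKey()));
        for (Iterator<Map.Entry<K, V>> run : runs) {
            if (run.hasNext()) {
                heap.add(new Head<>(run.next(), run));
            }
        }

        boolean pending = false; // true once currentKey/currentValue hold a record
        K currentKey = null;
        V currentValue = null;
        while (!heap.isEmpty()) {
            Head<K, V> head = heap.poll();
            K key = head.entry.getKey();
            V value = head.entry.getValue();
            if (pending && order.compare(currentKey, key) == 0) {
                // Same key as the pending record: merge instead of emitting twice.
                currentValue = combine.apply(currentValue, value);
            } else {
                if (pending) {
                    sink.accept(currentKey, currentValue);
                }
                currentKey = key;
                currentValue = value;
                pending = true;
            }
            // Refill the heap from the run we just consumed from.
            if (head.source.hasNext()) {
                heap.add(new Head<>(head.source.next(), head.source));
            }
        }
        if (pending) {
            sink.accept(currentKey, currentValue);
        }
    }
}
```

Only one entry per run is held in the heap at a time, so memory use grows with the number of runs rather than the number of records. Output goes to a sink callback rather than a list, since the merged result is far too large to collect in memory.
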
> S. 
>
>
> On 1/19/19 3:28 PM, Steven Stewart-Gallus wrote: 
> > I'm really confused. 
> > 
> > You're talking about putting the data into sqlite which suggests there 
> > really isn't so much log data and it could be filtered with a hacky 
> > shell script. But then you're talking about a lot of heavy optimisation 
> > which suggests you really may need to put in custom effort. Precisely 
> > how much log data really needs to be filtered? You're unlikely to be 
> > able to filter much of the data faster than the system utilities which 
> > are often very old and well-optimised C code. I'm reminded of the old 
> > story of the McIlroy and Knuth word count programs. 
> > 
> > Anyway, while this is a very enlightening discussion, it is probably 
> > worthwhile to reuse as many existing system utilities and as much 
> > existing code as you can instead of writing your own. 
> > 

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
