On Sep 22, 2019, at 15:28, Tim Peters <tim.pet...@gmail.com> wrote:
>
> That's not by accident - the inspiration for CPython's sort's basic
> "galloping" approach was taken from this paper, which wasn't about
> sorting at all:
>
> "Adaptive Set Intersections, Unions, and Differences" (2000)
> Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
>
> That's concerned with manipulating large sets represented as sorted
> sequences. They noticed too that data was "lumpy" in real life, and
> that paper was the start of a number of papers trying ways to take
> advantage of that.
>
> So if someone _really_ wants to go gonzo with this, that's a promising
> approach. Do note, though, that it doesn't fit naturally with
> iterators. "Galloping" relies on O(1) access to data at arbitrary
> index positions; it's a fancier variant of binary search.
I wonder if you could get any benefit by using something like a skiplist-rope hybrid, where you have effectively O(log n) random access rather than O(1), and can only copy huge chunks with O(log n) large memcopies instead of one huge one. Would those costs be enough to sink the benefits of galloping? It wouldn't be usable in all cases, but there are many cases where your huge input is naturally iterable in this way (a set of files to concatenate, a huge file that you can seek in or mmap chunks of, maybe even a semi-lazy rope that you got from some Rust or Haskell library via ctypes). If you know the size of each iterator, you can build the first level of an approximate skiplist, and build the rest up on the fly to approximate binary search well enough (I think) for the purpose. (There's a rough sketch of the files-to-concatenate case near the end of this message.)

Anyway, if I really did want to do these operations on massive datasets, I'd look at pandas+HDF5 to see what it can give me before considering implementing anything from scratch. But it could be a fun project…

A few further half-formed thoughts on this:

For a single huge file treated this way, you'd probably want to look at ripgrep. The author started off with the conventional wisdom on when to read vs. mmap vs. windowed-mmap and quickly discovered that all of the conventional wisdom was so badly out of date as to be useless…

Could you get any benefit out of parallelizing the gallop seekaheads? Probably not for 1-10 million elements, but we are talking about huge datasets here.

If you have an iterator extended with a nextn method that can yield a large chunk of data at a time (which a file can do if you're treating it as a sequence of bytes or characters, but not if you're treating it as a sequence of lines, at least without a second pass…), you could build the same thing. But the cost is that you might easily end up using temporary N/2 memory, which defeats most of the benefits of using arbitrary iterables. (Also sketched at the end of this message.)

C++, Swift, etc. don't have a notion of a "bisectable iterator", where, like a bidirectional iterator, it still takes O(m) time to jump forward or backward m steps, but it takes only O(1) time to jump (approximately?) halfway between two known positions. You could easily build that on top of anything random-access, and also on top of anything logarithmic like a tree or skiplist, and it seems like there are useful things you can do with it. Or maybe almost all such things are worth specializing differently for random-access vs. inherently-logarithmic data structures, or maybe it's just too hard to come up with a proper useful abstraction?
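To make the bisectable-iterator idea concrete, here's a minimal Python sketch of what such an interface might look like, with positions treated as opaque tokens. All of the names here (Bisectable, midpoint, find_ge, and so on) are invented for illustration, not an existing API; a real skiplist- or tree-backed implementation would hand back node references instead of plain indices:

    from abc import ABC, abstractmethod

    class Bisectable(ABC):
        """The 'bisectable iterator' idea: positions are opaque tokens."""

        @abstractmethod
        def start(self):
            """The first position."""

        @abstractmethod
        def end(self):
            """The one-past-the-end position."""

        @abstractmethod
        def midpoint(self, lo, hi):
            """A position roughly halfway between lo and hi, or None
            if there is no position strictly between them."""

        @abstractmethod
        def at(self, pos):
            """The value stored at pos."""

    class ListBisectable(Bisectable):
        """The trivial random-access case: positions are plain indices."""

        def __init__(self, data):
            self._data = data

        def start(self):
            return 0

        def end(self):
            return len(self._data)

        def midpoint(self, lo, hi):
            return (lo + hi) // 2 if hi - lo > 1 else None

        def at(self, pos):
            return self._data[pos]

    def find_ge(b, x):
        """First position whose value is >= x, or b.end() if none.

        Written purely against the Bisectable interface, so the same
        code would run unchanged over a tree or skiplist backend.
        """
        lo, hi = b.start(), b.end()
        if lo == hi or b.at(lo) >= x:
            return lo
        # Invariant: b.at(lo) < x, and everything from hi onward is >= x.
        while True:
            mid = b.midpoint(lo, hi)
            if mid is None:
                return hi
            if b.at(mid) < x:
                lo = mid
            else:
                hi = mid

    # e.g. find_ge(ListBisectable([1, 3, 3, 7, 20]), 7) returns position 3.

Galloping set intersection would then just be repeated find_ge calls, bouncing each side's current value off the other side.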
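And here's the files-to-concatenate sketch promised earlier. Again, ConcatReader and read_at are made-up names, and this is only a sketch of the idea: knowing each piece's size up front gives you a "first level" you can bisect on, so random access into the logical concatenation costs one O(log n_files) binary search plus one seek, rather than a linear scan:

    import bisect
    import os

    class ConcatReader:
        """Random access into the logical concatenation of several files."""

        def __init__(self, paths):
            self._paths = list(paths)
            # The "first level" of the index: the cumulative starting
            # offset of each file within the concatenation.
            self._starts = [0]
            for p in self._paths:
                self._starts.append(self._starts[-1] + os.path.getsize(p))

        def __len__(self):
            return self._starts[-1]

        def read_at(self, offset, n):
            """Read up to n bytes starting at a logical offset."""
            out = []
            while n > 0 and offset < len(self):
                # Bisect the offset table to find the covering file.
                i = bisect.bisect_right(self._starts, offset) - 1
                with open(self._paths[i], "rb") as f:
                    f.seek(offset - self._starts[i])
                    chunk = f.read(n)
                if not chunk:
                    break
                out.append(chunk)
                offset += len(chunk)
                n -= len(chunk)
            return b"".join(out)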
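Finally, the nextn idea, hedged the same way: nextn isn't a real protocol anywhere that I know of, but a wrapper can fake it, grabbing a big chunk in one call when the source supports that and falling back to n individual steps (and the temporary-memory cost that implies) when it doesn't:

    import itertools

    class NextN:
        """Wrap a source so callers can always ask for a chunk at once."""

        def __init__(self, source):
            self._source = source

        def nextn(self, n):
            read = getattr(self._source, "read", None)
            if read is not None:
                return read(n)  # e.g. a binary file: one big read
            # Fallback for a plain iterator: n individual next() calls,
            # which is where the temporary N/2 memory can creep in.
            return list(itertools.islice(self._source, n))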