On Sep 22, 2019, at 15:28, Tim Peters <tim.pet...@gmail.com> wrote:
> 
> 
> That's not by accident - the inspiration for CPython's sort's basic
> "galloping" approach was taken from this paper, which wasn't about
> sorting at all:
> 
>    "Adaptive Set Intersections, Unions, and Differences" (2000)
>    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
> 
> That's concerned with manipulating large sets represented as sorted
> sequences.  They noticed too that data was "lumpy" in real life, and
> that paper was the start of a number of papers trying ways to take
> advantage of that.
> 
> So if someone _really_ wants to go gonzo with this, that's a promising
> approach.  Do note, though, that it doesn't fit naturally with
> iterators.  "Galloping" relies on O(1) access to data at arbitrary
> index positions; it's a fancier variant of binary search

I wonder if you could get any benefit by using something like a skiplist-rope 
hybrid, where you have effectively O(log n) random access rather than O(1), and 
copying a huge chunk takes O(log n) large memcpys instead of a single one. 
Would those costs be enough to sink the benefits of galloping?
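
For concreteness, here’s a minimal sketch of the galloping probe itself, 
assuming the usual exponential-then-binary strategy. It only needs __len__ and 
__getitem__, so it runs unchanged whether indexing is O(1) (a list, an array, 
an mmap) or O(log n) (a hypothetical skiplist-rope); the log factor just 
multiplies the cost of each probe:

    def gallop_right(seq, x, lo=0):
        """Index of the first element of seq[lo:] that is greater than x."""
        n = len(seq)
        if lo >= n or x < seq[lo]:
            return lo
        # Exponential phase: probe lo+1, lo+2, lo+4, ... until we pass x.
        offset = 1
        while lo + offset < n and seq[lo + offset] <= x:
            offset *= 2
        hi = min(lo + offset, n)
        lo = lo + offset // 2      # last probe known to be <= x
        # Binary phase: maintain seq[lo] <= x < seq[hi] (seq[n] acting as +inf).
        while lo + 1 < hi:
            mid = (lo + hi) // 2
            if seq[mid] <= x:
                lo = mid
            else:
                hi = mid
        return hi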

It wouldn’t be usable in all cases, but there are many cases where your huge 
input is naturally iterable in this way (a set of files to concatenate, a huge 
file that you can seek in or mmap chunks of, maybe even a semi-lazy rope that 
you got from some Rust or Haskell library via ctypes). If you know the size of 
each iterator, you can build the first level of an approximate skiplist up 
front and build the rest on the fly, which I think approximates binary search 
well enough for the purpose.
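
To make the “first level” idea concrete: if the chunk sizes are known up 
front, a plain cumulative-offset table already gives O(log k) jumps to roughly 
the right chunk, and finer levels could be filled in lazily as regions get 
probed. A rough sketch; the (size, fetch) pair is an interface I’m inventing 
here, where fetch(i) is whatever gets item i out of that one chunk (a 
seek+read, an mmap slice, ...):

    import bisect
    from itertools import accumulate

    class ChunkedIndex:
        def __init__(self, chunks):
            sizes = [size for size, _ in chunks]
            self.fetchers = [fetch for _, fetch in chunks]
            self.starts = [0] + list(accumulate(sizes))   # cumulative offsets

        def __len__(self):
            return self.starts[-1]

        def __getitem__(self, i):
            # O(log k) to locate the chunk, plus whatever fetch costs inside it.
            k = bisect.bisect_right(self.starts, i) - 1
            return self.fetchers[k](i - self.starts[k])

Feed something like that to the gallop_right sketch above and each probe stays 
O(log k) plus the cost of the underlying chunk access.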

Anyway, if I really did want to do these operations on massive datasets, I’d 
look at pandas+HDF5 to see what it can give me before considering implementing 
anything from scratch. But it could be a fun project…

A few further half-formed thoughts on this:

For a single huge file treated this way, you’d probably want to look at 
ripgrep. The author started off with the conventional wisdom on when to read 
vs. mmap vs. windowed-mmap and quickly discovered that all of the conventional 
wisdom was so badly out of date as to be useless…

Could you get any benefit out of parallelizing the gallop seekaheads? Probably 
not for 1-10 million, but we are talking about huge datasets here.

If you have an iterator extended with a nextn method that can hand over a large 
chunk of data at a time (which a file can do if you’re treating it as a 
sequence of bytes or characters, but not if you’re treating it as a sequence of 
lines, at least without a second pass…), you could build the same thing, but 
the cost is that you might easily end up using N/2 temporary memory, which 
defeats most of the benefit of using arbitrary iterables.
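
Something like this is the shape I have in mind for nextn; all the names are 
invented, and the file-backed version is exactly where the temporary-memory 
problem shows up:

    from itertools import islice

    def nextn(it, n):
        """Generic fallback: up to n items from any iterator, as a list."""
        return list(islice(it, n))

    class FileBytes:
        """A binary file viewed as an iterator of bytes, with a cheap nextn."""
        def __init__(self, f, block=1 << 20):
            self.f, self.block = f, block

        def __iter__(self):
            while True:
                chunk = self.f.read(self.block)
                if not chunk:
                    return
                yield from chunk

        def nextn(self, n):
            # One read() instead of n __next__ calls -- but the caller now
            # holds the whole chunk, which is where the N/2 temporary
            # memory cost comes from.
            return self.f.read(n)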

C++, Swift, etc. don’t have a notion of a “bisectable iterator”: like a 
bidirectional iterator it still takes O(m) time to jump forward or backward m 
steps, but it only takes O(1) time to jump (approximately?) halfway between two 
known positions. You could easily build that on top of anything random-access, 
and also on top of anything logarithmic like a tree or skiplist, and it seems 
like there are useful things you can do with it. Or maybe almost all such 
things are worth specializing differently for random-access vs. 
inherently-logarithmic data structures, or maybe it’s just too hard to come up 
with a properly useful abstraction?
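
As a strawman, here’s what “bisectable” might look like in Python, using byte 
offsets into a sorted text file as the positions: jumping forward m lines is 
still O(m), but getting a position roughly halfway between two known positions 
is O(1), by seeking to the byte midpoint and resynchronizing on the next 
newline. All names are invented; the file has to be opened in binary mode and 
sorted by line, comparisons are on the raw lines (including the newline), and 
the result is only approximate, so the caller would finish with a short linear 
scan:

    class BisectableLines:
        def __init__(self, f, size):
            self.f = f            # binary file, lines in sorted order
            self.size = size      # total size in bytes

        def line_at(self, pos):
            """First complete line starting at or after byte offset pos."""
            self.f.seek(pos)
            if pos:
                self.f.readline()         # discard the partial line we landed in
            return self.f.readline()

        @staticmethod
        def midpoint(lo, hi):
            return (lo + hi) // 2         # O(1), and only approximately "halfway"

    def approx_bisect(bl, x, lo=0, hi=None):
        """Smallest probed offset whose following line is >= x (approximate)."""
        if hi is None:
            hi = bl.size
        while hi - lo > 1:
            mid = bl.midpoint(lo, hi)
            line = bl.line_at(mid)
            if line and line < x:
                lo = mid
            else:
                hi = mid
        return hi        # caller finishes with a short linear scan from lo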
