On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert <abarn...@yahoo.com> wrote:
> One case I think you didn’t test is when the strings are generated in
> already-sorted order. In that case, as opposed to the case where you
> generate in random order and then sort, I think the PyUnicode objects and
> the actual character arrays will both be mostly contiguous (at least in a
> fresh process)—although I think the only way you could verify that is by
> using ctypes (or a C extension) to peek into the internal PyUnicode structs
> and look at the appropriate pointers. Anyway, the access pattern should be
> more than simple enough for the CPU to effectively cache-prefetch at all
> three levels, which could be a significant speedup for the sorted-list
> algorithms over the set algorithms.

I did not benchmark that case, because the original code from the OP has
explicit sorting built in, so I simply concluded it did not apply.
Nevertheless...

> And this isn’t entirely an artificial scenario; sometimes large numbers of
> strings really are generated in sorted order and in large batches like
> this—e.g., the filesystem, or a network client, or whatever is handing you
> data in sorted order and you put it right into a list of strings. So, it
> might be worth testing.

For the sake of completeness, I also ran some benchmarks with already-sorted
strings. I chose the following strings:

In [3]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]

In [4]: b = [f'String#{i:08}' for i in range(1, 20000001, 2)]

In [5]: a[:5]
Out[5]: ['String#00000000', 'String#00000002', 'String#00000004', 'String#00000006', 'String#00000008']

In [6]: b[:5]
Out[6]: ['String#00000001', 'String#00000003', 'String#00000005', 'String#00000007', 'String#00000009']

First with the set ops:

In [7]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[7]: 3.9231181000000106

In [8]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[8]: 2.642592999999991

In [9]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[9]: 2.648031700000004

This does not really differ from the previous results.

Then with `list_relcomp` (note the two `False` arguments, which disable the
sorting):

In [10]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]

In [11]: b = [f'String#{i:08}' for i in range(1, 20000001, 2)]

In [12]: timeit('list_relcomp(a, b, False, False)', number=1, globals=locals())
Out[12]: 4.851956800000011

In [13]: timeit('list_relcomp(a, b, False, False)', number=1, globals=locals())
Out[13]: 4.8776169999999865

One interesting thing I noticed: when running `list_relcomp` with a trivial
input, e.g. where a == c but the two lists are distinct objects,

In [16]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]

In [17]: c = [f'String#{i:08}' for i in range(0, 20000000, 2)]

In [18]: timeit('list_relcomp(a, c, False, False)', number=1, globals=locals())
Out[18]: 2.8997035999999525

it ends up running much faster, because it does not have to build the
resulting list (the set implementation's performance does not change much in
this case), whereas in the first setup everything in `a` goes into the
result. So there are many details like this, but unless the use case can
guarantee them, I am not sure they are worth speculating about.
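For anyone who did not see the earlier post, the step-compare idea behind
`list_relcomp` is roughly the following (a simplified sketch only, not the
exact implementation from the thread: the name is mine, the sorting flags
are omitted, both inputs are assumed pre-sorted, and set semantics are
assumed for duplicates):

def relcomp_sorted(a, b):
    # Relative complement A \ B for two already-sorted lists, done by
    # stepping through both lists in lockstep instead of hashing.
    result = []
    i = j = 0
    len_a, len_b = len(a), len(b)
    while i < len_a and j < len_b:
        if a[i] < b[j]:
            # a[i] can no longer appear in b, so it belongs to the result.
            result.append(a[i])
            i += 1
        elif a[i] > b[j]:
            # b has fallen behind; advance it until it catches up.
            j += 1
        else:
            # a[i] is present in b, so skip it.
            i += 1
    # Whatever is left in a is greater than everything in b.
    result.extend(a[i:])
    return result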
> But I think the real win for already-sorted data is laziness. With sets, at
> least one of your data sets has to be strict, so you can put it in a set.
> With sorted lists, with a minor rewrite to the algorithms (which I posted
> earlier for intersection), you can use them directly on iterators instead.
>
> Imagine you’re reading 7GB worth of (already-sorted) strings and comparing
> to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap
> space. Clearly the set, sorted list, or even anything clever that pandas
> might do, they’ll all throw your machine into swap hell, which will
> probably swamp the cost of using step-compare instead of sets and of using
> iterators instead of lists. This one might be harder to benchmark (because
> the cost of the iterator is going to end up threaded through the cost of
> the algorithm), but it seems like it should be a huge win over sets.

If you move everything to iterators, and maybe even interleave the
operations with some async I/O, it may be quite effective. But talking about
the algorithm alone, I am not sure the pre-sorted lists have an advantage
even then, at least not in this implementation; maybe when implemented in C?
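To make that concrete, the lazy variant is indeed a fairly small rewrite;
something along these lines (again only a rough sketch under my own naming,
not a tuned implementation) consumes two already-sorted iterables and never
materializes either side:

def relcomp_lazy(a, b):
    # Lazy step-compare A \ B over two already-sorted iterables.
    # Values are consumed one at a time and yielded as soon as they are
    # known not to occur in b, so neither input has to fit in memory.
    sentinel = object()
    it_b = iter(b)
    cur_b = next(it_b, sentinel)
    for x in a:
        # Advance b until it is exhausted or cur_b >= x.
        while cur_b is not sentinel and cur_b < x:
            cur_b = next(it_b, sentinel)
        if cur_b is sentinel or x < cur_b:
            yield x
        # Otherwise x == cur_b, i.e. x occurs in b, so it is skipped.

With the benchmark data above, `list(relcomp_lazy(a, b))` gives the same
result as the eager version, but the generator could just as well be fed
from a file or a socket and consumed incrementally, which is where the win
in the 7GB-vs-2GB scenario would come from.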
Richard M.