On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert <abarn...@yahoo.com> wrote:
> One case I think you didn’t test is when the strings are generated in
> already-sorted order. In that case, as opposed to the case where you
> generate in random order and then sort, I think the PyUnicode objects and
> the actual character arrays will both be mostly contiguous (at least in a
> fresh process)—although I think the only way you could verify that is by
> using ctypes (or a C extension) to peek into the internal PyUnicode structs
> and look at the appropriate pointers. Anyway, the access pattern should be
> more than simple enough for the CPU to effectively cache-prefetch at all
> three levels, which could be a significant speedup for the sorted-list
> algorithms over the set algorithms.

I did not benchmark that case, because the original code from the OP has
explicit sorting built in, so I simply concluded it did not apply.
Nevertheless...

> And this isn’t entirely an artificial scenario; sometimes large numbers of
> strings really are generated in sorted order and in large batches like
> this—e.g., the filesystem, or a network client, or whatever is handing you
> data in sorted order and you put it right into a list of strings. So, it
> might be worth testing.

For the sake of completeness, I also ran some benchmarks with already-sorted
strings. I chose the following strings:

In [3]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]

In [4]: b = [f'String#{i:08}' for i in range(1, 20000001, 2)]

In [5]: a[:5]
Out[5]: ['String#00000000', 'String#00000002', 'String#00000004', 'String#00000006', 'String#00000008']

In [6]: b[:5]
Out[6]: ['String#00000001', 'String#00000003', 'String#00000005', 'String#00000007', 'String#00000009']

First with the set ops:

In [7]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[7]: 3.9231181000000106

In [8]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[8]: 2.642592999999991

In [9]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[9]: 2.648031700000004

This does not really differ from the previous results.

Then with `list_relcomp` (note the two `False` arguments, which disable the
sorting):

In [10]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]

In [11]: b = [f'String#{i:08}' for i in range(1, 20000001, 2)]

In [12]: timeit('list_relcomp(a, b, False, False)', number=1, globals=locals())
Out[12]: 4.851956800000011

In [13]: timeit('list_relcomp(a, b, False, False)', number=1, globals=locals())
Out[13]: 4.8776169999999865

One interesting thing I noticed: when running `list_relcomp` with a trivial
input, e.g. where a == c but the two lists are distinct objects,

In [16]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]

In [17]: c = [f'String#{i:08}' for i in range(0, 20000000, 2)]

In [18]: timeit('list_relcomp(a, c, False, False)', number=1, globals=locals())
Out[18]: 2.8997035999999525

it ends up running much faster, because it does not have to build the
resulting list (the set implementation's performance does not change much in
this case), whereas in the first setup everything in `a` goes into the
result. So there are many details like this, but unless the use case can
guarantee them, I am not sure they are worth speculating about.
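For anyone who did not see the earlier post, the step-compare idea behind
`list_relcomp` is roughly the following (a simplified sketch only, not the
exact implementation from the thread: the name is mine, the sorting flags
are omitted, both inputs are assumed pre-sorted, and set semantics are
assumed for duplicates):

def relcomp_sorted(a, b):
    # Relative complement A \ B for two already-sorted lists, done by
    # stepping through both lists in lockstep instead of hashing.
    result = []
    i = j = 0
    len_a, len_b = len(a), len(b)
    while i < len_a and j < len_b:
        if a[i] < b[j]:
            # a[i] can no longer appear in b, so it belongs to the result.
            result.append(a[i])
            i += 1
        elif a[i] > b[j]:
            # b has fallen behind; advance it until it catches up.
            j += 1
        else:
            # a[i] is present in b, so skip it.
            i += 1
    # Whatever is left in a is greater than everything in b.
    result.extend(a[i:])
    return result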
> But I think the real win for already-sorted data is laziness. With sets, at
> least one of your data sets has to be strict, so you can put it in a set.
> With sorted lists, with a minor rewrite to the algorithms (which I posted
> earlier for intersection), you can use them directly on iterators instead.
>
> Imagine you’re reading 7GB worth of (already-sorted) strings and comparing
> to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap
> space. Clearly the set, sorted list, or even anything clever that pandas
> might do, they’ll all throw your machine into swap hell, which will
> probably swamp the cost of using step-compare instead of sets and of using
> iterators instead of lists. This one might be harder to benchmark (because
> the cost of the iterator is going to end up threaded through the cost of
> the algorithm), but it seems like it should be a huge win over sets.

If you move everything to iterators, and maybe even interleave the
operations with some async I/O, it may be quite effective. But talking about
the algorithm alone, I am not sure the pre-sorted lists have an advantage
even then, at least not in this implementation; maybe when implemented in C?
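To make that concrete, the lazy variant is indeed a fairly small rewrite;
something along these lines (again only a rough sketch under my own naming,
not a tuned implementation) consumes two already-sorted iterables and never
materializes either side:

def relcomp_lazy(a, b):
    # Lazy step-compare A \ B over two already-sorted iterables.
    # Values are consumed one at a time and yielded as soon as they are
    # known not to occur in b, so neither input has to fit in memory.
    sentinel = object()
    it_b = iter(b)
    cur_b = next(it_b, sentinel)
    for x in a:
        # Advance b until it is exhausted or cur_b >= x.
        while cur_b is not sentinel and cur_b < x:
            cur_b = next(it_b, sentinel)
        if cur_b is sentinel or x < cur_b:
            yield x
        # Otherwise x == cur_b, i.e. x occurs in b, so it is skipped.

With the benchmark data above, `list(relcomp_lazy(a, b))` gives the same
result as the eager version, but the generator could just as well be fed
from a file or a socket and consumed incrementally, which is where the win
in the 7GB-vs-2GB scenario would come from.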
Richard M.