[Python-ideas] Re: Set operations with Lists

Andrew Barnert via Python-ideas Thu, 19 Sep 2019 11:56:51 -0700

On Sep 18, 2019, at 12:29, Richard Higginbotham <higgi...@gmail.com> wrote:
> 
> I'm not sure if there is any interest by others but I have frequently come 
> across cases where I would like to compare items in one list in another 
> similar to relational algebra.


For anyone who doesn’t want to read this whole reply:

I think a step-compare function is useful often enough. and confuses novices 
often enough (if StackOverflow is any guide), that it might belong in the 
stdlib. But I think it should be in itertools, and should work on arbitrary 
iterables that are assumed to be pre-sorted (an assumption groupby already 
makes, and explains in the docs). And if it’s not in more-itertools and/or 
toolz, maybe it’s worth submitting it to them first and seeing how much people 
use it.

Now on to why:

To start with, itertools is already about providing an iterator algebra that 
isn’t identical to relational algebra, but is a lot closer than what the list 
builtin provides.

> For example are the file names in A in B and if so return a new list with 
> those items. Long story short, I wrote some functions to do that. They are 
> quite simple and fast (due to timsort in no small part).

> Even the plain python code is faster than the built in set functions (afaik).

You’re testing pretty small values in a very narrow use case, and testing them 
inaccurately by using datetime instead of timeit (the fact that some of your 
results are 0 vs. 0 should be a clue…), and you haven’t even given us the 
results, just claimed that they’re faster.

When I run my own test on lists of 10000 strings made up 5 random lowercase 
characters, I get 485us for set(a).intersection(b), and 6609us for 
list_intersection(a, b), so it’s actually more than 10x slower.

What if the values are already sorted, or already in sets? Then I get 175us for 
a&b, 3250us for list_intersection(a, b, False, False), so again it’s more than 
10x slower.

Maybe it’s because my set has a lot more duplicates. What I do 50-character 
strings, and verify that there are no dups? Now it’s 525us vs. 6699us, or 225us 
vs. 2910us for pre-prepared values—a bit closer, but still more than 10x slower.

And meanwhile, what if I simulate long pathnames with a common prefix? Now I 
get 525us vs. 10200us.

> I created a github and put the ones I thought the community might like in 
> there. https://github.com/ponderpanda/listex
> 
> an example would be
> a = [1,2,3,4,5,6]
> b = [1,3,7,4]
> list_intersection(a,b, sorta=True, sortb=True)
> 
> returns [1,3,4]
> 
> The complexity ends up being about the longer of list a or b.

How can that be true? Sorting a takes nlogn steps, sorting b takes mlogm steps, 
step-comparing then takes n+m steps, so that’s O(nlogn + mlogm * n + m) = 
O(nlogn), which is bigger than O(n).

By comparison, making a set of b takes m steps, iterating over a and looking up 
each value in the hash takes n steps, so it’s O(n+m) = O(n), which is smaller 
than O(nlogn).

It’s possible that the constant multiplier for hash lookups is so much higher 
than the multiplier for comparing and nexting that you’d need to get into the 
zillions before the logn factor outweighs the multiplier, in which case you 
could argue that it’s a good optimization despite having worse complexity. But 
you can’t argue that it has better complexity. Also, if that lower constant 
really is critical to performance, surely it’s not applicable to all values, as 
the long-pathname case demonstrates.

Plus, given that your code is about 10x as slow even _without_ the sort, it 
doesn’t seem to be true that the multiplier is better in the first place.

Also, forgetting about performance, sort-and-step doesn’t work on values that 
aren’t comparable, while set does. On the other hand, set doesn’t work on 
values that aren’t hashable, while sort-and-step does. Strings and integers are 
both, but lots of types are only one or the other (list, complex). For 
something general-purpose enough that you want to add it to builtins, that 
seems at least worth mentioning.

> if they are over some thousands the speed difference with set() operations 
> becomes significant. There are some details about naming, unique values (not 
> necessary), and sort order that would probably need to be ironed out if they 
> were to be included with the built in list type.

Why would you want this as a method of list rather than a function that works 
just as well on any sortable iterables? The reason we have a sort method (as 
well as, not instead of, a sorted function) is to allow sorting in-place, which 
isn’t relevant here. We don’t have methods for zip, filter, etc.

Also, in most cases where I’ve wanted to intersect two lists of file names, I 
want to preserve the order of the first “master” list. You can’t do that with 
sort-and-step except by doing something pretty complicated (e.g., decorating 
with enumerate, using a custom compare that ignores the indexes in your 
intersection, then re-sorting the result by index and undecorating), but it’s 
trivial with sets:

    filterset = set(b)
    c = [val for val in a if val in filterset]

… or …

    c = filter(set(b).__contains__, a)

And this is still O(n+m) time, just like set(a)&set(b) would be, and with a 
faster constant because we’re only setifying one of the two iterables (and 
often the smaller one).

And this has the advantage that the potentially huge “master” input can be a 
lazy Iterator. If a is a file listing the first names of each American, and b 
is a much smaller set of the legal names for French children, loading that 
whole file into memory to sort it would be a bad idea, and it’s not necessary 
if you use sets.

Also, the last time I wanted to intersect two lists of file names was to diff 
two directories, so I needed a-b and b-a, not just a&b. Is there really a 
compelling use case for intersection that isn’t also a compelling use case for 
at least some of the other set operations? In fact, you’re arguing for just 
intersection, but then provided an implementation for some of the others. Which 
do you want, and why that particular set?

Also, if you care about duplicates (and if you don’t care about either order or 
duplicates, why aren’t you just using a set in the first place?), you need to 
specify whether this is multiset intersection (so if X appears 5 times in A and 
2 times in B it shows up 2 times in A&B), or set intersection (so it shows up 
just once). The most obvious way of doing sort-and-step is going to do neither 
of those (it’ll be 5, because it appears 5 times in A and each of those has a 
matching value in B); you could implement either pretty easily, but only if you 
decide which you want and then do it. With sets, you explicitly make the choice 
of set or Counter, and then whether you get set or multiset behavior is 
obvious; here, it’s not obvious, and if you want it to be a choice you have to 
add another flag argument.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/O6UVV7KVN23LRLZEZYSYKN2REP7WAQIZ/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Set operations with Lists

Reply via email to