I'm about to write some code to manage sets, and wanted to float a few thoughts here because I have various ideas about how to implement what I want to do, and I think one of them could be done by changing Python's set type in useful and backward compatible way.
Apologies if this is discussed in the archives. I didn't see it. My original motivation is a need to work efficiently with very large sets, let's say 10M elements. In some cases, you can operate on a set without needing to explicitly represent all its elements. For example, if my set is the Universal (U) set, then X in U, should return True. If my set is U and I union it with any other set, it is unchanged, etc. By also maintaining a finite set of things that are _not_ in my large set, I can do other operations efficiently. E.g., calling U.remove(X) would _add_ X to the set of elements that were known not to be in the big set that you were trying to represent efficiently. In most cases, such a set class would simply do the opposite/inverse operation on its finite exclusion set. While it's convenient to warm up by talk about the infinite Universal set, we could just as easily be talking about arbitrarily large finite sets. There are some examples below. Another way to discuss the same thing is to talk about set complements. It would be nice to be able to complement (or invert) a set. Once you did so, you might have an infinite set on your hands, but as the last paragraph argues, you can still operate on an infinite set efficiently. Naturally, you can't fully enumerate it, or take its size. I think that giving Python sets the ability to handle complementarity would have some nice uses - and that these go well beyond the particular thing that I need right now. For example, suppose you want to represent the set of all floating point numbers. You just create an empty set and complement it. Then you can test it as usual, and begin to exclude things from it. Or if you have a system with 10M objects and you need to support operations on these objects via a query language that lets the user type "not X" where X is a set of objects, there is a natural and efficient way (constant time) to execute the query - you just mark the set as being inverted (complemented). If I were going to implement the above by changing Python's built in set type, here's what I think I'd do: Add four optional keyword arguments to the set constructor: - invert=False - universalSet=None - universalSetSize=None - universeExclusionFunc=None invert is just a flag, indicating the sense of the set. If inverse is False: the set behaves exactly as a normal Python set and the other three new arguments are ignored. If inverse is True: universalSet represents the universal set. If it is None, then the universal set is infinite. Otherwise, universalSet is an iterable. The implementation should not call this iterable unless it's unavoidable, on the presumption that if the programmer wanted to operate directly on the contents of this iterable, they'd be doing it in a conventional fashion (e.g., with a normal set). The universalSetSize, used for len() calculations, is the number of elements in universalSet, if known & if finite. The universeExclusionFunc can be called with a single element argument to see if the element should be considered excluded from the universalSet. This may seem like a weird idea, but it's the key to flexibility and efficiency. In an invert=True set, the code would use the normal set content as a mutable set of objects that are not in the universe, as described above, but would, in addition, use the universeExclusionFunc (if defined) to identify elements not in the set (because they are not in the universe), and thus avoid the use of the expensive (or infinite) universalSet. Note that an inverted (or normal) set can be inverted by simply setting invert=False, so this requires a new invert() method (which could be triggered by the use of something like 'not' or '!' or '~'). In this case, an inverted set becomes a normal Python set. The elements represented by universeExclusionFunc remain invalid - they are simply not in the universe (and, if deemed sensible, this could be sanity checked in add()). If it's not clear, when a set is inverted, any iterable given to __init__, (i.e., the iterable argument in the normal case of constructing a Python set), is just treated as usual (but in this case will be taken to represent things not in the set initially). Here are some examples of usage: 1. Suppose you want to work with the set of integers [0, 100000000) and that initially your set is all such integers. You could create this set via: S = set(invert=True, universalSet=xrange(100000000), universalSetSize=100000000, universeExclusionFunc=(lambda x: x >= 100000000) This has the intended effect, is efficient, and no-one need call the iterator. You can (I think) do everything you could do with a normal set (including inverting it). If the user happens to want to list all the elements in the set, that's their business and they can do it, except in the case where universalSet=None, in which case I would raise an exception. 2. Suppose you wanted to represent the set of all floating point numbers > 0.0 that were not integers. You'd do this via: def excludeFunc(f): try: f = float(f) except ValueError: return True if f <= 0.0: return True if f - math.floor(f) == 0.0: return True return False S = set(invert=True, universeExclusionFunc=excludeFunc) 3. Let's say you wanted to work with the set of all IP addresses, as represented by lists of 4 integers, but that you want to exclude 127.0.0.1, and all the reserved RFC1918 address blocks def excludeFunc(ip): # Insert code to check type and length. if (ip == [127, 0, 0, 1]): return True if (ip[0] == 10 or (172, 16, 0, 0) <= ip <= (172, 31, 255, 255) or (191, 168, 0, 0) <= ip <= (192, 168, 255, 255)): return True # Insert code to check all bytes in [0..255], return True if not. return False S = set(invert=True, universeExclusionFunc=excludeFunc) It's pretty easy to think of other examples of wanting to efficiently use big sets, or to use small sets and then needing to complement them and operate efficiently on their large complements. One reaction to these examples might be that it's no big deal to write your own code to, say, manage sets of IP addresses. I think that's not so easy to do efficiently if the set is massive (and you'd end up keeping track of the complement or using offline storage). Doing it as suggested above makes it simple, and once you call the constructor, you just do normal Python set operations. This suggestion goes a little way towards giving "continuous" sets. One way to categorize set support is: A. Only support finite sets. B. Provide category A but also support infinite sets where at least one of the set and its complement is finite. C. Provide category B but also allow for the case where a set and its complement may both be infinite. Thus I am proposing something that would take set support in Python from category A to category B. Implementing B efficiently is provably possible since all operations are done on either a set or its complement, whichever is known to be finite, and Python already handles those operations (via its category A support). The downsides that I see at the moment are: 1. Two new exceptions that can arise. These only happen on sets that are in the inverted state, and only when the universal set is infinite: - You try to enumerate an infinite set. - You call len() on an infinite set. 2. Operations on regular sets get slightly slower because virtually all set operations would need to test the invert flag. So that's an extra Boolean test on every set operation. I don't think the implementation is too hard - the Python sets.py code is very clean - though there may be gotchas. Anyway, I wanted to see what people think of this before I go ahead and solve my actual problem. If the above is considered potentially worthwhile to Python, I can work on adapting sets.py. If not, I'll do the faster and simpler thing and write my own class. Regards, Terry _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com