Hm... Without reading though all this, I expect that you'd be better off implementing this for yourself without attempting to pull the standard library sets into the picture (especially since sets.py is obsolete as of 2.4; set and frozenset are now built-in types). You're really after rather specialized set representations. I've done this myself and as long as you stick to solving *just* the problem at hand, it's easy. If you want a general solution, it's hard...
--Guido On 5/11/06, Terry Jones <[EMAIL PROTECTED]> wrote: > I'm about to write some code to manage sets, and wanted to float a few > thoughts here because I have various ideas about how to implement what I > want to do, and I think one of them could be done by changing Python's set > type in useful and backward compatible way. > > Apologies if this is discussed in the archives. I didn't see it. > > My original motivation is a need to work efficiently with very large sets, > let's say 10M elements. In some cases, you can operate on a set without > needing to explicitly represent all its elements. For example, if my set is > the Universal (U) set, then X in U, should return True. If my set is U and > I union it with any other set, it is unchanged, etc. By also maintaining a > finite set of things that are _not_ in my large set, I can do other > operations efficiently. E.g., calling U.remove(X) would _add_ X to the set > of elements that were known not to be in the big set that you were trying > to represent efficiently. In most cases, such a set class would simply do > the opposite/inverse operation on its finite exclusion set. > > While it's convenient to warm up by talk about the infinite Universal set, > we could just as easily be talking about arbitrarily large finite sets. > There are some examples below. > > Another way to discuss the same thing is to talk about set complements. It > would be nice to be able to complement (or invert) a set. Once you did so, > you might have an infinite set on your hands, but as the last paragraph > argues, you can still operate on an infinite set efficiently. Naturally, > you can't fully enumerate it, or take its size. > > I think that giving Python sets the ability to handle complementarity would > have some nice uses - and that these go well beyond the particular thing > that I need right now. > > For example, suppose you want to represent the set of all floating point > numbers. You just create an empty set and complement it. Then you can test > it as usual, and begin to exclude things from it. Or if you have a system > with 10M objects and you need to support operations on these objects via a > query language that lets the user type "not X" where X is a set of objects, > there is a natural and efficient way (constant time) to execute the query - > you just mark the set as being inverted (complemented). > > If I were going to implement the above by changing Python's built in set > type, here's what I think I'd do: > > Add four optional keyword arguments to the set constructor: > > - invert=False > - universalSet=None > - universalSetSize=None > - universeExclusionFunc=None > > invert is just a flag, indicating the sense of the set. > > If inverse is False: > > the set behaves exactly as a normal Python set and the other three > new arguments are ignored. > > If inverse is True: > > universalSet represents the universal set. If it is None, then the > universal set is infinite. Otherwise, universalSet is an iterable. The > implementation should not call this iterable unless it's unavoidable, on > the presumption that if the programmer wanted to operate directly on the > contents of this iterable, they'd be doing it in a conventional fashion > (e.g., with a normal set). The universalSetSize, used for len() > calculations, is the number of elements in universalSet, if known & > if finite. > > The universeExclusionFunc can be called with a single element argument to > see if the element should be considered excluded from the universalSet. > This may seem like a weird idea, but it's the key to flexibility and > efficiency. In an invert=True set, the code would use the normal set > content as a mutable set of objects that are not in the universe, as > described above, but would, in addition, use the universeExclusionFunc > (if defined) to identify elements not in the set (because they are not in > the universe), and thus avoid the use of the expensive (or infinite) > universalSet. > > Note that an inverted (or normal) set can be inverted by simply setting > invert=False, so this requires a new invert() method (which could be > triggered by the use of something like 'not' or '!' or '~'). In this case, > an inverted set becomes a normal Python set. The elements represented by > universeExclusionFunc remain invalid - they are simply not in the universe > (and, if deemed sensible, this could be sanity checked in add()). > > If it's not clear, when a set is inverted, any iterable given to __init__, > (i.e., the iterable argument in the normal case of constructing a Python > set), is just treated as usual (but in this case will be taken to > represent things not in the set initially). > > > Here are some examples of usage: > > 1. Suppose you want to work with the set of integers [0, 100000000) and > that initially your set is all such integers. You could create this set > via: > > S = set(invert=True, > universalSet=xrange(100000000), > universalSetSize=100000000, > universeExclusionFunc=(lambda x: x >= 100000000) > > This has the intended effect, is efficient, and no-one need call the > iterator. You can (I think) do everything you could do with a normal set > (including inverting it). If the user happens to want to list all the > elements in the set, that's their business and they can do it, except in > the case where universalSet=None, in which case I would raise an > exception. > > 2. Suppose you wanted to represent the set of all floating point numbers > > 0.0 that were not integers. You'd do this via: > > def excludeFunc(f): > try: > f = float(f) > except ValueError: > return True > if f <= 0.0: > return True > if f - math.floor(f) == 0.0: > return True > return False > > S = set(invert=True, universeExclusionFunc=excludeFunc) > > 3. Let's say you wanted to work with the set of all IP addresses, as > represented by lists of 4 integers, but that you want to exclude > 127.0.0.1, and all the reserved RFC1918 address blocks > > def excludeFunc(ip): > # Insert code to check type and length. > if (ip == [127, 0, 0, 1]): > return True > if (ip[0] == 10 or > (172, 16, 0, 0) <= ip <= (172, 31, 255, 255) or > (191, 168, 0, 0) <= ip <= (192, 168, 255, 255)): > return True > # Insert code to check all bytes in [0..255], return True if not. > return False > > S = set(invert=True, universeExclusionFunc=excludeFunc) > > It's pretty easy to think of other examples of wanting to efficiently use > big sets, or to use small sets and then needing to complement them and > operate efficiently on their large complements. > > One reaction to these examples might be that it's no big deal to write your > own code to, say, manage sets of IP addresses. I think that's not so easy > to do efficiently if the set is massive (and you'd end up keeping track of > the complement or using offline storage). Doing it as suggested above makes > it simple, and once you call the constructor, you just do normal Python set > operations. > > This suggestion goes a little way towards giving "continuous" sets. One way > to categorize set support is: > > A. Only support finite sets. > > B. Provide category A but also support infinite sets where at least > one of the set and its complement is finite. > > C. Provide category B but also allow for the case where a set and its > complement may both be infinite. > > Thus I am proposing something that would take set support in Python from > category A to category B. Implementing B efficiently is provably possible > since all operations are done on either a set or its complement, whichever > is known to be finite, and Python already handles those operations (via its > category A support). > > > The downsides that I see at the moment are: > > 1. Two new exceptions that can arise. These only happen on sets that are in > the inverted state, and only when the universal set is infinite: > > - You try to enumerate an infinite set. > - You call len() on an infinite set. > > 2. Operations on regular sets get slightly slower because virtually all set > operations would need to test the invert flag. So that's an extra > Boolean test on every set operation. > > I don't think the implementation is too hard - the Python sets.py code is > very clean - though there may be gotchas. > > Anyway, I wanted to see what people think of this before I go ahead and > solve my actual problem. If the above is considered potentially worthwhile > to Python, I can work on adapting sets.py. If not, I'll do the faster and > simpler thing and write my own class. > > Regards, > Terry > _______________________________________________ > Python-Dev mailing list > Python-Dev@python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com