Tim, thank you for the reference.

(if anyone has tips for quoting emails that were sent before I joined the
mailing list, they would be appreciated; I will try to include the relevant
details in mine)


I will try to address some of the points from that discussion, explain why
they aren't satisfying to me, and argue why Python should (still) strive for
reliably ordered sets.

[Tim Peters <tim.pet...@gmail.com>]
> "The problem" is that deletions leave _two_ kinds of holes: one in the
hash table, and another in the contiguous vector.
> The latter holes cannot be filled with subsequent new hash+key+value
records because that would break insertion order.
...
> Algorithms doing a lot of mixing of adds and deletes seem a lot more
common for sets than for dicts, and so the ordered dict's basic
implementation _approach_ is a lot less suitable for sets.
...
> Note: the holes left by deletions _wouldn't_ be "a problem" _except_ for
maintaining insertion order.
> If we were only after the memory savings, then on deletion "the last"
record in the contiguous array could be moved into the hole at once,
leaving the array hole-free again. But that also changes the order. IIRC,
that's what the original "compact dict" _did_ do.

When a large destructive operation runs (e.g. computing the difference of
two sets), a `set_fill_in_holes()` pass could be triggered whenever
`num_entries_including_holes > real_num_entries * FILL_IN_FAC`, rebuilding
the entry array and removing the holes.

Naively, that pass takes O(N) time, but if it is done alongside an operation
that is already O(N) (e.g. the difference of two sets of similar
cardinality), its cost should quickly disappear (especially since no
'expensive' operations such as recomputing hashes are required) and should
not affect performance. Specifically, I would bet it has a small constant
factor: it is a single O(N) loop with perhaps O(N) scratch space, just
shuffling bytes and rewriting indices.
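
To make that concrete, here is a rough Python model of the pass I have in
mind (my sketch, not CPython code; `entries` is the insertion-ordered record
array with None marking holes, and `indices` is the sparse hash table whose
slots hold positions into `entries`, negative values meaning empty/deleted):

    def fill_in_holes(entries, indices):
        """Pack out deletion holes in O(len(entries)) time, without rehashing."""
        remap = {}                         # old position -> new position
        packed = []
        for old_pos, record in enumerate(entries):
            if record is not None:         # skip holes left by deletions
                remap[old_pos] = len(packed)
                packed.append(record)
        # Rewrite the small index table in place; hashes are never recomputed.
        for slot, old_pos in enumerate(indices):
            if old_pos >= 0:               # negative values mark empty/deleted slots
                indices[slot] = remap[old_pos]
        return packed, indices

The same remapping works whether this runs as its own pass or is folded into
a difference/intersection loop that already touches every record.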

The problem may come when user code performs repeated additions and removals
of individual elements (e.g. the user writes `for elem in B: A.remove(elem)`
instead of `A -= set(B)`). That would make the occasional single deletion
O(N), which may be unacceptable; however, if the pass is triggered at a
fixed ratio (say, once the number of entries including holes reaches twice
the number of real entries), it only runs every ~N deletions, which
amortizes the operation to O(1 + N/N) = O(1).
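
As a sanity check on that amortization claim, here is a tiny standalone
simulation (again just my sketch, with FILL_IN_FAC = 2): each single
deletion costs 1 unit, and a compaction touching every record fires only
when the holes-included count exceeds twice the live count. The work per
deletion settles around a small constant (~3) regardless of N:

    FILL_IN_FAC = 2

    def work_for_n_deletions(n_items):
        live = n_items        # real entries
        used = n_items        # entries including holes
        total_work = 0
        while live > 0:
            live -= 1                        # one O(1) deletion leaves a hole
            total_work += 1
            if used > live * FILL_IN_FAC and live > 0:
                total_work += used           # compaction pass over every slot
                used = live                  # holes are gone
        return total_work

    for n in (10, 1_000, 100_000):
        print(n, work_for_n_deletions(n) / n)   # work per deletion stays ~constant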

> {1, 2} | {3, 4, 5, 6, 7}
> [what should the above] return as ordered sets? Beats me.
> The imbalance between the operands' cardinalities can be arbitrarily
> large, and "first make a copy of the larger one, then loop over the smaller
> one" is the obvious way to implement union to minimize the number of
> lookups needed.

I must agree that the 'obvious' result (given that we are speaking of
ordered-by-insertion sets) is `{1, 2, 3, 4, 5, 6, 7}`, since the elements
appear in left-to-right order on the screen (and Python is specified as
left-to-right-centric, i.e. lines start on the left, etc.).
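
For what it's worth, that is exactly the behaviour you get today if you
emulate an ordered set with an insertion-ordered dict via `dict.fromkeys`
(the `|` operator on dicts needs Python 3.9+):

    >>> a = dict.fromkeys([1, 2])
    >>> b = dict.fromkeys([3, 4, 5, 6, 7])
    >>> list(a | b)     # keeps first-seen (insertion) order
    [1, 2, 3, 4, 5, 6, 7]
    >>> list(b | a)
    [3, 4, 5, 6, 7, 1, 2]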

However, you make good points about performance concerns; big-O notation is
not what users (and indeed benchmarks, applications, etc.) care about, but
rather `end_time - start_time`. As long as union stays O(N) and membership
testing, insertion, and removal stay O(1), Python implementations will work
fine and standard algorithms won't take preposterous amounts of time
(whereas if testing magically became O(N), many algorithms would degrade to
O(N^2) or worse, making large problems intractable). Aside from such
travesties, which should obviously be avoided, Python must weigh
constant-factor performance (i.e. being able to use optimizations like those
described in your post) against reliable semantics (the well-defined result
and iteration order of ordered-by-insertion sets).

This, however, is ultimately an implementation detail, and if other
semantics for 'set' make more sense and avoid ambiguity (such as order), I
would have to side with removing the ambiguity at the cost of, say, 10% of
the speed. I know I'm not in a position to make that choice, though, so I
think some tests could be useful to show just how different the performance
really is.
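
For instance, something along these lines is the kind of micro-benchmark I
mean (a rough sketch using today's built-in set as the baseline; the
interesting run would be the same script against a patched,
ordered-by-insertion set to see the constant-factor difference):

    import random
    import timeit

    N = 100_000
    base = list(range(N))
    random.shuffle(base)
    to_remove = base[: N // 2]

    def bulk_difference():
        a = set(base)
        a -= set(to_remove)              # one big destructive operation

    def one_by_one():
        a = set(base)
        for x in to_remove:              # many single-element deletions
            a.remove(x)

    for fn in (bulk_difference, one_by_one):
        print(fn.__name__, timeit.timeit(fn, number=20))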

I just thought of another pro of ordered-by-insertion sets. Consider the
common idiom to order by value, `sorted(myset)`: if no order is enforced,
the iteration order may well be arbitrary (in the general case; I know
integers, for example, sort the same as their hashes and so will likely come
out in some order). If Python enforced ordered-by-insertion sets, special
attention could be paid to generating the elements of `myset` in sorted or
near-sorted order, which may lead to asymptotically better, or at least
better constant-factor, performance from `sorted()`, especially with
timsort, which performs well if I generate `myset` from runs of sorted
values (but I'm sure you already know that ;)).
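
A quick way to see the effect I mean, using plain lists as a stand-in for
the iteration order a hypothetical ordered set would produce (my sketch;
timsort is linear on the already-ordered input and O(N log N) on the
shuffled one):

    import random
    import timeit

    N = 1_000_000
    insertion_order = list(range(N))                  # insertion order happens to be sorted
    hash_order = random.sample(insertion_order, N)    # arbitrary, hash-like order

    print("near-sorted:", timeit.timeit(lambda: sorted(insertion_order), number=10))
    print("shuffled:   ", timeit.timeit(lambda: sorted(hash_order), number=10))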


[Serhiy Storchaka]
> But ordered set implementation is not more compact than the current set
> implementation (because for dicts the item is 3-word, but for sets it is
> 2-word).
> Also the current set implementation has some optimizations that the dict
> implementation never had, which will be lost in the ordered set
> implementation.

A 'compact dictionary' implementation requires `N * 3 * sizeof(intptr_t)`
bytes (assuming hashes take up the same size as a pointer) for the (hash,
key, val) triples, plus `N_buckets * sizeof(idx_t)` bytes for the index
table, where `idx_t` is the smallest (signed) integral type that can
represent `N` (and -1, -2, ... for some special values), and
`N_buckets = N / load()` for the table's load factor `load()`.

This costs (on most 64-bit systems) `load() * N_buckets * 3 * 8 + 1 *
N_buckets` bytes for dicts of length <= 127, `load() * N_buckets * 3 * 8 +
2 * N_buckets` bytes for dicts of length <= SHRT_MAX, and so on. A more
naive dictionary would simply take `N_buckets * 3 * 8` bytes.

A similar strategy could be used for sets (which would, as with dicts,
preserve the order of the entries array), requiring instead `load() *
N_buckets * 2 * 8 + 1 * N_buckets` bytes (and so on for the other
thresholds), versus a naive `N_buckets * 2 * 8` byte requirement.
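
Plugging numbers into those formulas (a back-of-the-envelope script under my
assumptions above: 8-byte words, a 2/3 load factor, and the smallest signed
index width that fits N; CPython's real growth policy differs in the
details):

    def index_width(n):
        if n <= 127:
            return 1
        if n <= 32767:
            return 2
        if n <= 2**31 - 1:
            return 4
        return 8

    def estimate(n, words_per_entry, load=2/3):
        n_buckets = int(n / load)
        compact = n * words_per_entry * 8 + n_buckets * index_width(n)
        naive = n_buckets * words_per_entry * 8
        return compact, naive

    for n in (100, 10_000, 1_000_000):
        print(n, "dict:", estimate(n, 3), "set:", estimate(n, 2))   # (compact, naive) bytes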

Therefore, I push back on the idea that ordered-by-insertion sets cannot be
more memory efficient -- they can use the same optimizations that
ordered-by-insertion dictionaries currently use (albeit with tighter
margins, since only 2 words are saved per empty bucket instead of 3, as with
dictionaries). If someone can point out why I'm either being completely
ignorant or somehow wrong, I'd love to hear it ;).

Thanks,
~
Cade Brown
Working on MAGMA <https://icl.cs.utk.edu/magma/>
http://cade.site
http://chemicaldevelopment.us


On Mon, Aug 24, 2020 at 5:17 PM Tim Peters <tim.pet...@gmail.com> wrote:

> [Cade Brown <brown.c...@gmail.com>, suggests ordered sets]
>
> FYI, last time this went around was December, with over 100 msgs here:
>
>
> https://mail.python.org/archives/list/python-...@python.org/thread/AEKCBGCKX23S4PMA5YNDHFJRHDL4JMKY/#AEKCBGCKX23S4PMA5YNDHFJRHDL4JMKY
>