Josh Rosenberg <[email protected]> added the comment:
@Nikita: Your examples aren't unexpected at all, per the documented behavior:
>Several mathematical operations are provided for combining Counter objects to
>produce multisets (counters that have counts greater than zero). Addition and
>subtraction combine counters by adding or subtracting the counts of
>corresponding elements. Intersection and union return the minimum and maximum
>of corresponding counts. Each operation can accept inputs with signed counts,
>but the output will exclude results with counts of zero or less.
That's pretty explicit; "Each operation can accept inputs with signed counts"
means the inputs can have arbitrary counts (negative, zero or positive), but
"the output will exclude results with counts of zero or less" means the output
will *only* contain positive counts.
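To make that documented contract concrete, here is what each of the four multiset operators does with signed inputs (counts chosen purely for illustration):

```python
from collections import Counter

# Signed inputs are accepted, but any result with a count <= 0 is stripped:
a = Counter(x=3, y=-2, z=0)
b = Counter(x=1, y=5)

assert a + b == Counter(x=4, y=3)    # y: -2 + 5 = 3 survives; z: 0 is dropped
assert a - b == Counter(x=2)         # y: -2 - 5 = -7 is dropped
assert a & b == Counter(x=1)         # min of counts; min(-2, 5) = -2 dropped
assert a | b == Counter(x=3, y=5)    # max of counts; max(0, 0) = 0 dropped
```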
This is reiterated in the design "Note" at the bottom:
>* The multiset methods are designed only for use cases with positive values.
>The inputs may be negative or zero, but only outputs with positive values are
>created.
while also noting (reiterating in the case of subtract, whose method docs
already specify this) that the update and subtract methods are appropriate for
use when both inputs and outputs must preserve negative or zero values.
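By contrast, update() and subtract() do no stripping at all, which is exactly why the docs point to them for signed use cases:

```python
from collections import Counter

c = Counter(a=1, b=-2)
c.subtract(Counter(a=3, b=1))    # subtract() keeps every result, sign and all
assert c == Counter(a=-2, b=-3)

c.update({'a': 2})               # update() preserves zero counts too
assert c == Counter(a=0, b=-3)
```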
So both of your examples are "expected" per the documentation.
"Counter(a=-1) + Counter(a=-2) produces empty Counter()" is 100% required by
the documentation.
And, again per the documentation, _keep_positive is necessary, even for the
in-place ops, because there is no expectation that the inputs are positive.
So your original motivating case could be spelled:
a = Counter(...)  # e.g. 1M elements
b = Counter(...)  # e.g. 10 elements
# or just b = {...} or b = some_iterable_of_values_to_count, since update()
# works just fine with plain dicts and iterables to count, and making a
# second Counter is unnecessary overhead
a.update(b)
and it would work (and since _keep_positive isn't involved, it would run
faster). Yes, if you want to eliminate non-positive counts you'll eventually
need to do:
a += Counter()
(or, abusing implementation details, a._keep_positive()), but that's largely
unavoidable; the only way to honor the documented guarantees without a
per-operation _keep_positive pass would be to slow down many operations (and
dramatically increase memory requirements) by separately tracking the
non-positive values as they're created, just in case a math operation needs
to remove them.
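Concretely, with the sizes shrunk down to something runnable (the counts here are just illustrative):

```python
from collections import Counter

a = Counter(x=5, y=1)      # stand-in for the large counter
b = {'y': -1, 'z': -4}     # a plain dict works fine as update() input
a.update(b)                # no _keep_positive pass happens here
assert a == Counter({'x': 5, 'y': 0, 'z': -4})

a += Counter()             # one in-place op strips the non-positive counts
assert a == Counter(x=5)
```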
And when I say "slow down", I don't mean by a little. Right now, Counter
doesn't overload __setitem__ (it just inherits dict's implementation). So when
you do:
c = Counter()
c['a'] += 1
c['a'] += 1
that calls dict.__getitem__ twice (the first of which ends up calling
Counter.__missing__ because the key 'a' doesn't exist), and dict.__setitem__
twice (to store back 1, then 2).
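That call sequence, and the __missing__ hook that makes it work on absent keys, can be checked directly:

```python
from collections import Counter

c = Counter()
assert Counter.__setitem__ is dict.__setitem__   # no override, pure inheritance
assert c['a'] == 0        # __missing__ supplies 0; the key is NOT stored
assert 'a' not in c
c['a'] += 1               # __getitem__ (via __missing__), then dict.__setitem__
c['a'] += 1               # __getitem__ finds 1, dict.__setitem__ stores 2
assert c['a'] == 2
```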
To make Counter track non-positive values live, you'd need to reimplement
__setitem__ as:
def __setitem__(self, elem, count):
    if count <= 0:
        self._nonpositive.add(elem)
    else:
        self._nonpositive.discard(elem)
    return super().__setitem__(elem, count)
which would add significant overhead, because:
1. Any movement from the C layer back to the Python layer is a massive
performance hit in general
2. It wouldn't just slow down __setitem__; it would also slow down the
_count_elements accelerator for counting iterables (the thing that finally made
Counter faster than defaultdict(int) for its intended use); the accelerator
only uses the fast path when the mapping in question is a dict subclass and
inherited dict's get and __setitem__ methods unmodified. If either one is
overridden, it falls back to the slow path, which misses out on several major
improvements available only with the concrete dict API.
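The fallback is observable from pure Python. In this sketch the `sets` attribute is just a demo counter of my own, not anything from collections; the point is that once __setitem__ is overridden, counting an iterable stores each element via a Python-level call:

```python
from collections import Counter

class LoggingCounter(Counter):
    # Overriding __setitem__ forces _count_elements off its fast path,
    # so every counted element triggers a Python-level store like this:
    def __setitem__(self, elem, count):
        self.sets = getattr(self, 'sets', 0) + 1
        super().__setitem__(elem, count)

c = LoggingCounter('aab')    # counting an iterable goes through _count_elements
assert c == Counter(a=2, b=1)
assert c.sets == 3           # one Python-level store per element counted
```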
On top of the performance hit, it wouldn't be remotely thread-safe: Counter
(like most Python-level types) isn't fully thread-safe to start with, but it
generally breaks in predictable ways (e.g. doing c[key] += 1 in two threads
simultaneously might only increment once, not twice, but at least the logical
outcome of *one* of the operations occurred, not something wholly unrelated).
The race condition here would break code long after the actual race occurred:
two threads could store counts for the same key, one positive, one negative,
and if the operations interleave badly, the negative count could end up stored
in the dict but discarded from _nonpositive (or vice versa, which is less
problematic for correctness). Later, when something should be eliminating
non-positive values, it doesn't know the negative count is even there, so it
survives indefinitely (unless something else sets that key's count again
anyway).
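Setting threads aside, here is a minimal single-threaded sketch of what that tracking design might look like. To be clear, `TrackingCounter` and its `_nonpositive` set are hypothetical, not anything in collections:

```python
from collections import Counter

class TrackingCounter(Counter):
    # Hypothetical single-threaded sketch of live non-positive tracking.
    # It ignores __delitem__, copying, and the dict.update() fast paths
    # that bypass __setitem__ -- part of why this design is fiddly.
    def __init__(self):
        self._nonpositive = set()    # must exist before any store happens
        super().__init__()

    def __setitem__(self, elem, count):
        if count <= 0:
            self._nonpositive.add(elem)
        else:
            self._nonpositive.discard(elem)
        super().__setitem__(elem, count)

    def _keep_positive(self):
        # The payoff: O(len(_nonpositive)) instead of scanning every key.
        for elem in self._nonpositive:
            dict.__delitem__(self, elem)
        self._nonpositive.clear()
        return self

c = TrackingCounter()
c['a'] = 3      # item stores go through the overridden __setitem__
c['b'] = -1
c['d'] = 0
assert c._nonpositive == {'b', 'd'}
c._keep_positive()
assert c == Counter(a=3)
```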
For the record, if you want to make a Counter without this behavior, it's
actually pretty easy:
class SignedCounter(Counter):
    def _keep_positive(self):
        return self   # must return self: the in-place ops return self._keep_positive()
but the interlocking documented requirements on mathematical ops and the
methods mean you definitely can't change Counter itself.
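For instance (redefining the subclass so the snippet runs standalone; note that _keep_positive must return self, because the in-place ops do `return self._keep_positive()`, and a bare `pass` would rebind the augmented-assignment target to None):

```python
from collections import Counter

class SignedCounter(Counter):
    def _keep_positive(self):
        return self      # in-place ops return self._keep_positive()

a = SignedCounter(a=-1)
a += SignedCounter(a=-2)
assert a == SignedCounter(a=-3)   # negatives survive the in-place op

# Caveat: at least in current CPython, the binary ops filter non-positive
# results inline (and build a plain Counter) rather than calling
# _keep_positive, so they still clamp:
assert SignedCounter(a=-1) + SignedCounter(a=-2) == Counter()
```

So the override only liberates the in-place operators, which is what the original motivating case used anyway.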
----------
nosy: +josh.r
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue36380>
_______________________________________