Josh Rosenberg <[email protected]> added the comment:
@Nikita: Your examples aren't unexpected at all, per the documented behavior:
>Several mathematical operations are provided for combining Counter objects to
>produce multisets (counters that have counts greater than zero). Addition and
>subtraction combine counters by adding or subtracting the counts of
>corresponding elements. Intersection and union return the minimum and maximum
>of corresponding counts. Each operation can accept inputs with signed counts,
>but the output will exclude results with counts of zero or less.
That's pretty explicit; "Each operation can accept inputs with signed counts"
means the inputs can have arbitrary counts (negative, zero or positive), but
"the output will exclude results with counts of zero or less" means the output
will *only* contain positive counts.
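To make that documented contract concrete, here is what each of the four multiset operators does with signed inputs (counts chosen purely for illustration):

```python
from collections import Counter

# Signed inputs are accepted, but any result with a count <= 0 is stripped:
a = Counter(x=3, y=-2, z=0)
b = Counter(x=1, y=5)

assert a + b == Counter(x=4, y=3)    # y: -2 + 5 = 3 survives; z: 0 is dropped
assert a - b == Counter(x=2)         # y: -2 - 5 = -7 is dropped
assert a & b == Counter(x=1)         # min of counts; min(-2, 5) = -2 dropped
assert a | b == Counter(x=3, y=5)    # max of counts; max(0, 0) = 0 dropped
```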
This is reiterated in the design "Note" at the bottom:
>* The multiset methods are designed only for use cases with positive values.
>The inputs may be negative or zero, but only outputs with positive values are
>created.
while also noting (reiterating in the case of subtract, whose method docs
already specify this) that the update and subtract methods are appropriate for
use when both inputs and outputs must preserve negative or zero values.
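By contrast, update() and subtract() do no stripping at all, which is exactly why the docs point to them for signed use cases:

```python
from collections import Counter

c = Counter(a=1, b=-2)
c.subtract(Counter(a=3, b=1))    # subtract() keeps every result, sign and all
assert c == Counter(a=-2, b=-3)

c.update({'a': 2})               # update() preserves zero counts too
assert c == Counter(a=0, b=-3)
```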
So both of your examples are "expected" per the documentation.
"Counter(a=-1) + Counter(a=-2) produces empty Counter()" is 100% required by
the documentation.
And, again per the documentation, _keep_positive is necessary, even for the
in-place ops, because there is no expectation that the inputs are positive.
So your original motivating case could be spelled:
a = Counter(...)  # e.g. 1M elements
b = Counter(...)  # e.g. 10 elements
# or just b = {...} or b = some_iterable_of_values_to_count, since update()
# works just fine with plain dicts and iterables to count, and making a
# second Counter is unnecessary overhead
a.update(b)
and it would work (and since _keep_positive isn't involved, it would run
faster). Yes, if you want to eliminate non-positive counts you'll eventually
need to do:
a += Counter()
(or, abusing implementation details, a._keep_positive()), but that's largely
unavoidable; the only way to honor the documented guarantees without a
per-operation _keep_positive pass would be to slow down many operations (and
dramatically increase memory requirements) by separately tracking the
non-positive values as they're created, just in case a math operation needs
to remove them.
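Concretely, with the sizes shrunk down to something runnable (the counts here are just illustrative):

```python
from collections import Counter

a = Counter(x=5, y=1)      # stand-in for the large counter
b = {'y': -1, 'z': -4}     # a plain dict works fine as update() input
a.update(b)                # no _keep_positive pass happens here
assert a == Counter({'x': 5, 'y': 0, 'z': -4})

a += Counter()             # one in-place op strips the non-positive counts
assert a == Counter(x=5)
```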
And when I say "slow down", I don't mean by a little. Right now, Counter
doesn't overload __setitem__ (it just inherits dict's implementation). So when
you do:
c = Counter()
c['a'] += 1
c['a'] += 1
that calls dict.__getitem__ twice (the first of which ends up calling
Counter.__missing__ because the key 'a' doesn't exist), and dict.__setitem__
twice (to store back 1, then 2).
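That call sequence, and the __missing__ hook that makes it work on absent keys, can be checked directly:

```python
from collections import Counter

c = Counter()
assert Counter.__setitem__ is dict.__setitem__   # no override, pure inheritance
assert c['a'] == 0        # __missing__ supplies 0; the key is NOT stored
assert 'a' not in c
c['a'] += 1               # __getitem__ (via __missing__), then dict.__setitem__
c['a'] += 1               # __getitem__ finds 1, dict.__setitem__ stores 2
assert c['a'] == 2
```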
To make Counter track non-positive values live, you'd need to reimplement
__setitem__ as:
def __setitem__(self, elem, count):
    if count <= 0:
        self._nonpositive.add(elem)
    else:
        self._nonpositive.discard(elem)
    return super().__setitem__(elem, count)
which would add significant overhead, because:
1. Any movement from the C layer back to the Python layer is a massive
performance hit in general
2. It wouldn't just slow down __setitem__; it would also slow down the
_count_elements accelerator for counting iterables (the thing that finally made
Counter faster than defaultdict(int) for its intended use); the accelerator
only uses the fast path when the mapping in question is a dict subclass and
inherited dict's get and __setitem__ methods unmodified. If either one is
overridden, it falls back to the slow path, which misses out on several major
improvements available only with the concrete dict API.
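The fallback is observable from pure Python. In this sketch the `sets` attribute is just a demo counter of my own, not anything from collections; the point is that once __setitem__ is overridden, counting an iterable stores each element via a Python-level call:

```python
from collections import Counter

class LoggingCounter(Counter):
    # Overriding __setitem__ forces _count_elements off its fast path,
    # so every counted element triggers a Python-level store like this:
    def __setitem__(self, elem, count):
        self.sets = getattr(self, 'sets', 0) + 1
        super().__setitem__(elem, count)

c = LoggingCounter('aab')    # counting an iterable goes through _count_elements
assert c == Counter(a=2, b=1)
assert c.sets == 3           # one Python-level store per element counted
```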
On top of the performance hit, it wouldn't be remotely thread-safe: Counter
(like most Python-level types) isn't fully thread-safe to start with, but it
generally breaks in predictable ways (e.g. doing c[key] += 1 in two threads
simultaneously might only increment once, not twice, but at least the logical
outcome of *one* of the operations occurred, not something wholly unrelated).
The race condition here would break code long after the actual race occurred:
two threads could store counts for the same key, one positive, one negative,
and if the operations interleave badly, the negative count could end up stored
in the dict but discarded from _nonpositive (or vice versa, which is less
problematic for correctness). Later, when something should be eliminating
non-positive values, it doesn't know the negative count is even there, so it
survives indefinitely (unless something else sets that key's count again
anyway).
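Setting threads aside, here is a minimal single-threaded sketch of what that tracking design might look like. To be clear, `TrackingCounter` and its `_nonpositive` set are hypothetical, not anything in collections:

```python
from collections import Counter

class TrackingCounter(Counter):
    # Hypothetical single-threaded sketch of live non-positive tracking.
    # It ignores __delitem__, copying, and the dict.update() fast paths
    # that bypass __setitem__ -- part of why this design is fiddly.
    def __init__(self):
        self._nonpositive = set()    # must exist before any store happens
        super().__init__()

    def __setitem__(self, elem, count):
        if count <= 0:
            self._nonpositive.add(elem)
        else:
            self._nonpositive.discard(elem)
        super().__setitem__(elem, count)

    def _keep_positive(self):
        # The payoff: O(len(_nonpositive)) instead of scanning every key.
        for elem in self._nonpositive:
            dict.__delitem__(self, elem)
        self._nonpositive.clear()
        return self

c = TrackingCounter()
c['a'] = 3      # item stores go through the overridden __setitem__
c['b'] = -1
c['d'] = 0
assert c._nonpositive == {'b', 'd'}
c._keep_positive()
assert c == Counter(a=3)
```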
For the record, if you want to make a Counter without this behavior, it's
actually pretty easy:
class SignedCounter(Counter):
    def _keep_positive(self):
        return self   # must return self: the in-place ops return self._keep_positive()
but the interlocking documented requirements on mathematical ops and the
methods mean you definitely can't change Counter itself.
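For instance (redefining the subclass so the snippet runs standalone; note that _keep_positive must return self, because the in-place ops do `return self._keep_positive()`, and a bare `pass` would rebind the augmented-assignment target to None):

```python
from collections import Counter

class SignedCounter(Counter):
    def _keep_positive(self):
        return self      # in-place ops return self._keep_positive()

a = SignedCounter(a=-1)
a += SignedCounter(a=-2)
assert a == SignedCounter(a=-3)   # negatives survive the in-place op

# Caveat: at least in current CPython, the binary ops filter non-positive
# results inline (and build a plain Counter) rather than calling
# _keep_positive, so they still clamp:
assert SignedCounter(a=-1) + SignedCounter(a=-2) == Counter()
```

So the override only liberates the in-place operators, which is what the original motivating case used anyway.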
----------
nosy: +josh.r
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue36380>
_______________________________________