On Sat, May 2, 2020 at 6:09 PM Steven D'Aprano <st...@pearwood.info> wrote:
> On Sat, May 02, 2020 at 04:58:43PM +0200, Alex Hall wrote:
> > I didn't think carefully about this implementation and thought that
> > there was only a performance cost in the error case. That's obviously
> > not true - there's an `if` statement executed in Python for every item
> > in every iterable.
>
> Sorry, this does not demonstrate that the performance cost is
> significant.
>
> This adds one "if" per loop, terminating on (one more than) the shortest
> input. So O(N) on the length of the input. That's usually considered
> reasonable, provided the per item cost is low.
>
> The test in the "if" is technically O(N) on the number of input
> iterators, but since that's usually two, and rarely more than a handful,
> it's close enough to a fixed cost.
>
> On my old and slow PC `sentinel in combo` is quite fast:

`sentinel in combo` is problematic if some values have overridden `__eq__`. I referred to this problem in a previous email to you, saying that people had copied this buggy implementation from SO and that it still hadn't been fixed after being pointed out. The fact that you missed this helps to prove my point: getting this right is hard. Fortunately, more_itertools avoids this bug by not using `in`, which you seem not to have noticed even though I copied its implementation in the email you're responding to.

> Without actual measurements, this is a classic example of premature
> micro-optimization.
>
> Let's see some real benchmarks proving that a Python version is
> too slow in real-life code first.

Here is a comparison of the current zip with more-itertools' zip_equal:

```
import timeit
from collections import deque
from itertools import zip_longest

_marker = object()


class UnequalIterablesError(Exception):
    pass


def zip_equal(*iterables):
    """``zip`` the input *iterables* together, but throw an
    ``UnequalIterablesError`` if any of the *iterables* terminate before
    the others.
    """
    for combo in zip_longest(*iterables, fillvalue=_marker):
        for val in combo:
            if val is _marker:
                raise UnequalIterablesError(
                    "Iterables have different lengths."
                )
        yield combo


x1 = list(range(1000))
x2 = list(range(1000, 2000))


def my_timeit(stmt):
    print(timeit.repeat(stmt, globals=globals(), number=10000, repeat=3))


def consume(iterator):
    deque(iterator, maxlen=0)


my_timeit("consume(zip(x1, x2))")
my_timeit("consume(zip_equal(x1, x2))")
```

Output:

[0.15032896999999995, 0.146724568, 0.14543148299999997]
[2.039809026, 2.060877259, 2.0211361649999997]

So the Python version is about 13 times slower, and 10 million iterations (quite plausible) adds about 2 seconds. That's not disastrous, but I think it's significant enough that someone working with large amounts of data and concerned about performance might choose to risk accidentally malformed input.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/FAHQYXQ4S4KMJJFJRTNYHKVH2QWQS3OY/
Code of Conduct: http://python.org/psf/codeofconduct/