Tim Peters <t...@python.org> added the comment:

I was surprised that

https://bugs.python.org/issue44376 

managed to get i**2 to within a factor of 2 of i*i's speed. The overheads of 
running long_pow() at all are high! Don't overlook that the initialization of 
stack variables at the start, like

    PyLongObject *z = NULL;  /* accumulated result */

isn't free: code has to be generated to force zeroes into those variables. The 
initialization of `table[]` alone requires code to fill 256 bytes of memory with 
zeroes (down to 128 on the current main branch). Nothing is free.
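
To make that concrete, here's a minimal sketch (not the actual long_pow() 
source) of the up-front cost. The 16-pointer table matches the 128-byte 
figure above, assuming a 64-bit build:

    #include <Python.h>

    /* Illustrative only: zeroing these locals isn't free.  On a
       64-bit build, 16 pointers occupy 128 bytes, so the "= {0}"
       initializer typically compiles to a memset-style fill that
       runs before any real work begins. */
    static void
    init_cost_sketch(void)
    {
        PyObject *z = NULL;          /* one zero store */
        PyObject *table[16] = {0};   /* ~128 bytes of zero stores */
        (void)z;
        (void)table;
    }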

We can't sanely move the `table` initialization expense into the "giant k-ary 
window exponentiation" block either, because every bigint operation can fail 
("out of memory"), and the macros for doing the common ones (MULT and REDUCE) 
can do "goto Error;". That common exit code has no way to know what is or 
isn't initialized, and we can't let it see uninitialized stack trash.
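
Here's a rough sketch of that pattern; the shape is assumed, not the exact 
macros in Objects/longobject.c (PyNumber_Multiply stands in for the internal 
bigint multiply). Any step can jump to the shared exit, so the exit's cleanup 
has to be valid no matter how little completed:

    #include <Python.h>

    /* Stand-in for the real MULT macro: any bigint operation can
       fail (e.g. out of memory) and bail to the common exit. */
    #define MULT(X, Y, result)                      \
        do {                                        \
            PyObject *_t = PyNumber_Multiply(X, Y); \
            if (_t == NULL)                         \
                goto Error;                         \
            Py_XSETREF(result, _t);                 \
        } while (0)

    static PyObject *
    window_sketch(PyObject *a)
    {
        PyObject *z = NULL;
        PyObject *table[16] = {0};   /* must be zeroed up front */
        size_t i;

        MULT(a, a, table[0]);        /* may goto Error here ...     */
        MULT(table[0], a, z);        /* ... or here, leaving most of
                                        table[] never filled in     */
        for (i = 0; i < 16; ++i)
            Py_XDECREF(table[i]);
        return z;

    Error:
        /* Runs however far we got; safe only because every pointer
           was forced to NULL before the first fallible step. */
        for (i = 0; i < 16; ++i)
            Py_XDECREF(table[i]);
        Py_XDECREF(z);
        return NULL;
    }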

The exit code in turn has a string of things like

    Py_DECREF(a);
    Py_DECREF(b);
    Py_XDECREF(c);

and those cost cycles too, including tests and branches.
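
The X variants are essentially just a NULL guard around the plain versions, 
which is exactly where the extra test and branch come from. A simplified 
rendering (MY_XDECREF is a stand-in name, and the real macro also has to 
evaluate its argument only once):

    #include <Python.h>

    /* Essentially what Py_XDECREF adds over Py_DECREF: a NULL test
       and branch on every use. */
    #define MY_XDECREF(op)      \
        do {                    \
            if ((op) != NULL)   \
                Py_DECREF(op);  \
        } while (0)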

So the real "outrage" to me is why x*x took 17.6 nsec for x == 10 in the 
original report. That's many times longer than the HW takes to do the actual 
multiply. Whether it's spelled x*x or x**2, we're overwhelmingly timing 
overheads. `pow()` has many of those because it's a kind of Swiss army knife 
doing all sorts of things; what's `x*x`'s excuse? ;-)

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue46020>
_______________________________________