On 6/11/21 6:56 pm, Sayed Adel wrote:

> appears to be poorly optimized.

It should perform reasonably well: neither poorly nor heavily optimized.

> this also makes it quite difficult to improve (with either a better compiler or by hand).

We can blame Intel for not sharing their source code, but honestly, it seems we had no option other than to accept what they provide.

> Some of the glaring issues are:
> 1. register allocation / spilling
> 2. rodata layouts / const-propagation of the values.
> 3. Very odd use of internal functions that really ought to be inlined.

Let me add two more points to your list:
- It only works on Linux.
- It only works with AVX512.

> If so, are people open to patches that optimize them (either with new C implementations or in the current assembly implementations).

Hopefully, we will be able to convert them to universal intrinsics (NEP 38) one day. As one of the team, I will try to set aside more time for it.

Thanks, Sayed.


Note the benchmarks on Sayed's PR [0] to move tanh to universal intrinsics. It not only supplies the routines for all platforms that universal intrinsics support, it also slightly increases performance on AVX512 (the usual disclaimers about the dangers of comparing benchmarks apply).


Matti


[0] https://github.com/numpy/numpy/pull/20363
