On 11/2/20 7:16 am, Ralf Gommers wrote:


On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edi...@gmail.com <mailto:einstein.edi...@gmail.com>> wrote:

    —snip—

    > 1) Once NumPy adds the framework and initial set of Universal
    Intrinsics, if contributors want to leverage a new
    architecture-specific SIMD instruction, will they be expected to
    add a software implementation of this instruction for all other
    architectures too?

    In my opinion, if the instructions are lower in the hierarchy, then
    yes. For example, one cannot add AVX-512 without also adding, for
    example, AVX-256, AVX-128 and SSE*. However, I would not expect one
    person or team to be an expert in all instruction sets, so
    intrinsics for one architecture can be developed independently of
    another.


I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework.

This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right?

Yes. Universal intrinsics are cross-platform. However, the NEP is open to the possibility that certain architectures may have SIMD intrinsics that cannot be expressed in terms of intrinsics for other platforms, and so there may be a use case for architecture-specific loops. This is explicitly stated in the latest PR to the NEP: "If the regression is not minimal, we may choose to keep the X86-specific code for that platform and use the universal intrinsic code for other platforms."


    > 2) On whom does the burden lie to ensure that new
    implementations are benchmarked and show benefits on every
    architecture? What happens if optimizing a ufunc improves
    performance on one architecture and worsens it on another?


This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if it doesn't, and the regression isn't easily fixable, the problematic platform could be skipped so performance there is unchanged.


On HEAD, out of the 89 ufuncs in numpy.core.code_generators.generate_umath.defdict, 34 have X86-specific SIMD loops:


>>> [x for x in defdict.keys() if any([td.simd for td in defdict[x].type_descriptions])]
['add', 'subtract', 'multiply', 'conjugate', 'square', 'reciprocal',
 'absolute', 'negative', 'greater', 'greater_equal', 'less', 'less_equal',
 'equal', 'not_equal', 'logical_and', 'logical_not', 'logical_or',
 'maximum', 'minimum', 'bitwise_and', 'bitwise_or', 'bitwise_xor',
 'invert', 'left_shift', 'right_shift', 'cos', 'sin', 'exp', 'log',
 'sqrt', 'ceil', 'trunc', 'floor', 'rint']


They would be the first targets for universal intrinsics. Of these, I estimate that the ones with more than one loop for at least one dtype signature would be the most difficult, since these have different optimizations for avx2, fma, and/or avx512f:


['square', 'reciprocal', 'absolute', 'cos', 'sin', 'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint']


The other 55 ufuncs, for completeness, are


['floor_divide', 'true_divide', 'fmod', '_ones_like', 'power', 'float_power', '_arg', 'positive', 'sign', 'logical_xor', 'clip', 'fmax', 'fmin', 'logaddexp', 'logaddexp2', 'heaviside', 'degrees', 'rad2deg', 'radians', 'deg2rad', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'tan', 'cosh', 'sinh', 'tanh', 'exp2', 'expm1', 'log2', 'log10', 'log1p', 'cbrt', 'fabs', 'arctan2', 'remainder', 'divmod', 'hypot', 'isnan', 'isnat', 'isinf', 'isfinite', 'signbit', 'copysign', 'nextafter', 'spacing', 'modf', 'ldexp', 'frexp', 'gcd', 'lcm', 'matmul']


As for testing accuracy: we recently added a framework for testing ulp (unit in the last place) variation of ufuncs against "golden results" in numpy/core/tests/test_umath_accuracy. So far float32 is tested for exp, log, cos, sin. Others may be tested elsewhere by specific tests, for instance numpy/core/tests/test_half.py has test_half_ufuncs.
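
To give an idea of what that framework checks, here is a minimal sketch of an ulp-style comparison. The input range, the way the "golden" reference is built, and the tolerance below are illustrative only, not copied from the actual tests:

import numpy as np

# float32 inputs over a range where exp does not overflow in float32
x = np.linspace(-87.0, 87.0, 10000, dtype=np.float32)

# "golden" reference: compute in float64, then round to float32
golden = np.exp(x.astype(np.float64)).astype(np.float32)

# require the float32 loop (SIMD or scalar) to stay within a few ulp of
# the reference; maxulp=3 is a made-up example tolerance
np.testing.assert_array_max_ulp(np.exp(x), golden, maxulp=3)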


It is difficult to do benchmarking on CI: the machines that run CI vary too much. We would need to set aside a machine for this and carefully set it up to keep CPU speed and temperature constant. We do have benchmarks for ufuncs (they could always be improved). I think Pauli runs the benchmarks carefully on X86, and may even make the results public, but that resource is not really on PR reviewers' radar. We could run benchmarks on the gcc build farm machines for other architectures. Those machines are shared but not heavily utilized.
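
For reference, the existing ufunc benchmarks in benchmarks/benchmarks/bench_ufunc.py are asv (airspeed velocity) classes. A minimal sketch of what such a benchmark looks like (the class name, dtypes and array size are illustrative, not taken from the actual file):

import numpy as np

class TimeTranscendental:
    # asv runs each time_* method once per parameter value
    params = [np.float32, np.float64]
    param_names = ['dtype']

    def setup(self, dtype):
        self.x = np.linspace(0.0, 2.0 * np.pi, 100000, dtype=dtype)

    def time_sin(self, dtype):
        np.sin(self.x)

    def time_exp(self, dtype):
        np.exp(self.x)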


I'd think it is only once existing universal intrinsics are later tweaked that we will have to be much more careful.



    I would look at this from a maintainability point of view. If we
    are increasing the code size by 20% for a certain ufunc, there
    must be a demonstrable 20% increase in performance on any CPU.
    That is to say, micro-optimisation will be unwelcome, and code
    readability will be preferable. Usually we ask the submitter of
    the PR to test the PR with a machine they have on hand, and I
    would be inclined to keep this trend of self-reporting. Of course,
    if someone else came along and reported a performance regression
    of, say, 10%, then we have increased code by 20% with (averaging
    the +20% and -10% across the two machines) only a net ~5% gain in
    performance, and the PR will have to be reverted.

    —snip—


I think we should be careful not to increase the reviewer burden, and try to automate as much as possible. It would be nice if we could at some point set up a set of bots that can be triggered to run benchmarks for us and report the results in the PR.


Matti

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
