On 11/2/20 7:16 am, Ralf Gommers wrote:


On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edi...@gmail.com <mailto:einstein.edi...@gmail.com>> wrote:

    —snip—

    > 1) Once NumPy adds the framework and initial set of Universal
    Intrinsics, if contributors want to leverage a new
    architecture-specific SIMD instruction, will they be expected to
    add a software implementation of this instruction for all other
    architectures too?

    In my opinion, if the instructions are lower in the hierarchy, then
    yes. For example, one cannot add AVX-512 without also adding, for
    example, AVX-256, AVX-128 and SSE*. However, I would not expect one
    person or team to be an expert in all instruction sets, so
    intrinsics for one architecture can be developed independently of
    another.


I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework.

This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right?

Yes. Universal intrinsics are cross-platform. However, the NEP is open to the possibility that certain architectures may have SIMD intrinsics that cannot be expressed in terms of intrinsics for other platforms, and so there may be a use case for architecture-specific loops. This is explicitly stated in the latest PR to the NEP: "If the regression is not minimal, we may choose to keep the X86-specific code for that platform and use the universal intrinsic code for other platforms."


    > 2) On whom does the burden lie to ensure that new
    implementations are benchmarked and show benefits on every
    architecture? What happens if optimizing a ufunc improves
    performance on one architecture and worsens it on another?


This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if it doesn't, and the regression isn't easily fixable, the problematic platform could be skipped so performance there is unchanged.


On HEAD, out of the 89 ufuncs in numpy.core.code_generators.generate_umath.defdict, 34 have X86-specific SIMD loops:


>>> [x for x in defdict.keys() if any([td.simd for td in defdict[x].type_descriptions])]
['add', 'subtract', 'multiply', 'conjugate', 'square', 'reciprocal',
 'absolute', 'negative', 'greater', 'greater_equal', 'less', 'less_equal',
 'equal', 'not_equal', 'logical_and', 'logical_not', 'logical_or',
 'maximum', 'minimum', 'bitwise_and', 'bitwise_or', 'bitwise_xor',
 'invert', 'left_shift', 'right_shift', 'cos', 'sin', 'exp', 'log',
 'sqrt', 'ceil', 'trunc', 'floor', 'rint']


They would be the first targets for universal intrinsics. Of these, I estimate that the ones with more than one loop for at least one dtype signature would be the most difficult, since these have different optimizations for avx2, fma, and/or avx512f:


['square', 'reciprocal', 'absolute', 'cos', 'sin', 'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint']


The other 55 ufuncs, for completeness, are


['floor_divide', 'true_divide', 'fmod', '_ones_like', 'power', 'float_power', '_arg', 'positive', 'sign', 'logical_xor', 'clip', 'fmax', 'fmin', 'logaddexp', 'logaddexp2', 'heaviside', 'degrees', 'rad2deg', 'radians', 'deg2rad', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'tan', 'cosh', 'sinh', 'tanh', 'exp2', 'expm1', 'log2', 'log10', 'log1p', 'cbrt', 'fabs', 'arctan2', 'remainder', 'divmod', 'hypot', 'isnan', 'isnat', 'isinf', 'isfinite', 'signbit', 'copysign', 'nextafter', 'spacing', 'modf', 'ldexp', 'frexp', 'gcd', 'lcm', 'matmul']


As for testing accuracy: we recently added a framework for testing ulp (unit in the last place) variation of ufuncs against "golden results" in numpy/core/tests/test_umath_accuracy. So far float32 is tested for exp, log, cos, sin. Others may be tested elsewhere by specific tests, for instance numpy/core/tests/test_half.py has test_half_ufuncs.
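
To give an idea of what that framework checks, here is a minimal sketch of an ulp-style comparison. The input range, the way the "golden" reference is built, and the tolerance below are illustrative only, not copied from the actual tests:

import numpy as np

# float32 inputs over a range where exp does not overflow in float32
x = np.linspace(-87.0, 87.0, 10000, dtype=np.float32)

# "golden" reference: compute in float64, then round to float32
golden = np.exp(x.astype(np.float64)).astype(np.float32)

# require the float32 loop (SIMD or scalar) to stay within a few ulp of
# the reference; maxulp=3 is a made-up example tolerance
np.testing.assert_array_max_ulp(np.exp(x), golden, maxulp=3)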


It is difficult to do benchmarking on CI: the machines that run CI vary too much. We would need to set aside a machine for this and carefully set it up to keep CPU speed and temperature constant. We do have benchmarks for ufuncs (they could always be improved). I think Pauli runs the benchmarks carefully on X86, and may even make the results public, but that resource is not really on PR reviewers' radar. We could run benchmarks on the gcc build farm machines for other architectures. Those machines are shared but not heavily utilized.
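
For reference, the existing ufunc benchmarks in benchmarks/benchmarks/bench_ufunc.py are asv (airspeed velocity) classes. A minimal sketch of what such a benchmark looks like (the class name, dtypes and array size are illustrative, not taken from the actual file):

import numpy as np

class TimeTranscendental:
    # asv runs each time_* method once per parameter value
    params = [np.float32, np.float64]
    param_names = ['dtype']

    def setup(self, dtype):
        self.x = np.linspace(0.0, 2.0 * np.pi, 100000, dtype=dtype)

    def time_sin(self, dtype):
        np.sin(self.x)

    def time_exp(self, dtype):
        np.exp(self.x)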


I'd think it is only once existing universal intrinsics are later tweaked that we will have to be much more careful.



    I would look at this from a maintainability point of view. If we
    are increasing the code size by 20% for a certain ufunc, there
    must be a demonstrable 20% increase in performance on any CPU.
    That is to say, micro-optimisation will be unwelcome, and code
    readability will be preferable. Usually we ask the submitter of
    the PR to test the PR with a machine they have on hand, and I
    would be inclined to keep this trend of self-reporting. Of course,
    if someone else came along and reported a performance regression
    of, say, 10%, then we have increased code by 20% with (averaging
    the +20% and -10% across the two machines) only a net ~5% gain in
    performance, and the PR will have to be reverted.

    —snip—


I think we should be careful not to increase the reviewer burden, and try to automate as much as possible. It would be nice if we could at some point set up a set of bots that can be triggered to run benchmarks for us and report the results in the PR.


Matti

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
