On 11/2/20 7:16 am, Ralf Gommers wrote:
On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi
<einstein.edi...@gmail.com> wrote:
—snip—
> 1) Once NumPy adds the framework and initial set of Universal
> Intrinsics, if contributors want to leverage a new
> architecture-specific SIMD instruction, will they be expected to add a
> software implementation of this instruction for all other
> architectures too?
In my opinion, if the instructions are lower in the hierarchy, then
yes. For example, one cannot add an AVX-512 intrinsic without also
adding, for example, the AVX/AVX2 and SSE* equivalents. However, I
would not expect one person or team to be an expert in all instruction
sets, so intrinsics for one architecture can be developed
independently of another.
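To make that layering concrete, here is a rough sketch of the idea behind a universal intrinsic. The real framework is C and resolves at compile time; this Python stand-in is illustrative only, with the name loosely modeled on the `npyv_` prefix used in the NEP, and the per-architecture entries are placeholders rather than real SIMD code:

```python
def scalar_add_f32(a, b):
    # Baseline implementation that every platform has.
    return [x + y for x, y in zip(a, b)]

# Hypothetical per-architecture implementations; in NumPy these would be
# AVX-512/AVX2/NEON/VSX intrinsic versions of the same operation.
IMPLEMENTATIONS = {
    "avx512": scalar_add_f32,  # stand-in for an AVX-512 loop
    "neon": scalar_add_f32,    # stand-in for a NEON loop
}

def npyv_add_f32(a, b, target_arch):
    # A "universal" intrinsic: pick the architecture-specific version if
    # one exists, otherwise fall back to the scalar baseline, so the
    # operation is available on every platform.
    impl = IMPLEMENTATIONS.get(target_arch, scalar_add_f32)
    return impl(a, b)
```

The point of the fallback is exactly the "universal" guarantee under discussion: even an architecture with no dedicated entry still gets a correct (if slower) implementation.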
I think this doesn't quite answer the question. If I understand
correctly, it's about a single instruction (e.g. one needs "VEXP2PD"
and it's missing from the supported AVX512 instructions in master). I
think the answer is yes, it needs to be added for other architectures
as well. Otherwise, if universal intrinsics are added ad hoc and
there's no guarantee that a universal instruction is available for all
main supported platforms, then over time there won't be much that's
"universal" about the framework.

This is a different question, though, from adding a new ufunc
implementation. I would expect accelerating ufuncs via intrinsics that
are already supported to be much more common than having to add new
intrinsics. Does that sound right?
Yes. Universal intrinsics are cross-platform. However, the NEP is open
to the possibility that certain architectures may have SIMD intrinsics
that cannot be expressed in terms of intrinsics for other platforms,
so there may be a use case for architecture-specific loops. This is
explicitly stated in the latest PR to the NEP: "If the regression is
not minimal, we may choose to keep the X86-specific code for that
platform and use the universal intrinsic code for other platforms."
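That escape hatch can be sketched as a per-platform loop selection. The function names below are hypothetical (the real selection happens in C at build/dispatch time); the sketch only shows the decision the NEP describes, using plain math.exp as a stand-in for both loop bodies:

```python
import math

def exp_universal(xs):
    # Stand-in for the universal-intrinsic loop used on most platforms.
    return [math.exp(x) for x in xs]

def exp_x86_specific(xs):
    # Stand-in for an existing AVX-512 loop kept only where the
    # universal version would regress performance.
    return [math.exp(x) for x in xs]

# Hypothetical registry: architectures with a retained specific loop.
SPECIFIC = {"x86_64": exp_x86_specific}

def select_exp_loop(arch):
    # Prefer the platform-specific loop where one was kept; everywhere
    # else, use the universal code.
    return SPECIFIC.get(arch, exp_universal)
```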
> 2) On whom does the burden lie to ensure that new implementations
> are benchmarked and show benefits on every architecture? What happens
> if optimizing a ufunc improves performance on one architecture and
> worsens it on another?
This is slightly hard to provide a recipe for. I suspect it may take a
while before this becomes an issue, since we don't have much SIMD code
to begin with. So adding new code with benchmarks will likely show
improvements on all architectures (we should ensure benchmarks can be
run via CI, otherwise it's too onerous). If an improvement doesn't
materialize on some platform and that isn't easily fixable, the
problematic platform could be skipped so performance there is
unchanged.
On HEAD, out of the 89 ufuncs in
numpy.core.code_generators.generate_umath.defdict, 34 have X86-specific
SIMD loops:
>>> [x for x in defdict if any(td.simd for td in
...         defdict[x].type_descriptions)]
['add', 'subtract', 'multiply', 'conjugate', 'square', 'reciprocal',
'absolute', 'negative', 'greater', 'greater_equal', 'less',
'less_equal', 'equal', 'not_equal', 'logical_and', 'logical_not',
'logical_or', 'maximum', 'minimum', 'bitwise_and', 'bitwise_or',
'bitwise_xor', 'invert', 'left_shift', 'right_shift', 'cos', 'sin',
'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint']
They would be the first targets for universal intrinsics. Of them I
estimate that the ones with more than one loop for at least one dtype
signature would be the most difficult, since these have different
optimizations for avx2, fma, and/or avx512f:
['square', 'reciprocal', 'absolute', 'cos', 'sin', 'exp', 'log', 'sqrt',
'ceil', 'trunc', 'floor', 'rint']
The other 55 ufuncs, for completeness, are
['floor_divide', 'true_divide', 'fmod', '_ones_like', 'power',
'float_power', '_arg', 'positive', 'sign', 'logical_xor', 'clip',
'fmax', 'fmin', 'logaddexp', 'logaddexp2', 'heaviside', 'degrees',
'rad2deg', 'radians', 'deg2rad', 'arccos', 'arccosh', 'arcsin',
'arcsinh', 'arctan', 'arctanh', 'tan', 'cosh', 'sinh', 'tanh', 'exp2',
'expm1', 'log2', 'log10', 'log1p', 'cbrt', 'fabs', 'arctan2',
'remainder', 'divmod', 'hypot', 'isnan', 'isnat', 'isinf', 'isfinite',
'signbit', 'copysign', 'nextafter', 'spacing', 'modf', 'ldexp', 'frexp',
'gcd', 'lcm', 'matmul']
As for testing accuracy: we recently added a framework for testing ulp
variation of ufuncs against "golden results" in
numpy/core/tests/test_umath_accuracy. So far float32 is tested for exp,
log, cos, sin. Others may be tested elsewhere by specific tests, for
instance numpy/core/test/test_half.py has test_half_ufuncs.
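As a standalone illustration of that kind of check (this is a sketch of the same idea, not NumPy's actual test harness): evaluate a ufunc in float32, compare against a float64 reference on the same inputs, and express the difference in units of float32 spacing (ulp) via np.spacing:

```python
import numpy as np

def max_ulp_error(x32, func=np.exp):
    # Reference computed in float64 on the *same* float32 inputs, so
    # only the float32 implementation's rounding is measured.
    ref = func(x32.astype(np.float64))
    got = func(x32).astype(np.float64)
    # Size of one float32 ulp at each reference value.
    ulp = np.spacing(np.abs(ref).astype(np.float32)).astype(np.float64)
    return float(np.max(np.abs(got - ref) / ulp))

x32 = np.linspace(-10.0, 10.0, 10001).astype(np.float32)
err = max_ulp_error(x32)  # typically a few ulp at most for np.exp
```

A "golden results" test of this shape would assert that err stays below an agreed bound for each optimized loop.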
It is difficult to do benchmarking on CI: the machines that run CI vary
too much. We would need to set aside a machine for this and carefully
set it up to keep CPU speed and temperature constant. We do have
benchmarks for ufuncs (they could always be improved). I think Pauli
runs the benchmarks carefully on X86, and may even make the results
public, but that resource is not really on PR reviewers' radar. We could
run benchmarks on the gcc build farm machines for other architectures.
Those machines are shared but not heavily utilized.
Only once existing universal intrinsics start being tweaked will we
have to be much more careful, I'd think.
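For reference, NumPy's ufunc benchmarks run under airspeed velocity (asv), where a benchmark is a class with time_* methods; the class and method names below are illustrative, not taken from NumPy's actual suite:

```python
import numpy as np

class TimeUfuncExp:
    # asv calls setup() before timing each time_* method, so array
    # allocation is excluded from the measured time.
    def setup(self):
        self.x = np.full(10000, 0.5, dtype=np.float32)

    def time_exp(self):
        # Only the ufunc call itself is timed.
        np.exp(self.x)
```

Comparing two commits with asv (e.g. `asv continuous` or `asv compare`) is how a per-platform regression in a tweaked loop would show up.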
I would look at this from a maintainability point of view. If we
are increasing the code size by 20% for a certain ufunc, there
must be a demonstrable 20% increase in performance on any CPU.
That is to say, micro-optimisation will be unwelcome, and code
readability will be preferable. Usually we ask the submitter of
the PR to test the PR with a machine they have on hand, and I
would be inclined to keep this trend of self-reporting. Of course,
if someone else came along and reported a performance regression
of, say, 10%, then we have increased code by 20%, with only a net
5% gain in performance, and the PR will have to be reverted.
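The arithmetic behind that example, spelled out (assuming the two reports are weighted equally):

```python
code_growth = 0.20       # 20% more code for the ufunc
gains = [0.20, -0.10]    # +20% on the submitter's machine, -10% reported later
net_gain = sum(gains) / len(gains)  # equal-weight average of the two reports
# net_gain == 0.05: a 5% net speedup for 20% more code, hence the revert.
```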
—snip—
I think we should be careful not to increase the reviewer burden, and
try to automate as much as possible. It would be nice if we could at
some point set up a set of bots that can be triggered to run benchmarks
for us and report in the PR the results.
Matti
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion