>> I hope there will not be a demand to use many non-universal intrinsics in 
>> ufuncs; we will need to work this out on a case-by-case basis in each ufunc. 
>> Does that sound reasonable? Are there intrinsics you have already used that 
>> have no parallel on other platforms?

I think that is reasonable. It's hard to anticipate the future need for, and 
benefit of, specialized intrinsics, but I tried to make a list of the 
specialized intrinsics currently in use in NumPy that I don’t believe exist on 
other platforms (most of these don’t exist on AVX2 either). I am not an expert 
in the ARM or VSX architectures, so please correct me if I am wrong. 

a. _mm512_mask_i32gather_ps
b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd
c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps
d. _mm512_getexp_ps
e. _mm512_getmant_ps
f. _mm512_scalef_ps
g. _mm512_permutex2var_ps, _mm512_permutex2var_pd
h. _mm512_maskz_div_ps, _mm512_maskz_div_pd
i. _mm512_permute_ps/_mm512_permute_pd 
j. _mm512_sqrt_ps/_mm512_sqrt_pd (I could be wrong on this one, but from the 
brief Google search I did, it seems like the Power ISA doesn’t have a 
vectorized sqrt instruction)

Software implementations of these instructions are definitely possible. But 
some of them are not trivial to implement and are surely not going to be 
one-line macros either. I am also unsure what implications this has for 
performance, but we will hopefully find out once we convert these to universal 
intrinsics and then benchmark. 
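
To give a rough idea of why these are more than one-line macros, here is a 
minimal scalar sketch of two items from the list (c and d). The helper names 
emu_maskz_loadu_f32 and emu_getexp_f32 are hypothetical and are not in NumPy; 
the special-case handling follows my reading of the Intel documentation for 
_mm512_getexp_ps and should be checked against it.

```c
#include <math.h>
#include <stdint.h>

#define EMU_LANES 16  /* number of float32 lanes in a 512-bit vector */

/* Hypothetical helper: lane-by-lane stand-in for _mm512_maskz_loadu_ps.
 * Lanes whose mask bit is clear are zeroed and src[i] is never read for
 * them, which matters when src sits near the end of an allocation. */
void emu_maskz_loadu_f32(float dst[EMU_LANES], uint16_t mask, const float *src)
{
    for (int i = 0; i < EMU_LANES; i++) {
        dst[i] = ((mask >> i) & 1u) ? src[i] : 0.0f;
    }
}

/* Hypothetical helper: scalar stand-in for one lane of _mm512_getexp_ps,
 * i.e. floor(log2(|x|)) returned as a float. Assumed special cases:
 * NaN -> NaN, +/-Inf -> +Inf, +/-0 -> -Inf. */
float emu_getexp_f32(float x)
{
    int e;

    if (isnan(x)) {
        return x;
    }
    if (isinf(x)) {
        return INFINITY;
    }
    if (x == 0.0f) {
        return -INFINITY;
    }
    /* frexpf gives |x| = m * 2^e with m in [0.5, 1), so floor(log2|x|) = e - 1 */
    (void)frexpf(x, &e);
    return (float)(e - 1);
}
```

Even this small sketch needs explicit handling of special values, and a 
lane-by-lane loop is unlikely to match the speed of the hardware instruction.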

Raghuveer

-----Original Message-----
From: NumPy-Discussion 
<numpy-discussion-bounces+raghuveer.devulapalli=intel....@python.org> On Behalf 
Of Matti Picus
Sent: Tuesday, February 11, 2020 11:19 PM
To: numpy-discussion@python.org
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics

On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
>
> On top of that the performance implications aren’t clear. Software 
> implementations of hardware instructions might perform worse and might 
> not even produce the same result.
>

The proposal for universal intrinsics does not enable replacing an intrinsic on 
one platform with a software emulation on another: the intrinsics are meant to 
be compile-time defines that overlay the universal intrinsic with a 
platform-specific one. In order to use a new intrinsic, it must have parallel 
intrinsics on the other platforms, or it cannot be used there: 
"NPY_CPU_HAVE(FEATURE_NAME)" will always return false on a platform that lacks 
it, so the compiler will not even build a loop for that platform. I will try to 
clarify that intention in the NEP.
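
As a rough illustration of the compile-time overlay idea, here is a minimal 
sketch. The names UV_HAVE_SIMD, UV_LANES, uv_load, uv_store and uv_add are 
hypothetical placeholders, not the macros the NEP or NumPy will define; only 
the SSE and NEON intrinsics they expand to are real.

```c
#include <stddef.h>

/* Hypothetical universal names mapped onto platform intrinsics at compile
 * time. Where no mapping exists, UV_HAVE_SIMD is 0 and the vector loop
 * below is not compiled at all; nothing is emulated in software. */
#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>
  #define UV_HAVE_SIMD 1
  #define UV_LANES 4
  #define uv_load(p)      _mm_loadu_ps(p)
  #define uv_store(p, v)  _mm_storeu_ps((p), (v))
  #define uv_add(a, b)    _mm_add_ps((a), (b))
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  #define UV_HAVE_SIMD 1
  #define UV_LANES 4
  #define uv_load(p)      vld1q_f32(p)
  #define uv_store(p, v)  vst1q_f32((p), (v))
  #define uv_add(a, b)    vaddq_f32((a), (b))
#else
  #define UV_HAVE_SIMD 0
#endif

/* Element-wise float32 add: vector loop when available, then the ordinary
 * scalar loop for the tail (or for everything on unsupported platforms). */
void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if UV_HAVE_SIMD
    for (; i + UV_LANES <= n; i += UV_LANES) {
        uv_store(out + i, uv_add(uv_load(a + i), uv_load(b + i)));
    }
#endif
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

The loop body is written once against the universal names, and each platform 
only contributes the defines.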


I hope there will not be a demand to use many non-universal intrinsics in 
ufuncs; we will need to work this out on a case-by-case basis in each ufunc. 
Does that sound reasonable? Are there intrinsics you have already used that 
have no parallel on other platforms?


Matti

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion