> On Feb 14, 2021, at 12:04 PM, Barry Smith <[email protected]> wrote:
> 
> 
>   For our handcoded AVX functions this is fine, we can handle the dispatching 
> ourselves. 

Cool. _may_i_use_cpu_feature() would be very useful for choosing the optimal 
AVX code path at runtime. In principle we only need to query the needed 
features once and cache the results.
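
A minimal sketch of that query-once-and-cache idea, under some assumptions: 
it uses the GCC/Clang builtin __builtin_cpu_supports() in place of 
_may_i_use_cpu_feature() (which comes with the Intel compiler), and the 
MY_FEATURE_* bits and my_cpu_* helpers are made-up names for illustration, not 
PETSc or BLIS API. On other compilers or architectures it simply reports no 
features.

```c
#include <stdbool.h>

/* Hypothetical feature bits for this sketch (not PETSc or BLIS names). */
enum {
  MY_FEATURE_AVX  = 1u << 0,
  MY_FEATURE_AVX2 = 1u << 1,
  MY_FEATURE_FMA  = 1u << 2
};

/* Query the CPU on the first call only; later calls return the cached mask.
   Not thread-safe as written: call it once before spawning threads. */
static unsigned my_cpu_features(void)
{
  static bool     initialized = false;
  static unsigned features    = 0;

  if (!initialized) {
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__INTEL_COMPILER) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))  features |= MY_FEATURE_AVX;
    if (__builtin_cpu_supports("avx2")) features |= MY_FEATURE_AVX2;
    if (__builtin_cpu_supports("fma"))  features |= MY_FEATURE_FMA;
#endif
    initialized = true;
  }
  return features;
}

/* True only if every bit in `needed` is present: plain bitwise math. */
static bool my_cpu_has(unsigned needed)
{
  return (my_cpu_features() & needed) == needed;
}
```

A caller would then test my_cpu_has(MY_FEATURE_FMA | MY_FEATURE_AVX2) without 
touching CPUID again after the first call.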

> 
>  But what about all the tons of regular code in PETSc, somehow we need to 
> have the same function compiled twice and dispatched properly. Do we use what 
> Hong suggested with fat binaries? So fat-binaries PLUS _may_i_use_cpu_feature 
> together are the way to portable transportable libraries?
> 
> 
>  And we do this always --with-debugging=0 so everyone, packages and users get 
> portable but also the best performance possible.

IMHO, only package managers should consider using -ax options. On our side, if 
we want to satisfy the needs of different parties (developers, users, package 
managers), it is better to be conservative than aggressive. -march=native can 
bring a large performance improvement, but for good reason it has never been 
the default for many compilers: the resulting binary may not run on a 
different CPU. Even -O3 does not enable the advanced vector instructions. 
I just did a quick check on petsc-02: 

hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep SSE
#define __SSE__ 1
#define __SSE_MATH__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep avx
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ 
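
The same check can be made from inside a program, which shows what the 
compiler was actually allowed to emit for a given set of flags. This is only a 
sketch: it assumes the usual predefined macros __SSE2__, __AVX__ and __AVX2__, 
and the built_with_* and print_build_isa names are made up. Note that these 
macros are uppercase, so on the command line a case-insensitive search such as 
grep -i avx is the safer test.

```c
#include <stdio.h>

/* Each helper returns 1 if the compiler defined the macro for this build. */
static int built_with_sse2(void)
{
#if defined(__SSE2__)
  return 1;
#else
  return 0;
#endif
}

static int built_with_avx(void)
{
#if defined(__AVX__)
  return 1;
#else
  return 0;
#endif
}

static int built_with_avx2(void)
{
#if defined(__AVX2__)
  return 1;
#else
  return 0;
#endif
}

/* Print one line summarizing the instruction sets enabled at compile time. */
static void print_build_isa(void)
{
  printf("SSE2=%d AVX=%d AVX2=%d\n",
         built_with_sse2(), built_with_avx(), built_with_avx2());
}
```

Compiling this with and without -march=native (or -xHost for icc) and calling 
print_build_isa() should make the difference visible.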

What Jed usually does (--with-debugging=0 COPTFLAGS='-O2 -march=native') can be 
suggested to anyone who does not need to care about portability. If you do not 
want users to specify the magic options, perhaps we can provide a configure 
option like --with-portability. If it is set to false, we add aggressive flags 
automatically.

Hong

> 
>  Barry
> 
> 
>> On Feb 14, 2021, at 11:50 AM, Jed Brown <[email protected]> wrote:
>> 
>>> 
>> 
>> immintrin.h provides
>> 
>> if (_may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2)) {
>>   fancy_version_that_needs_fma_and_avx2();
>> } else {
>>   fallback_version();
>> }
>> 
>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_may_i_use&expand=3677,3677
>> 
>> I believe this function is slightly expensive because it probably calls the 
>> CPUID instruction each time. BLIS has code to cache the result and query 
>> features with simple bitwise math.
>> 
>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.h
>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.c
>> 
>> Of course this bit of dispatch should typically be done at object creation 
>> time, not every iteration.
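
A rough sketch of dispatching once at object creation, assuming a function 
pointer stored in the object. Kernel, KernelCreate, AxpyFn and the two axpy_* 
routines are hypothetical names, and axpy_fancy is only a placeholder where a 
hand-coded AVX2/FMA version would go. The feature flag is passed in so that 
the query itself stays outside the hot path.

```c
#include <stddef.h>

/* Signature shared by all implementations: y <- a*x + y. */
typedef void (*AxpyFn)(size_t n, double a, const double *x, double *y);

/* Portable version that runs on any CPU. */
static void axpy_fallback(size_t n, double a, const double *x, double *y)
{
  for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* Placeholder for a version written with AVX2/FMA intrinsics.
   It uses the same scalar loop here so the sketch compiles everywhere. */
static void axpy_fancy(size_t n, double a, const double *x, double *y)
{
  for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* The object remembers which implementation was chosen for it. */
typedef struct {
  AxpyFn axpy;
} Kernel;

/* Choose the code path once, when the object is created. */
static void KernelCreate(Kernel *k, int has_fma_avx2)
{
  k->axpy = has_fma_avx2 ? axpy_fancy : axpy_fallback;
}
```

Every later call goes through k->axpy with no feature test per iteration.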
> 
