> On Feb 14, 2021, at 1:25 PM, Zhang, Hong <[email protected]> wrote:
> 
> 
> 
>> On Feb 14, 2021, at 12:04 PM, Barry Smith <[email protected]> wrote:
>> 
>> 
>>  For our handcoded AVX functions this is fine, we can handle the dispatching 
>> ourselves. 
> 
> Cool. _may_i_use_cpu_feature() would be very useful to determine the optimal 
> AVX code path at runtime. Theoretically we just need to query for the needed 
> features once and cache the results.
> 
>> 
>>  But what about the tons of regular code in PETSc? Somehow we need to 
>> have the same function compiled twice and dispatched properly. Do we use 
>> what Hong suggested with fat binaries? So are fat binaries PLUS 
>> _may_i_use_cpu_feature together the way to portable, transportable 
>> libraries?
>> 
>> 
>> And we do this always --with-debugging=0 so everyone, packages and users get 
>> portable but also the best performance possible.
> 
> IMHO, only package managers should consider using -ax options. On our side, 
> if we want to satisfy the needs of different parties (developers, users, 
> package managers), it is better to be conservative than aggressive. 
> -march=native brings a huge performance improvement

  But this means most of our users are, year after year, throwing lots of 
performance on the floor and don't even know it. I think we pander to 
portability too much.

> but it has never been the default for many compilers, for good reason. Even 
> -O3 does not enable the advanced vector instructions. I just did a quick 
> check on petsc-02: 
> 
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null 
> | grep SSE
> #define __SSE__ 1
> #define __SSE_MATH__ 1
> #define __SSE2__ 1
> #define __SSE2_MATH__ 1
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null 
> | grep avx
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ 
> 
> What Jed usually does (--with-debugging=0 COPTFLAGS='-O2 -march=native') can 
> be suggested to anyone who does not need to care about portability. If you do 
> not want users to specify the magic options, perhaps we can provide a 
> configure option like --with-portability. If it is set to false, we add 
> aggressive flags automatically.

   My feeling is 90+% of users don't care about portability, they want to get 
fast performance on the machine they are compiling with (or a collection of 
machines they have around).  

   Can we build aggressively for their system (except for package managers and 
people who provide -march themselves) and have PetscInitialize() produce a very 
useful error message if they then run the code on a system where it will not 
work? Are there any system calls to get that type of information?

  Barry
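
A rough sketch of what such a startup check could look like, assuming GCC or 
Clang on x86. It uses the compiler builtins __builtin_cpu_init() and 
__builtin_cpu_supports() rather than a system call, and compares them against 
the predefined macros (__AVX2__, __FMA__, __AVX512F__) that record what the 
compiler was allowed to emit. The function name check_build_cpu_features is 
hypothetical and is not part of PETSc.

```c
#include <stdio.h>

/* Hypothetical helper (not a PETSc function): compares the instruction sets
   the compiler was allowed to use (predefined macros such as __AVX2__) with
   what the running CPU reports. Returns the number of missing features.
   Assumes GCC or Clang on x86; on other platforms it checks nothing. */
int check_build_cpu_features(void)
{
  int missing = 0;
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init(); /* populate the compiler runtime's CPU feature data */
#if defined(__AVX2__)
  if (!__builtin_cpu_supports("avx2")) {
    fprintf(stderr, "Built with AVX2, but this CPU does not support it\n");
    missing++;
  }
#endif
#if defined(__FMA__)
  if (!__builtin_cpu_supports("fma")) {
    fprintf(stderr, "Built with FMA, but this CPU does not support it\n");
    missing++;
  }
#endif
#if defined(__AVX512F__)
  if (!__builtin_cpu_supports("avx512f")) {
    fprintf(stderr, "Built with AVX-512F, but this CPU does not support it\n");
    missing++;
  }
#endif
#endif
  return missing;
}
```

An initialization routine could call this first and abort with a rebuild hint 
when the result is nonzero. One caveat: if the file containing the check is 
itself compiled with -march=native, the compiler may emit unsupported 
instructions before the message is printed, so the check would probably need 
to live in a file built with baseline flags.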



> 
> Hong
> 
>> 
>> Barry
>> 
>> 
>>> On Feb 14, 2021, at 11:50 AM, Jed Brown <[email protected]> wrote:
>>> 
>>>> 
>>> 
>>> immintrin.h provides
>>> 
>>> if (_may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2)) {
>>> fancy_version_that_needs_fma_and_avx2();
>>> } else {
>>> fallback_version();
>>> }
>>> 
>>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_may_i_use&expand=3677,3677
>>> 
>>> I believe this function is slightly expensive because it probably calls the 
>>> CPUID instruction each time. BLIS has code to cache the result and query 
>>> features with simple bitwise math.
>>> 
>>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.h
>>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.c
>>> 
>>> Of course this bit of dispatch should typically be done at object creation 
>>> time, not every iteration.
>> 
> 
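
The two points Jed raises (cache the feature query, and dispatch at object 
creation rather than every iteration) can be sketched as follows. This is only 
an illustration under stated assumptions: it uses the GCC/Clang builtin 
__builtin_cpu_supports() in place of _may_i_use_cpu_feature(), and the names 
cpu_features_cached, KernelOps, kernel_ops_create, dot_fancy and dot_fallback 
are all hypothetical. The "fancy" kernel is a scalar stand-in for a real 
AVX2/FMA implementation.

```c
#include <stddef.h>

/* Hypothetical feature bits, queried once and then tested with bitwise math. */
enum { CPU_FEAT_AVX2 = 1 << 0, CPU_FEAT_FMA = 1 << 1 };

/* Query the CPU on the first call and cache the result.
   Not thread-safe as written; real code would guard the initialization. */
unsigned cpu_features_cached(void)
{
  static int      initialized = 0;
  static unsigned features    = 0;
  if (!initialized) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) features |= CPU_FEAT_AVX2;
    if (__builtin_cpu_supports("fma"))  features |= CPU_FEAT_FMA;
#endif
    initialized = 1;
  }
  return features;
}

typedef double (*dot_fn)(const double *, const double *, size_t);

/* Portable fallback kernel. */
static double dot_fallback(const double *x, const double *y, size_t n)
{
  double s = 0.0;
  for (size_t i = 0; i < n; i++) s += x[i] * y[i];
  return s;
}

/* Stand-in for a hand-coded AVX2/FMA kernel; scalar here so the sketch
   compiles without special flags. */
static double dot_fancy(const double *x, const double *y, size_t n)
{
  return dot_fallback(x, y, n);
}

/* Hypothetical object holding the kernel chosen for this machine. */
typedef struct { dot_fn dot; } KernelOps;

/* Pick the kernel once, at object creation time. */
void kernel_ops_create(KernelOps *ops)
{
  const unsigned need = CPU_FEAT_AVX2 | CPU_FEAT_FMA;
  ops->dot = ((cpu_features_cached() & need) == need) ? dot_fancy : dot_fallback;
}
```

After kernel_ops_create(&ops), every call goes through ops.dot(x, y, n) with 
no further feature queries, so the cost of the CPU query is paid only once per 
process.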
