Oops, there was a typo in the command line: the grep pattern should be AVX.
With -O3 alone, neither SSE3 and above nor AVX is enabled.

hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep SSE
#define __SSE__ 1
#define __SSE_MATH__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep AVX
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$
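
The same check can be made from inside C code, since the macros listed above are ordinary predefined preprocessor symbols. The following is only a minimal sketch: the function names simd_level and simd_level_name are hypothetical and not part of PETSc, and the result reflects the flags the file was compiled with, not what the CPU supports at run time.

```c
#include <stdio.h>

/* Hypothetical helper: report the highest vector ISA the compiler was
 * allowed to target for this translation unit, based on predefined macros.
 * 0 = none detected, 1 = SSE2, 2 = AVX, 3 = AVX2. */
static int simd_level(void)
{
#if defined(__AVX2__)
  return 3;
#elif defined(__AVX__)
  return 2;
#elif defined(__SSE2__)
  return 1;
#else
  return 0;
#endif
}

/* Hypothetical helper: human-readable name for the level above. */
static const char *simd_level_name(int level)
{
  switch (level) {
  case 3:  return "AVX2";
  case 2:  return "AVX";
  case 1:  return "SSE2";
  default: return "none";
  }
}

/* Print the compile-time level, e.g. for a configure-time sanity check. */
static void print_simd_level(void)
{
  printf("compile-time SIMD level: %s\n", simd_level_name(simd_level()));
}
```

Calling print_simd_level() from a program built with plain -O3 and again with -march=native added should show the difference, assuming the build machine supports AVX.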

> On Feb 14, 2021, at 1:25 PM, Zhang, Hong via petsc-dev 
> <[email protected]> wrote:
> 
> 
> 
>> On Feb 14, 2021, at 12:04 PM, Barry Smith <[email protected]> wrote:
>> 
>> 
>>  For our handcoded AVX functions this is fine, we can handle the dispatching 
>> ourselves. 
> 
> Cool. _may_i_use_cpu_feature() would be very useful to determine the optimal 
> AVX code path at runtime. Theoretically we just need to query for the needed 
> features once and cache the results.
> 
>> 
>> But what about all the tons of regular code in PETSc, somehow we need to 
>> have the same function compiled twice and dispatched properly. Do we use 
>> what Hong suggested with fat binaries? So fat-binaries PLUS 
>> _may_i_use_cpu_feature together are the way to portable transportable 
>> libraries?
>> 
>> 
>> And we do this always --with-debugging=0 so everyone, packages and users get 
>> portable but also the best performance possible.
> 
> IMHO, only package managers should consider using the -ax options. On our
> side, if we want to satisfy the needs of different parties (developers,
> users, package managers), it is better to be conservative than aggressive.
> -march=native brings a huge performance improvement, but for good reason it
> has never been the default for many compilers. Even -O3 does not enable the
> advanced vector instructions. I just did a quick check on petsc-02:
> 
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep SSE
> #define __SSE__ 1
> #define __SSE_MATH__ 1
> #define __SSE2__ 1
> #define __SSE2_MATH__ 1
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep avx
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ 
> 
> What Jed usually does (--with-debugging=0 COPTFLAGS='-O2 -march=native') can 
> be suggested to anyone who does not need to care about portability. If you do 
> not want users to specify the magic options, perhaps we can provide a 
> configure option like --with-portability. If it is set to false, we add 
> aggressive flags automatically.
> 
> Hong
> 
>> 
>> Barry
>> 
>> 
>>> On Feb 14, 2021, at 11:50 AM, Jed Brown <[email protected]> wrote:
>>> 
>>>> 
>>> 
>>> immintrin.h provides
>>> 
>>> if (_may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2)) {
>>>   fancy_version_that_needs_fma_and_avx2();
>>> } else {
>>>   fallback_version();
>>> }
>>> 
>>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_may_i_use&expand=3677,3677
>>> 
>>> I believe this function is slightly expensive because it probably calls the 
>>> CPUID instruction each time. BLIS has code to cache the result and query 
>>> features with simple bitwise math.
>>> 
>>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.h
>>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.c
>>> 
>>> Of course this bit of dispatch should typically be done at object creation 
>>> time, not every iteration.
>> 
> 
