Oops, there was a typo in my command line: the grep pattern should be AVX, not avx. With the corrected command, SSE3 and above and AVX are still not enabled by -O3.

hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep SSE
#define __SSE__ 1
#define __SSE_MATH__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep AVX
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$

> On Feb 14, 2021, at 1:25 PM, Zhang, Hong via petsc-dev
> <[email protected]> wrote:
>
>> On Feb 14, 2021, at 12:04 PM, Barry Smith <[email protected]> wrote:
>>
>> For our handcoded AVX functions this is fine, we can handle the dispatching
>> ourselves.
>
> Cool. _may_i_use_cpu_feature() would be very useful to determine the optimal
> AVX code path at runtime. Theoretically we just need to query for the needed
> features once and cache the results.
>
>> But what about all the tons of regular code in PETSc, somehow we need to
>> have the same function compiled twice and dispatched properly. Do we use
>> what Hong suggested with fat binaries? So fat-binaries PLUS
>> _may_i_use_cpu_feature together are the way to portable transportable
>> libraries?
>>
>> And we do this always --with-debugging=0 so everyone, packages and users get
>> portable but also the best performance possible.
>
> IMHO, only package managers should consider using -ax options. On our side,
> if we want to satisfy the needs of different parties (developers, users,
> package managers), better be conservative than aggressive. -march=native
> brings huge performance improvement but it has never been the default for
> many compilers with a good reason. Even -O3 does not enable the advanced
> vector instructions. I just did a quick check on petsc-02:
>
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep SSE
> #define __SSE__ 1
> #define __SSE_MATH__ 1
> #define __SSE2__ 1
> #define __SSE2_MATH__ 1
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep avx
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$
>
> What Jed usually does (--with-debugging=0 COPTFLAGS='-O2 -march=native') can
> be suggested to anyone who does not need to care about portability. If you do
> not want users to specify the magic options, perhaps we can provide a
> configure option like --with-portability. If it is set to false, we add
> aggressive flags automatically.
>
> Hong
>
>> Barry
>>
>>> On Feb 14, 2021, at 11:50 AM, Jed Brown <[email protected]> wrote:
>>>
>>> immintrin.h provides
>>>
>>>   if (_may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2)) {
>>>     fancy_version_that_needs_fma_and_avx2();
>>>   } else {
>>>     fallback_version();
>>>   }
>>>
>>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_may_i_use&expand=3677,3677
>>>
>>> I believe this function is slightly expensive because it probably calls the
>>> CPUID instruction each time. BLIS has code to cache the result and query
>>> features with simple bitwise math.
>>>
>>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.h
>>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.c
>>>
>>> Of course this bit of dispatch should typically be done at object creation
>>> time, not every iteration.
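
For reference, the "query once, cache, then dispatch" idea discussed in the quoted thread could look roughly like the sketch below. It is only an illustration under a few assumptions: the names DotKernel, dot_fallback, dot_avx2_fma and select_dot_kernel are hypothetical and not PETSc or BLIS API, the "AVX2" kernel is a plain C stand-in rather than real intrinsics, and the feature query uses the GCC/Clang builtin __builtin_cpu_supports() so that it compiles without icc. With icc one would call _may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2) at the same spot.

```c
#include <stddef.h>

/* Hypothetical kernel signature; the names in this sketch are illustrative only. */
typedef double (*DotKernel)(const double *x, const double *y, size_t n);

/* Portable scalar kernel that runs on any CPU. */
static double dot_fallback(const double *x, const double *y, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) sum += x[i] * y[i];
  return sum;
}

/* Stand-in for a hand-written AVX2/FMA kernel. A real one would use
   immintrin.h intrinsics; plain C keeps the sketch buildable everywhere. */
static double dot_avx2_fma(const double *x, const double *y, size_t n)
{
  return dot_fallback(x, y, n);
}

/* Cached choice; NULL means the CPU has not been queried yet. */
static DotKernel cached_dot = NULL;

/* Query the CPU features once and remember the selected kernel.
   Not thread-safe as written; call it once at object creation time. */
static DotKernel select_dot_kernel(void)
{
  if (!cached_dot) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    /* Assumption: GCC/Clang builtins used in place of icc's _may_i_use_cpu_feature(). */
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")) cached_dot = dot_avx2_fma;
    else cached_dot = dot_fallback;
#else
    (void)dot_avx2_fma; /* unused when no runtime feature query is available */
    cached_dot = dot_fallback;
#endif
  }
  return cached_dot;
}
```

A caller would store the result of select_dot_kernel() in its object at setup and then invoke the stored pointer inside the iteration loop, so the feature query never happens per iteration.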
