> On Feb 14, 2021, at 12:04 PM, Barry Smith <[email protected]> wrote:
>
>
>   For our handcoded AVX functions this is fine, we can handle the dispatching
> ourselves.

Cool. _may_i_use_cpu_feature() would be very useful to determine the optimal AVX code path at runtime. In principle we just need to query for the needed features once and cache the results.

>
>   But what about all the tons of regular code in PETSc, somehow we need to
> have the same function compiled twice and dispatched properly. Do we use what
> Hong suggested with fat binaries? So fat-binaries PLUS _may_i_use_cpu_feature
> together are the way to portable transportable libraries?
>
>
>   And we do this always --with-debugging=0 so everyone, packages and users get
> portable but also the best performance possible.

IMHO, only package managers should consider using -ax options. On our side, if we want to satisfy the needs of different parties (developers, users, package managers), it is better to be conservative than aggressive. -march=native brings a huge performance improvement, but many compilers have never made it the default, with good reason. Even -O3 does not enable the advanced vector instructions. I just did a quick check on petsc-02:

hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep SSE
#define __SSE__ 1
#define __SSE_MATH__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | grep avx
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$

What Jed usually does (--with-debugging=0 COPTFLAGS='-O2 -march=native') can be suggested to anyone who does not need to care about portability.

If you do not want users to specify the magic options, perhaps we can provide a configure option like --with-portability. If it is set to false, we add aggressive flags automatically.

Hong

>
> Barry
>
>
>> On Feb 14, 2021, at 11:50 AM, Jed Brown <[email protected]> wrote:
>>
>>>
>>
>> immintrin.h provides
>>
>> if (_may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2)) {
>>   fancy_version_that_needs_fma_and_avx2();
>> } else {
>>   fallback_version();
>> }
>>
>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_may_i_use&expand=3677,3677
>>
>> I believe this function is slightly expensive because it probably calls the
>> CPUID instruction each time. BLIS has code to cache the result and query
>> features with simple bitwise math.
>>
>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.h
>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.c
>>
>> Of course this bit of dispatch should typically be done at object creation
>> time, not every iteration.
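
For illustration only, here is a rough sketch of the "query once, cache, and dispatch at object creation time" idea discussed in this thread. It is not PETSc or BLIS code. Since _may_i_use_cpu_feature() is specific to the Intel compiler, the sketch assumes GCC or Clang on x86 and uses their __builtin_cpu_init() and __builtin_cpu_supports() builtins as a stand-in; on anything else it simply takes the fallback. The names axpy_fn, axpy_fallback, axpy_avx2_fma, cpu_has_avx2_fma and select_axpy are hypothetical, and the "AVX2/FMA" kernel is only a placeholder with a scalar body.

```c
#include <stddef.h>

/* Hypothetical kernel signature: y <- y + a*x */
typedef void (*axpy_fn)(size_t n, double a, const double *x, double *y);

/* Portable version, safe on any CPU. */
static void axpy_fallback(size_t n, double a, const double *x, double *y)
{
  for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* Placeholder for a hand-coded AVX2/FMA kernel. A real one would use
   intrinsics and be compiled with the matching target flags; the scalar
   body here only keeps the sketch runnable everywhere. */
static void axpy_avx2_fma(size_t n, double a, const double *x, double *y)
{
  for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* Query the CPU once and cache the answer (-1 means "not queried yet").
   Assumption: GCC/Clang builtins on x86. Not thread-safe as written. */
static int cpu_has_avx2_fma(void)
{
  static int cached = -1;
  if (cached < 0) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    cached = (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")) ? 1 : 0;
#else
    cached = 0; /* unknown compiler or architecture: be conservative */
#endif
  }
  return cached;
}

/* Call this once, e.g. when an object is created, and store the pointer;
   the inner loops then pay no feature-query cost. */
static axpy_fn select_axpy(void)
{
  return cpu_has_avx2_fma() ? axpy_avx2_fma : axpy_fallback;
}
```

The function pointer returned by select_axpy() would be stored in the object at creation time, so each later call is a plain indirect call. This only covers the hand-coded kernels; it does not address compiling the rest of the library twice.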
