Hi Daniel,

Thanks a lot for the update and your continued help with this!
Rasmus

On Fri, Apr 23, 2021 at 3:23 AM <[email protected]> wrote:
>
> Hello,
>
> after a lot of measuring, profiling, and digging, a small update:
>
> The Closure benchmark slow-down has been reported, together with a
> reproducer, as https://gitlab.com/libeigen/eigen/-/issues/2225 .
>
> Interestingly, this slow-down also seems to affect the cases where we're
> mainly working with vectors / matrices of AutoDiffScalar<Dynamic Matrix
> with large in-line storage>; see for example the differences of
> Jac/FV/RANSSAneg for "clang 3.4-587a6915" vs "clang 3.4-587a6915
> EIGEN_UNALIGNED_VECTORIZE=0". So this probably isn't just due to the
> compiler being able to keep more data in registers (the AutoDiffScalars
> are way too large for that anyway).
>
> One large problem that caused a big discrepancy in the Jac/ tests was
> naive copying of (unused) storage for dynamically sized storage, reported
> as https://gitlab.com/libeigen/eigen/-/issues/2229 and already fixed. As
> you can see, the fix improves performance for the Jac/ tests a lot.
> Thank you!
>
> Some cases are still a fair bit worse (Res/FV/Euler, Res/FV/RANSSAneg,
> Res/DG/Euler, Jac/FV/NavierStokes), but overall it's not looking as bad
> as it used to. :)
>
> I'm still investigating some crashes when using 3.4 (which are not there
> with 3.2), but I cannot yet say whether that is a problem in Eigen 3.4 or
> somewhere else.
>
> Best regards
>
> Daniel Vollmer
>
> --------------------------
> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
> German Aerospace Center
> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108
> Braunschweig | Germany
>
> Daniel Vollmer | AS C²A²S²E
> www.DLR.de
>
> ________________________________________
> From: [email protected] <[email protected]>
> Sent: Monday, 19 April 2021 16:27:51
> To: [email protected]
> Subject: Re: [eigen] Eigen 3.4 release candidate 1!
>
> Hello all,
>
> I've compiled our CFD code with Eigen 3.4-rc1 and ran a few benchmarks
> relative to Eigen 3.2.9.
>
> Unfortunately, the negative performance impact introduced by Eigen 3.3 is
> still present (which is why we're still hanging on to the 3.2 branch).
>
> I've compiled our code with Eigen 3.2.9 as well as 3.4-rc1 with both
> clang and gcc, on a macOS x86-64 system (Big Sur) with the following CPU
> feature flags:
>
> machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR
> PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3
> PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1
> SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0
> RDRAND F16C
> machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 FDPEO
> SMEP BMI2 ERMS INVPCID PQM FPU_CSDS MPX PQE AVX512F AVX512DQ RDSEED ADX
> SMAP CLFSOPT CLWB IPT AVX512CD AVX512BW AVX512VL PKU AVX512VNNI MDCLEAR
> IBRS STIBP L1DF ACAPMSR SSBD
>
> Compilation settings were: -std=c++17 -fopenmp -Ofast -march=native
> -fno-finite-math-only -DEIGEN_DONT_PARALLELIZE=1
> (we split our work up among threads ourselves, but the benchmarks are
> single-threaded anyway).
>
> The clang version was Homebrew's llvm (stable 11.1.0, bottled); gcc was
> Homebrew's gcc (stable 10.2.0, bottled).
>
> We're using Eigen for small, fixed-size vectors and occasionally matrices
> of doubles (of varying lengths, e.g. 5, 6, 7, 8, 13, ...), mainly
> accessing individual elements, or fixed-size (length 3) segments at
> compile-time known offsets (indices).
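>
> To illustrate, the dominant access pattern looks roughly like the
> following (the state layout and the derived quantity are invented for
> this sketch, not taken from our actual code):
>
>   #include <Eigen/Dense>
>
>   // e.g. a compressible-flow state: density, momentum (3), energy
>   using State = Eigen::Matrix<double, 5, 1>;
>
>   double kineticEnergy(const State& u)
>   {
>     // fixed-size (length 3) segment at a compile-time known offset
>     const Eigen::Matrix<double, 3, 1> momentum = u.segment<3>(1);
>     return 0.5 * momentum.squaredNorm() / u[0];  // u[0] is the density
>   }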
> Occasionally, we also have matrix-vector products of these vectors and
> matrices, but they probably play a smaller role. These are the "Res"
> (residual computation) benchmarks, where we do this over a whole mesh
> with multiple loops, gradient computations, ...
>
> Then we do the same thing, but use Eigen's (unsupported) AutoDiffScalar
> with a fairly large fixed max-size (e.g. when our PDE has 6 state
> variables, then AutoDiffScalar has Eigen::Dynamic derivatives with a
> max-size of 6 * 24 + 1). These types are quite large, but it's still
> faster than fully dynamic heap allocation. These are the "Jac" (Jacobian
> computation) benchmarks, which otherwise largely mimic the "Res"
> benchmarks.
>
> https://www.maven.de/stuff/coda_benchmarks_eigen.pdf
>
> In these graphs, everything is relative to the clang Eigen 3.2.9 baseline
> at 1.0. Higher numbers are faster (2.0 would be twice as fast as the
> build using clang and Eigen 3.2.9).
>
> The first two benchmarks (Closure/...) are actual micro-benchmarks. In
> the first one, we have an input vector with 6 doubles, and from that
> compute a vector with 13 doubles (the first six are the same as the
> input, and the remaining 7 are derived values computed from those input
> values). The second version also augments gradients, which means it also
> has input gradients (Matrix<Matrix<double, 3, 1>, 6, 1>) and outputs
> augmented gradients (Matrix<Matrix<double, 3, 1>, 13, 1>).
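>
> As a rough sketch of the shape of the first micro-benchmark, together
> with the AutoDiffScalar type we use for the "Jac" variants (the derived
> values and all names are placeholders, not our actual code):
>
>   #include <Eigen/Dense>
>   #include <unsupported/Eigen/AutoDiff>
>
>   using Input  = Eigen::Matrix<double, 6, 1>;
>   using Output = Eigen::Matrix<double, 13, 1>;
>
>   Output closure(const Input& q)
>   {
>     Output w;
>     w.head<6>() = q;                       // first six entries: the input
>     w[6] = q.segment<3>(1).squaredNorm();  // remaining 7: derived values
>     // ... w[7] to w[12] computed analogously ...
>     return w;
>   }
>
>   // "Jac" benchmarks: dynamically sized derivative vector with a fixed
>   // max-size of 6 * 24 + 1 = 145, so no heap allocation is required.
>   using DerivVec = Eigen::Matrix<double, Eigen::Dynamic, 1, 0, 6 * 24 + 1, 1>;
>   using ADScalar = Eigen::AutoDiffScalar<DerivVec>;
>   using ADInput  = Eigen::Matrix<ADScalar, 6, 1>;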
> The next set of benchmarks is the residual computation (which performs
> multiple loops over a mesh with various computations). One (small) part
> of this is the Closure stuff from the first two benchmarks.
>
> The final set is the computation / approximation of the derivative using
> AutoDiffScalar instead of double on a local level.
>
> For these benchmarks (Res and Jac), the sizes of the vectors & matrices
> increase roughly with each set of equations (Euler <= NavierStokes <
> RANSSAneg < ...).
>
> For the first micro-benchmark, Closure/AugmentGF, we seem to be hitting a
> pathological case in the partial vectorization, as disabling it seems to
> fix the problem. The input vector contains 6 doubles, the output 13.
> Remember, this mainly uses individual values or segment<3> access to the
> momentum sub-vector.
>
> The bigger "Res" benchmarks are generally a bit slower (though not
> dramatically so), but more so with gcc.
>
> The even bigger "Jac" benchmarks do see big slow-downs (performance 0.7
> down to 0.3), which seem to get worse as we include more (and larger)
> AutoDiffScalars. We didn't see this relative decrease for the otherwise
> roughly similar "Res" benchmarks, so something seems to be strange there
> (either in AutoDiffScalar, or in the handling of custom scalar types
> itself). I'm wondering whether some Eigen-internal operations may
> accidentally (or purposely) create more (temporary) copies of scalars,
> which for a custom type might be costly...
>
> Unfortunately, providing minimized, self-contained reproducers exhibiting
> the same behavior is quite difficult... :(
>
> I will try to work on figuring out a reproducer for the first
> micro-benchmark where partial vectorization has a negative effect.
>
> If anyone has any ideas what could be happening (or things they would
> like me to try), I'm all ears. We really would like to move to a current
> Eigen version!
>
> Best regards
>
> Daniel Vollmer
>
> --------------------------
> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
> German Aerospace Center
> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108
> Braunschweig | Germany
>
> Daniel Vollmer | AS C²A²S²E
> www.DLR.de
>
> ________________________________________
> From: Christoph Hertzberg <[email protected]>
> Sent: Monday, 19 April 2021 10:14:15
> To: [email protected]
> Subject: [eigen] Eigen 3.4 release candidate 1!
>
> Hello Eigen users!
>
> We are happy to announce the release of 3.4-rc1. This is a release
> candidate which is considered feature-complete and will only get bug
> fixes until 3.4.0 is released. This is not yet a stable release!
>
> We encourage everyone, especially anyone currently working with a 3.3
> version, to try upgrading to 3.4-rc1 and report any regressions with the
> compilers and architectures you are using (this includes run-time errors,
> compile errors, and compiler warnings).
> Eigen should work with any C++ standard between C++03 and C++20, with GCC
> 4.8 (or later), Clang 3.3 (or later), MSVC 2012 (or later), and recent
> versions of CUDA, HIP, and SYCL, on any target platform. If enabled by
> the compiler, this includes SIMD vectorization for SSE/AVX/AVX512,
> AltiVec, MSA, NEON, SVE, and ZVector.
>
> To publish test results, you can run the unit tests as described here:
> https://eigen.tuxfamily.org/index.php?title=Tests
> Please report issues by April 25th, or tell us if you are still in the
> process of testing.
>
> The 3.4 release will be the last Eigen version with C++03 support.
> Support for the 3.3 branch may be discontinued soon.
> The next major version will certainly also stop supporting some older
> compiler versions.
>
> Regarding the version numbering:
> There was previously a 3.4 branch which was recently deleted again, as it
> had been created prematurely and not properly kept up to date.
> Unfortunately, this means that the version number (specified in
> Eigen/src/Core/util/Macros.h) had to be reset to 3.3.91 (the final 3.4
> release will have version 3.4.0). Version test macros will work
> incorrectly with any commit from master between 2021-02-17 and today. We
> apologize for any inconvenience this may cause.
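>
> If your code checks the Eigen version via the macros in Macros.h, note
> that 3.4-rc1 identifies itself as 3.3.91. A minimal sketch of such a
> check:
>
>   #include <Eigen/Core>
>
>   // EIGEN_VERSION_AT_LEAST compares EIGEN_WORLD/MAJOR/MINOR_VERSION;
>   // 3.4-rc1 carries version 3.3.91, not yet 3.4.0.
>   #if EIGEN_VERSION_AT_LEAST(3, 3, 91)
>   // code paths relying on 3.4 features
>   #endif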
>
> Cheers,
> Christoph
>
> --
> Dr.-Ing. Christoph Hertzberg
> DFKI GmbH
> Robotics Innovation Center
> Robert-Hooke-Straße 1
> 28359 Bremen, Germany
>
> Tel.: +49 421 178 45-4021
> E-Mail: [email protected]
> More information: http://www.dfki.de/robotik