Hi Daniel,

Thanks a lot for the update and your continued help with this!

Rasmus

On Fri, Apr 23, 2021 at 3:23 AM <[email protected]> wrote:
>
> Hello,
>
> After a lot of measuring, profiling, and digging, a small update:
>
> The Closure benchmark slow-down has been reported (with a reproducer) as 
> https://gitlab.com/libeigen/eigen/-/issues/2225 .
>
> Interestingly, this slow-down also seems to affect cases where we're mainly 
> working with vectors / matrices of AutoDiffScalar<Dynamic Matrix with 
> large in-line storage>; see, for example, the difference in Jac/FV/RANSSAneg 
> between "clang 3.4-587a6915" and "clang 3.4-587a6915 EIGEN_UNALIGNED_VECTORIZE=0". 
> So this probably isn't just due to the compiler being better able to keep 
> values in registers (the AutoDiffScalars are way too large for that anyway).
>
> One large problem that caused a big discrepancy in the Jac/ tests was naive 
> copying of (unused) storage for dynamically sized objects, reported as 
> https://gitlab.com/libeigen/eigen/-/issues/2229 and already fixed. As you can 
> see, the fix improves performance for the Jac/ tests a lot. Thank you!
>
> Some cases are still a fair bit worse (Res/FV/Euler, Res/FV/RANSSAneg, 
> Res/DG/Euler, Jac/FV/NavierStokes) but overall it's not looking as bad as it 
> used to. :)
>
> I'm still investigating some crashes when using 3.4 (which are not there with 
> 3.2) but I cannot yet say whether that is a problem in Eigen 3.4 or somewhere 
> else.
>
>
>
> Best regards
>
> Daniel Vollmer
>
> --------------------------
> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
> German Aerospace Center
> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 
> Braunschweig | Germany
>
> Daniel Vollmer | AS C²A²S²E
> www.DLR.de
>
> ________________________________________
> From: [email protected] <[email protected]>
> Sent: Monday, 19 April 2021 16:27:51
> To: [email protected]
> Subject: Re: [eigen] Eigen 3.4 release candidate 1!
>
> Hello all,
>
> I've compiled our CFD code with Eigen 3.4-rc1 and run a few benchmarks 
> relative to Eigen 3.2.9.
>
> Unfortunately, the negative performance impact introduced by Eigen 3.3 is 
> still present (which is why we're still hanging on to the 3.2 branch).
>
> I've compiled our code with Eigen 3.2.9 as well as 3.4-rc1, with both clang 
> and gcc, on a macOS x86-64 system (Big Sur) with the following CPU feature 
> flags:
> machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE 
> MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 
> PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 
> SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 
> RDRAND F16C
> machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 FDPEO SMEP 
> BMI2 ERMS INVPCID PQM FPU_CSDS MPX PQE AVX512F AVX512DQ RDSEED ADX SMAP 
> CLFSOPT CLWB IPT AVX512CD AVX512BW AVX512VL PKU AVX512VNNI MDCLEAR IBRS STIBP 
> L1DF ACAPMSR SSBD
>
> Compilation settings were: -std=c++17 -fopenmp -Ofast -march=native 
> -fno-finite-math-only -DEIGEN_DONT_PARALLELIZE=1
> (we split our work up among threads ourselves, but the benchmarks are 
> single-threaded anyway).
>
> The clang version was homebrew's llvm: stable 11.1.0 (bottled); gcc was 
> homebrew's gcc: stable 10.2.0 (bottled).
>
> We're using Eigen for small, fixed-size vectors and occasionally matrices of 
> doubles (of varying lengths, e.g. 5, 6, 7, 8, 13, ...), mainly accessing 
> individual elements, or fixed-size (length 3) segments at compile-time known 
> offsets (indices). Occasionally, we also have matrix-vector products of 
> these, but they probably play a smaller role. These are the "Res" (Residual 
> computation) benchmarks, where we do this over a whole mesh with multiple 
> loops, gradient computations, ...
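>
> To give a rough idea, a minimal sketch of this access pattern (type and 
> function names are made up for illustration, not our actual code):
>
> #include <Eigen/Dense>
>
> // Hypothetical state vector of a PDE with 5 variables (actual sizes vary:
> // 5, 6, 7, 8, 13, ...).
> using State = Eigen::Matrix<double, 5, 1>;
>
> double kineticEnergy(const State &u)
> {
>     // Individual element access plus a fixed-size segment at a
>     // compile-time known offset (here: the momentum sub-vector).
>     const double rho = u[0];
>     const Eigen::Matrix<double, 3, 1> momentum = u.segment<3>(1);
>     return 0.5 * momentum.squaredNorm() / rho;
> }
>
> // Small matrix-vector products also occur occasionally, e.g.
> //   Eigen::Matrix<double, 5, 5> A;  State v = A * u;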
>
> Then we do the same thing but use Eigen's (unsupported) AutoDiffScalar using 
> a fairly big fixed-max-size (e.g. when our PDE has 6 state variables, then 
> AutoDiffScalar has Eigen::Dynamic derivatives with a max-size of 6 * 24 + 1). 
> These types are quite large, but it's still faster than fully dynamic heap 
> allocation. These are the "Jac" (Jacobian computation) benchmarks, which 
> otherwise largely mimic the "Res" benchmarks.
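>
> Concretely, the scalar type looks roughly like this (a sketch; the actual 
> max-size depends on the equation set):
>
> #include <unsupported/Eigen/AutoDiff>
>
> // Dynamically sized derivative vector with a compile-time maximum, so no
> // heap allocation is needed (here: 6 state variables, 6 * 24 + 1 = 145).
> using DerivativeType =
>     Eigen::Matrix<double, Eigen::Dynamic, 1, 0, 6 * 24 + 1, 1>;
> using ADScalar = Eigen::AutoDiffScalar<DerivativeType>;
>
> // The "Jac" benchmarks then use e.g. Eigen::Matrix<ADScalar, 6, 1>
> // where the "Res" benchmarks use Eigen::Matrix<double, 6, 1>.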
>
> https://www.maven.de/stuff/coda_benchmarks_eigen.pdf
>
> In these graphs, everything is relative to the clang Eigen 3.2.9 baseline, 
> which is 1.0. Higher numbers are faster (2.0 would be twice as fast as the 
> build using clang with Eigen 3.2.9).
>
> The first two benchmarks (Closure/...) are actual micro-benchmarks. In the 
> first one, we have an input vector with 6 doubles, and from that compute a 
> vector with 13 doubles (the first six are the same as the input, and the 
> remaining 7 are derived values computed from those input values). The 2nd 
> version also augments gradients, which means it also has input gradients 
> (Matrix<Matrix<double, 3, 1>, 6, 1>) and outputs augmented gradients 
> (Matrix<Matrix<double, 3, 1>, 13, 1>).
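>
> As a rough sketch of the interface (the names are made up, but the sizes 
> match the description above):
>
> #include <Eigen/Dense>
>
> using Gradient = Eigen::Matrix<double, 3, 1>;
>
> // Closure: 6 input doubles -> 13 output doubles (the first 6 are copied
> // through, the remaining 7 are derived from them).
> Eigen::Matrix<double, 13, 1> closure(const Eigen::Matrix<double, 6, 1> &in);
>
> // Closure/AugmentGF additionally augments gradients:
> // 6 input gradients -> 13 output gradients.
> Eigen::Matrix<Gradient, 13, 1> augmentGradients(
>     const Eigen::Matrix<double, 6, 1> &in,
>     const Eigen::Matrix<Gradient, 6, 1> &inGrad);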
>
> The next set of benchmarks is the residual computation (which performs 
> multiple loops over a mesh with various computations). One (small) part of 
> this is the Closure-stuff from the first two benchmarks.
>
> The final set is the computation / approximation of the derivative using 
> AutoDiffScalar instead of double on a local level.
>
> For these benchmarks (Res and Jac), the sizes of the vectors & matrices 
> roughly increase with each set of equations (Euler <= NavierStokes < 
> RANSSAneg < ...).
>
> For the first microbenchmark, Closure/AugmentGF, we seem to be hitting a 
> pathological case in the partial vectorization, as disabling it seems to fix 
> the problem. The input vector contains 6 doubles, the output 13. Remember that 
> this mainly uses individual values or segment<3> access to the momentum 
> sub-vector.
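>
> For reference, disabling the partial vectorization for such a test build just 
> means defining the corresponding macro before the first Eigen include (or, 
> equivalently, passing -DEIGEN_UNALIGNED_VECTORIZE=0 on the compiler command 
> line):
>
> // Disable Eigen's unaligned (partial) vectorization paths; this must be
> // defined before any Eigen header is included.
> #define EIGEN_UNALIGNED_VECTORIZE 0
> #include <Eigen/Dense>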
>
> The bigger "Res" benchmarks are generally a bit slower (though not 
> dramatically so), and more so with gcc.
>
> The even bigger "Jac" benchmarks do see big slow-downs (performance 0.7 down 
> to 0.3), which seem to get worse as we include more (and larger) 
> AutoDiffScalars. We didn't see this relative decrease for the otherwise 
> roughly similar "Res" benchmarks, so something seems to be off there (either 
> in AutoDiffScalar, or in the use of custom scalar types in general). I'm 
> wondering whether some Eigen-internal operations may accidentally (or 
> purposely) create more (temporary) copies of scalars, which can be costly for 
> a custom type...
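>
> One way I might try to check this is a copy-counting wrapper scalar; below is 
> only a sketch (a fully Eigen-compatible scalar additionally needs the usual 
> arithmetic operators and a NumTraits specialization, which I've left out):
>
> #include <cstddef>
>
> // Counts copy constructions and copy assignments, to spot unexpected
> // temporaries when used in place of double.
> struct CountingScalar
> {
>     double value = 0.0;
>     static inline std::size_t copies = 0;   // C++17 inline variable
>
>     CountingScalar() = default;
>     CountingScalar(double v) : value(v) {}
>     CountingScalar(const CountingScalar &other) : value(other.value)
>     {
>         ++copies;
>     }
>     CountingScalar &operator=(const CountingScalar &other)
>     {
>         value = other.value;
>         ++copies;
>         return *this;
>     }
> };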
>
>
> Unfortunately providing minimized, self-contained reproducers exhibiting the 
> same behavior is quite difficult... :(
>
> I will try to work on figuring out a reproducer for the first microbenchmark 
> where partial vectorization has a negative effect.
>
>
> If anyone has any ideas what could be happening (or things they would like me 
> to try), I'm all ears. We really would like to move to a current Eigen 
> version!
>
>
> Best regards
>
> Daniel Vollmer
>
> --------------------------
> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
> German Aerospace Center
> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 
> Braunschweig | Germany
>
> Daniel Vollmer | AS C²A²S²E
> www.DLR.de
>
> ________________________________________
> From: Christoph Hertzberg <[email protected]>
> Sent: Monday, 19 April 2021 10:14:15
> To: [email protected]
> Subject: [eigen] Eigen 3.4 release candidate 1!
>
> Hello Eigen users!
>
> We are happy to announce the release of 3.4-rc1. This is a release
> candidate which is considered feature complete and will only receive bug
> fixes until 3.4.0 is released.
> This is not yet a stable release!
>
> We encourage everyone, especially those currently working with a 3.3
> version, to try upgrading to 3.4-rc1 and to report any regressions with the
> compilers and architectures you are using (this includes run-time
> errors, compile errors, and compiler warnings).
> Eigen should work with any C++ standard between C++03 and C++20; with
> GCC 4.8 (or later), Clang 3.3 (or later), MSVC 2012 (or later), and
> recent versions of CUDA, HIP, and SYCL; and on any target platform. If
> enabled by the compiler, this includes SIMD vectorization for SSE/AVX/AVX512,
> AltiVec, MSA, NEON, SVE, and ZVector.
>
> To publish test results, you can run the unit tests as described here:
> [https://eigen.tuxfamily.org/index.php?title=Tests]
> Please report issues by April 25th, or tell us if you are still in
> the process of testing.
>
>
> The 3.4 release will be the last Eigen version with C++03 support.
> Support for the 3.3 branch may be discontinued soon.
>
> The next major version will certainly also stop supporting some older
> compiler versions.
>
>
> Regarding the version numbering:
> There was previously a 3.4 branch, which was recently deleted again, as it
> had been created prematurely and not properly kept up to date.
> Unfortunately, this means that the version number (specified in
> Eigen/src/Core/util/Macros.h) had to be reset to 3.3.91 (the final 3.4
> release will have version 3.4.0). Version test macros will work
> incorrectly with any commit from master between 2021-02-17 and today. We
> apologize for any inconvenience this may cause.
>
>
> Cheers,
> Christoph
>
>
>
>
>
> --
>   Dr.-Ing. Christoph Hertzberg
>
>   Visiting address of the branch office:
>   DFKI GmbH
>   Robotics Innovation Center
>   Robert-Hooke-Straße 5
>   28359 Bremen, Germany
>
>   Postal address of the head office, Bremen site:
>   DFKI GmbH
>   Robotics Innovation Center
>   Robert-Hooke-Straße 1
>   28359 Bremen, Germany
>
>   Tel.:     +49 421 178 45-4021
>   Switchboard: +49 421 178 45-0
>   E-Mail:   [email protected]
>
>   Further information: http://www.dfki.de/robotik
>    -------------------------------------------------------------
>    Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
>    Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
>
>    Management:
>    Prof. Dr. Antonio Krüger
>
>    Chairman of the Supervisory Board:
>    Dr. Gabriël Clemens
>    Amtsgericht Kaiserslautern, HRB 2313
>    -------------------------------------------------------------
>
>

