Hi Mo, I love your footnotes.

My understanding is that IBM's "POWER9" CPUs 1.) have SIMD instructions[1]
and 2.) are used by the new, and very cool, open *hardware* Talos II
workstations[2], which 3.) already run Debian.

FYI,

Kingsley

[1] POWER9 https://en.wikipedia.org/wiki/POWER9
[2] Introducing Talos II https://www.raptorcs.com/TALOSII/

On 02/08/2019 16:25, Mo Zhou wrote:
> Hi folks,
>
> For most programs the "-march=native" option is not expected to bring any
> significant performance improvement. However for some scientific applications
> this proposition doesn't hold. When I was creating the tensorflow debian
> package, I observed a significant performance gap between generic code and
> kabylake (Intel 7XXX Series) code[1].
>
> The significant improvement in performance basically stems from the Eigen
> library (header only numerical linear algebra library). Here is a simple
> example[2] for demonstrating the performance gap[3] between different ISA
> baselines. (elapsed time is roughly measured with "perf stat ...")
>
> Having seen such interesting results, I immediately created a Debian partial
> fork named SIMDebian (SIMD + Debian)[0]. It makes great sense for some
> applications due to the significant performance gain brought by SIMD code.
> Currently this partial fork is still in the very early stage, and it needs
>
> * More experience about software that benefits a lot from SIMD code
>   (e.g. What package would potentially benefit from SIMD code?)
> * Suggestions and comments
>   (e.g. Is such a partial fork really useful and valuable?)
> * More people interested in this
>
> SIMDebian is only a PARTIAL fork, which means that it only takes care of
> packages that would obviously benefit from SIMD code, because no performance
> gain is expected in terms of the majority of packages in the Debian archive.
>
> Generally speaking, in order to bump the ISA baseline for a given package, one
> could add the -march=xxx flag to {C,CXX,F}FLAGS by modifying debian/rules.
> However SIMDebian employs a more economical approach to this end: forking
> dpkg[5] and injecting the -march=xxx flag into the system default flag list.
> With the resulting dpkg package, most Debian packages could be rebuilt with
> a bumped ISA baseline without any code modification.
>
> I think the Debian Science team is interested in this partial fork as well.
> In the past there was a highly related GSoC project[4] (in my fuzzy memory,
> the topic that led to the creation of the GSoC project was raised by me).
> However, for some reason (I forgot it) it didn't start.
>
> This is the first time I have tried to fork Debian and apparently I have no
> experience in running a fork. I need comments, especially from the Debian
> Science Team. Any response/pointer would be much appreciated!
>
> P.S. SIMDebian has an alias: SIGILLbian (SIGILL + Debian).
> -------------------------------------------------------------------------------
>
> [0] https://github.com/SIMDebian/SIMDebian
>
> [1]
> https://github.com/SIMDebian/SIMDebian/blob/master/benchmarks/tensorflow.md
>
> [2] ```c++
> #include <iostream>
> #include <Eigen/Dense>
> using namespace std;
>
> #define N 4096
> int main(void)
> {
>   auto A = Eigen::MatrixXd::Random(N, N);
>   auto B = Eigen::MatrixXd::Random(N, N);
>   auto C = A * B;
>   //cout << A << endl << B << endl << C << endl;
>   (void) C(0,0);
>   return 0;
> }
> ```
>
> [3] ``` (command-line) (perf-stat-elapsed-time)
> CPU: Intel I5-7440HQ
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake \
>   -DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
> 1.275162977 (seconds)
>
> g++ a.cc -I/usr/include/eigen3 -O2 \
>   -DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
> 1.382608279
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake -fopenmp
> 1.460047514
>
> g++ a.cc -I/usr/include/eigen3 -O3 -march=skylake -fopenmp
> 1.313478657
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=haswell -fopenmp
> 1.334523068
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=sandybridge -fopenmp
> 1.988947143
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=nehalem -fopenmp
> 3.099827038
>
> g++ a.cc -I/usr/include/eigen3 -O2 -march=x86-64 -fopenmp
> 3.106337852
>
> However, please note that Eigen's fastest result is still much slower
> than OpenBLAS, even if Eigen called MKL:
>
> ~ ❯❯❯ julia -e 'A = rand(Float64, 4096, 4096); A*A; @time A*A;'
>   1.011168 seconds (6 allocations: 128.000 MiB, 2.69% gc time)
>
> BLAS optimization is another story. Omitted here.
> ```
>
> [4] https://wiki.debian.org/SummerOfCode2017/Projects/Benchmarking
>
> [5] https://github.com/SIMDebian/dpkg
>     Currently this fork targets "haswell" due to the availability of AVX2.
>     Only a minor modification to my patch is required to further bump the
>     baseline to e.g. icelake (AVX512).
>

-- 
Time is the fire in which we all burn.