[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs

2018-02-08 Thread patrikhuber at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280

--- Comment #14 from Patrik Huber  ---
It even seems a few percent slower after the FDO stuff. But `-fprofile-use`
is a bit weird. If there is no .gcda file, it doesn't complain.
If you give it a file that doesn't exist (e.g. -fprofile-use=foo), then it
doesn't complain either. So how can I check whether it really ran the FDO?

--- Comment #13 from Patrik Huber  ---
>> Did you try with FDO?  (-fprofile-generate, run, -fprofile-use)

I just tried this with g++-7. It didn't help, the final executable has the same
slower run time as in the attached log without the FDO.

--- Comment #10 from Patrik Huber  ---
Created attachment 43367
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43367&action=edit
gcc5_gemm_test.ii

--- Comment #11 from Patrik Huber  ---
Created attachment 43368
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43368&action=edit
gcc7_gemm_test.ii

--- Comment #8 from Patrik Huber  ---
Created attachment 43366
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43366&action=edit
full_log.txt

--- Comment #7 from Patrik Huber  ---
Created attachment 43365
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43365&action=edit
gemm_test.cpp

--- Comment #6 from Patrik Huber  ---
I could also upload the .ii files for you, but they are 5 MB each, which the
bugtracker doesn't allow (1 MB limit).

--- Comment #5 from Patrik Huber  ---
Created attachment 43364
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43364&action=edit
gcc7_gemm_test.s

--- Comment #4 from Patrik Huber  ---
Created attachment 43363
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43363&action=edit
gcc5_gemm_test.s

--- Comment #3 from Patrik Huber  ---
@Richard: I'm not 100% sure what you mean by "preprocessed source", but I
googled it and you probably mean the output of compiling with "-c -save-temps".

Please see attached.

[Bug c++/84280] New: Performance regression in g++-7 with Eigen for non-AVX2 CPUs

2018-02-08 Thread patrikhuber at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280

            Bug ID: 84280
           Summary: Performance regression in g++-7 with Eigen for non-AVX2 CPUs
           Product: gcc
           Version: 7.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: patrikhuber at gmail dot com
  Target Milestone: ---

Hello,

I noticed today what looks like quite a large performance regression in
Eigen (3.3.4) matrix multiplication. It only seems to occur on non-AVX2
code paths: if I compile with -march=native on my Core i7 with AVX2, it is
blazingly fast with both g++ versions, but not on an older Core i5 with
only AVX, or if I use -march=core2.

Here are some example timings, but it applies to all matrix sizes that the
benchmark script tests (see end of the message for the code):

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc5_gemm_test

1124 1215 1465
elapsed_ms: 1970

1730 1235 1758
elapsed_ms: 3505

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -march=core2 -o gcc7_gemm_test

1124 1215 1465
elapsed_ms: 2998

1730 1235 1758
elapsed_ms: 4628

It's even worse if I test this on an i5-3550, which has AVX, but not AVX2:

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 941

1730 1235 1758
elapsed_ms: 1780


g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test

1124 1215 1465
elapsed_ms: 1988

1730 1235 1758
elapsed_ms: 3740

I tried the same with -O2 and it gave the same results. That's a drop to
nearly half the speed in matrix multiplication on AVX CPUs. Or maybe I've
done something wrong. :-) I realise the benchmark might be a bit crude
(better use Google Benchmark or something like that...) But the results I'm
getting are pretty consistent on various CPUs, compilers, and with various
flags.


=== Benchmark code:
// gemm_test.cpp
#include <chrono>
#include <iostream>
#include <string>
#include <vector>
#include <Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (std::size_t i = 0; i < 10; ++i)
    {
        Mat a_rm(s1, s2);
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    // count() is in nanoseconds with libstdc++; convert to milliseconds
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    //std::random_device rd;
    //std::mt19937 gen(0);
    //std::uniform_int_distribution<> dis(1, 2048);
    std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116,
                              1736, 868, 1278, 1323, 788 };
    // Each iteration consumes three consecutive values as the matrix sizes.
    for (std::size_t i = 0; i < 12; ++i)
    {
        int s1 = vals[i++];//dis(gen);
        int s2 = vals[i++];//dis(gen);
        int s3 = vals[i];//dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "" << std::endl;
    }
    return 0;
}
===