[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread xry111 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #13 from Xi Ruoyao  ---
(In reply to David Malcolm from comment #10)
> (In reply to Jan Hubicka from comment #4)
> > I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> > but he doesn't.
> 
> Possibly a silly question, but how about changing the default in GCC 15? 
> What proportion of users actually make use of -fsemantic-interposition ?

At least when building Glibc with -fno-semantic-interposition, several tests
fail.  I've not yet figured out whether those are test-suite issues or real
issues, though.

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #12 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #11)
> (In reply to David Malcolm from comment #10)
> > (In reply to Jan Hubicka from comment #4)
> > > I keep mentioning to Larabel that he should use 
> > > -fno-semantic-interposition,
> > > but he doesn't.
> > 
> > Possibly a silly question, but how about changing the default in GCC 15? 
> > What proportion of users actually make use of -fsemantic-interposition ?
> 
> See https://inbox.sourceware.org/gcc-patches/ri6czn5z8mw@suse.cz/ for
> previous discussion on this.

Sorry, the link I meant was:
https://inbox.sourceware.org/gcc-patches/20210606231215.49899-1-mask...@google.com/

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #11 from Andrew Pinski  ---
(In reply to David Malcolm from comment #10)
> (In reply to Jan Hubicka from comment #4)
> > I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> > but he doesn't.
> 
> Possibly a silly question, but how about changing the default in GCC 15? 
> What proportion of users actually make use of -fsemantic-interposition ?

See https://inbox.sourceware.org/gcc-patches/ri6czn5z8mw@suse.cz/ for
previous discussion on this.

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread dmalcolm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

David Malcolm  changed:

   What|Removed |Added

 CC||dmalcolm at gcc dot gnu.org

--- Comment #10 from David Malcolm  ---
(In reply to Jan Hubicka from comment #4)
> I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> but he doesn't.

Possibly a silly question, but how about changing the default in GCC 15?  What
proportion of users actually make use of -fsemantic-interposition?

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #9 from Jan Hubicka  ---
Phoronix still claims the difference
https://www.phoronix.com/review/gcc14-clang18-amd-zen4/2

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #8 from Richard Biener  ---
Created attachment 57006
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57006&action=edit
unroll heuristics

this one

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #7 from Richard Biener  ---
IMO it should be purely growth/unrolled-insns bound; the bound on the actual
number of unrolled iterations is somewhat artificial (it exists to avoid really
large unrolls when we estimate the unrolled body to be zero size and thus never
hit any of the other limits).  That said, I think we should get better at
estimating growth - I don't think we account for the reads from the constant
arrays getting elided?  (Though that's not always an optimal thing.)

See the proposal on better estimation I had last year.
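
As a made-up illustration (not code from SMHasher or the report): with a
constant trip count of 8, well below the default --param
max-completely-peel-times of 16, -O3 completely peels a loop like the one
below and the constant-array reads fold into literal shift counts - which is
exactly what does not happen for the 24-iteration Rho Pi loop in keccakf.

/* Hypothetical sketch, not from the bug report.  With gcc -O3 the
   8-iteration loop is completely peeled (8 < the default
   max-completely-peel-times of 16, and the tiny body stays under the
   insn-growth limits), so each rot[i] read folds to a constant.  The
   24-iteration loops in keccakf() exceed the iteration limit and stay
   rolled. */
static const unsigned rot[8] = { 1, 3, 6, 10, 15, 21, 28, 36 };

unsigned long long sum_shifted(unsigned long long x)
{
   unsigned long long r = 0;
   for (int i = 0; i < 8; i++)
      r += x << rot[i];   /* becomes x << 1, x << 3, ... after peeling */
   return r;
}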

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #6 from Jan Hubicka  ---
The internal loops are:

static const unsigned keccakf_rotc[24] = {
   1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 27, 41, 56, 8, 25, 43, 62, 18,
   39, 61, 20, 44
};

static const unsigned keccakf_piln[24] = {
   10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 15, 23, 19, 13, 12, 2, 20, 14,
   22, 9, 6, 1
};

static void keccakf(ulong64 s[25])
{
   int i, j, round;
   ulong64 t, bc[5];

   for(round = 0; round < SHA3_KECCAK_ROUNDS; round++) {
      /* Theta */
      for(i = 0; i < 5; i++)
         bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];

      for(i = 0; i < 5; i++) {
         t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1);
         for(j = 0; j < 25; j += 5)
            s[j + i] ^= t;
      }
      /* Rho Pi */
      t = s[1];
      for(i = 0; i < 24; i++) {
         j = keccakf_piln[i];
         bc[0] = s[j];
         s[j] = ROL64(t, keccakf_rotc[i]);
         t = bc[0];
      }
      /* Chi */
      for(j = 0; j < 25; j += 5) {
         for(i = 0; i < 5; i++)
            bc[i] = s[j + i];
         for(i = 0; i < 5; i++)
            s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
      }
      s[0] ^= keccakf_rndc[round];
   }
}

I suppose with complete unrolling this would constant-propagate, partly stay in
registers, and fold.  I think increasing the default limits, especially at -O3,
may make sense.  The value of 16 has been there for a very long time (I think
since the initial implementation).
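
For illustration, a hand-written sketch (not compiler output) of what the
completely peeled Rho Pi loop would reduce to: the indices and rotate counts
come straight from keccakf_piln/keccakf_rotc, so the array loads disappear
entirely.

/* Sketch only; ulong64 and ROL64 stand in for SMHasher's definitions
   (their exact spelling here is an assumption). */
typedef unsigned long long ulong64;
#define ROL64(x, n) (((x) << (n)) | ((x) >> (64 - (n))))

void rho_pi_peeled(ulong64 s[25])
{
   ulong64 t = s[1], tmp;
   /* i = 0: j = keccakf_piln[0] = 10, keccakf_rotc[0] = 1 */
   tmp = s[10]; s[10] = ROL64(t, 1); t = tmp;
   /* i = 1: j = 7, rotc = 3 */
   tmp = s[7];  s[7]  = ROL64(t, 3); t = tmp;
   /* i = 2: j = 11, rotc = 6 */
   tmp = s[11]; s[11] = ROL64(t, 6); t = tmp;
   /* ... and so on through i = 23 ... */
}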

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka  changed:

   What|Removed |Added

Summary|SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark
   |is almost 40% slower vs.|is almost 40% slower vs.
   |Clang   |Clang (not enough complete
   ||loop peeling)

--- Comment #5 from Jan Hubicka  ---
On my Zen 3 machine the default build gets me 180 MB/s.
-O3 -flto -funroll-all-loops gets me 193 MB/s.
-O3 -flto --param max-completely-peel-times=30 gets me 382 MB/s; the speedup is
gone with --param max-completely-peel-times=20.  The default is 16.

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #4 from Jan Hubicka  ---
I keep mentioning to Larabel that he should use -fno-semantic-interposition,
but he doesn't.

Profile is very simple:

 96.75%  SMHasher  [.] keccakf.lto_priv.0

It all goes to a simple loop.  On Zen 3 with GCC 13 -march=native -Ofast -flto
I get:

  3.85 │330:   mov    %r8,%rdi
  7.68 │       movslq (%rsi,%r9,1),%rcx
  3.85 │       lea    (%rax,%rcx,8),%r10
  3.86 │       mov    (%rdx,%r9,1),%ecx
  3.83 │       add    $0x4,%r9
  3.86 │       mov    (%r10),%r8
  7.37 │       rol    %cl,%rdi
  7.37 │       mov    %rdi,(%r10)
  4.76 │       cmp    $0x60,%r9
  0.00 │     ↑ jne    330


Clang seems to unroll it:

  0.25 │ d0:   mov  -0x48(%rsp),%rdx
  0.25 │       xor  %r12,%rcx
  0.25 │       mov  %r13,%r12
  0.25 │       mov  %r13,0x10(%rsp)
  0.25 │       mov  %rax,%r13
  0.26 │       xor  %r15,%r13
  0.23 │       mov  %r11,-0x70(%rsp)
  0.25 │       mov  %r8,0x8(%rsp)
  0.25 │       mov  %r15,-0x40(%rsp)
  0.25 │       mov  %r10,%r15
  0.26 │       mov  %r10,(%rsp)
  0.26 │       mov  %r14,%r10
  0.25 │       xor  %r12,%r10
  0.26 │       xor  %rsi,%r15
  0.24 │       mov  %rbp,-0x80(%rsp)
  0.25 │       xor  %rcx,%r15
  0.26 │       mov  -0x60(%rsp),%rcx
  0.25 │       xor  -0x68(%rsp),%r15
  0.26 │       xor  %rbp,%rdx
  0.25 │       mov  -0x30(%rsp),%rbp
  0.25 │       xor  %rdx,%r13
  0.24 │       mov  -0x10(%rsp),%rdx
  0.25 │       mov  %rcx,%r12
  0.24 │       xor  %rcx,%r13
  0.25 │       mov  $0x1,%ecx
  0.25 │       xor  %r11,%rdx
  0.24 │       mov  %r8,%r11
  0.25 │       mov  -0x28(%rsp),%r8
  0.26 │       xor  -0x58(%rsp),%r8
  0.24 │       xor  %rdx,%r8
  0.26 │       mov  -0x8(%rsp),%rdx
  0.25 │       xor  %rbp,%r8
  0.26 │       xor  %r11,%rdx
  0.25 │       mov  -0x20(%rsp),%r11
  0.25 │       xor  %rdx,%r10

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang

2024-01-04 Thread xry111 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Xi Ruoyao  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||xry111 at gcc dot gnu.org
   Last reconfirmed||2024-01-04
Summary|SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark
   |is almost 40% slower vs.|is almost 40% slower vs.
   |Clang on AMD Zen 4  |Clang
 Status|UNCONFIRMED |NEW

--- Comment #3 from Xi Ruoyao  ---
GCC trunk still gets around 200 MiB/s (on a Tiger Lake, though I've not used
-march) with -fno-semantic-interposition.

Confirmed.  I'm removing "on xxx" from the subject, as the uarch seems
irrelevant.

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang on AMD Zen 4

2024-01-04 Thread xry111 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #2 from Xi Ruoyao  ---
The test file can be downloaded from
http://phoronix-test-suite.com/benchmark-files/smhasher-20220822.tar.xz.  Just
build it with cmake and run "./SMHasher --test=Speed sha3-256".  The build
system enables -O3 and LTO by default.

With GCC 13 I get about 180 MiB/s, but Clang 17 produces 250 MiB/s.

Part of the difference is caused by the different -fsemantic-interposition
default; if I pass -fno-semantic-interposition, GCC 13 produces about 200
MiB/s.
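
For anyone unfamiliar with the flag, a minimal made-up illustration (file and
function names are invented, not from SMHasher) of why -fsemantic-interposition
matters for code compiled with -fPIC, e.g. for a shared library:

/* libexample.c -- hypothetical, compiled e.g. with
 *    gcc -O2 -fPIC -shared libexample.c -o libexample.so
 * Under GCC's default -fsemantic-interposition, the compiler must assume
 * hash_block() can be interposed at dynamic-link time (LD_PRELOAD, symbol
 * search order), so the call in hash_all() cannot be inlined and goes
 * through the PLT.  With -fno-semantic-interposition (effectively Clang's
 * default behaviour) the call can be inlined or bound locally. */
unsigned long hash_block(const unsigned char *p, unsigned long n)
{
   unsigned long h = 0;
   for (unsigned long i = 0; i < n; i++)
      h = h * 31 + p[i];
   return h;
}

unsigned long hash_all(const unsigned char *p, unsigned long n)
{
   return hash_block(p, n);   /* inlinable only without interposition */
}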

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang on AMD Zen 4

2024-01-04 Thread aros at gmx dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #1 from Artem S. Tashkinov  ---
Also valid for Meteor Lake (MTL):
https://www.phoronix.com/review/intel-meteorlake-gcc-clang/2