On Thu, 25 Jan 2024, 06:22 Hans Henrik Bergan, <divinit...@gmail.com> wrote:

> On Wed, 24 Jan 2024 at 17:59, Marco Pivetta <ocram...@gmail.com> wrote:
> >
> > Depends on the actual numbers: is there any way to make a comparison that
> > is relatively stable across architectures?
> >
> > Would it be feasible to start with the
> > cross-platform-let-the-compiler-do-its-job version (that somebody may
> > actually be capable of auditing), and then introduce other versions when
> > the jump is significant enough?
> >
>
> I don't know about "relatively stable across architectures", but I wrote
> some benchmarking code; keep reading.
>
>
>
> On Wed, 24 Jan 2024 at 17:55, tag Knife <fennic...@gmail.com> wrote:
> > Should we even be considering the specific instruction implementations?
> > I've always been in the "you are not smarter than the compiler" camp:
> > even the best human-written ASM code can be slower than the obscure
> > instructions the compiler might choose to use in a weird and wonderful
> > way.
>
> The BLAKE3 team is smarter than GCC 11.4. Even with -march=native
> -mtune=native, which is *not* commonly used in PHP,
> the compiler didn't stand a chance against the hand-optimized assembly
> versions.
>
> I wrote some benchmarks, but the TL;DR is:
> portable -O2 (usually used by PHP) managed 1126MB/s,
> portable -O2 -march=native managed 533MB/s (wtf? gcc obviously got
> something wrong here),
> hand-written -O2 SSE2 managed 3144MB/s,
> hand-written -O2 SSE41 managed 3332MB/s,
> hand-written -O2 AVX2 managed 6554MB/s,
> hand-written -O2 AVX512 managed 8913MB/s,
> on my AMD Ryzen 9 7950X.
> Benchmarking code:
> https://gist.github.com/divinity76/5729472dd5d77e94cd0acb245aac2226
> raw output:
> array(6) {
>   ["O2-portable-march"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(29295)
>     ["mb_per_second"]=>
>     float(533.3674688513398)
>   }
>   ["O2-portable"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(13876)
>     ["mb_per_second"]=>
>     float(1126.0449697319111)
>   }
>   ["O2-sse2"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(4969)
>     ["mb_per_second"]=>
>     float(3144.4958744214127)
>   }
>   ["O2-sse41"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(4688)
>     ["mb_per_second"]=>
>     float(3332.977815699659)
>   }
>   ["O2-avx2"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(2384)
>     ["mb_per_second"]=>
>     float(6554.1107382550335)
>   }
>   ["O2-avx512"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(1753)
>     ["mb_per_second"]=>
>     float(8913.291500285226)
>   }
> }
>

Oh yes, the AVX jump is impressive 😵
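
To put a number on that jump, here is a rough sketch that recomputes the throughput and the speedup over the portable -O2 build from the quoted raw output. The timings are copied from the var_dump above. Everything else is an assumption on my side: the reported mb_per_second values only line up if each timing covers 1000 x 16 KiB (16,384,000 bytes) and "MB" means MiB. That is inferred from the numbers, not confirmed by the benchmark script, and the names TOTAL_BYTES, mib_per_second and speedup_vs_portable are made up for this illustration.

```python
# Timings in microseconds, copied from the quoted var_dump output.
TIMINGS_US = {
    "O2-portable-march": 29295,
    "O2-portable": 13876,
    "O2-sse2": 4969,
    "O2-sse41": 4688,
    "O2-avx2": 2384,
    "O2-avx512": 1753,
}

# ASSUMPTION: each timing covers 1000 x 16 KiB of input. This is inferred
# from the reported throughput figures, not taken from the benchmark script.
TOTAL_BYTES = 1000 * 16 * 1024
MIB = 1024 * 1024


def mib_per_second(microseconds):
    """Throughput in MiB/s for TOTAL_BYTES hashed in the given time."""
    seconds = microseconds / 1_000_000
    return TOTAL_BYTES / seconds / MIB


def speedup_vs_portable(name):
    """How many times faster a build is than the portable -O2 build."""
    return TIMINGS_US["O2-portable"] / TIMINGS_US[name]


if __name__ == "__main__":
    for name, us in TIMINGS_US.items():
        print(f"{name:18s} {mib_per_second(us):8.1f} MiB/s "
              f"{speedup_vs_portable(name):5.2f}x")
```

Under those assumptions, AVX2 comes out at roughly 5.8x and AVX512 at roughly 7.9x the portable -O2 build, while the -march=native build lands at a bit under half of it.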
