Thanks for the results. It looks like I'll need to get access to some older
hardware and try some different combinations. There are a few things I can
tune (loading all 8 values at the start vs. loading one per fold, different
BUFSIZE values), and I'd be interested in finding a setup that definitely
offers an improvement across the board.

Did you test this with the first patch or the second patch? At a minimum,
cutting out the final table-based fold should be a consistent ~5%
improvement on any platform.

On Wed, Dec 25, 2024, 17:45 Michael Stone <mst...@debian.org> wrote:

> On Tue, Dec 24, 2024 at 11:52:38PM +0000, Pádraig Brady wrote:
> >However this is a regression on i7-5600U at least:
>
> I'm seeing the same on older consumer hardware, even after the latest
> patch (i3-6100):
>
> $ time ./cksum --debug /tmp/testfil
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m27.717s
> user    0m4.519s
> sys     0m23.173s
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: avx512 support not detected
> cksum_chorba: avx2 support not detected
> cksum_chorba: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m31.288s
> user    0m6.863s
> sys     0m24.404s
>
>
> on older server hardware (E3-1240 v5) I do see a slight improvement in
> user time, but the system time increases and *not once* did I see the
> overall runtime decrease (I did run them in the opposite order as well).
> Maybe this indicates that the change thrashes the CPU cache or somesuch?
>
> $ time ./cksum --debug /tmp/testfil
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m22.923s
> user    0m3.956s
> sys     0m18.867s
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: avx512 support not detected
> cksum_chorba: avx2 support not detected
> cksum_chorba: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m23.962s
> user    0m3.768s
> sys     0m20.165s
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: avx512 support not detected
> cksum_chorba: avx2 support not detected
> cksum_chorba: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m25.021s
> user    0m3.776s
> sys     0m21.235s
>
> $ time ./cksum --debug /tmp/testfil
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m23.961s
> user    0m4.160s
> sys     0m19.798s
>
>
> on older AMD server hardware it's closer; just as on the Intel hardware
> there's a decrease in user time and an increase in system time, but the
> results are close enough that it's a wash, with sometimes one being
> faster and sometimes the other:
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: avx512 support not detected
> cksum_chorba: avx2 support not detected
> cksum_chorba: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m14.509s
> user    0m5.083s
> sys     0m9.410s
>
> $ time ./cksum --debug /tmp/testfil
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m14.220s
> user    0m5.626s
> sys     0m8.578s
>
>
>
> cf. the same binaries on Zen 4 (EPYC 9354P), where the new code is a clear
> overall improvement:
>
> $ time ./cksum --debug /tmp/testfil
> cksum: using avx512 hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m9.396s
> user    0m1.720s
> sys     0m7.676s
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: using avx512 hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real    0m8.769s
> user    0m1.284s
> sys     0m7.485s
>
>
