Thanks for the results. It looks like I'll need to get access to some older hardware and try some different combinations. There are a few things I can tune (loading all 8 values at the start vs. loading one per fold, different BUFSIZE values), and I'd like to find a setup that offers an improvement across the board.
Did you test this with the first patch or the second patch? At a minimum, cutting out the final table-based fold should be a consistent ~5% improvement on any platform.

On Wed, Dec 25, 2024, 17:45 Michael Stone <mst...@debian.org> wrote:

> On Tue, Dec 24, 2024 at 11:52:38PM +0000, Pádraig Brady wrote:
> >However this is a regression on i7-5600U at least:
>
> I'm seeing the same on older consumer hardware, even after the latest
> patch (i3-6100):
>
> $ time ./cksum --debug /tmp/testfil
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m27.717s
> user 0m4.519s
> sys 0m23.173s
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: avx512 support not detected
> cksum_chorba: avx2 support not detected
> cksum_chorba: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m31.288s
> user 0m6.863s
> sys 0m24.404s
>
>
> on older server hardware (E3-1240 v5) I do see a slight improvement in
> user time, but the system time increases and *not once* did I see the
> overall runtime decrease (I did run them in the opposite order as well).
> Maybe this indicates that the change trashes the cpu cache or somesuch?
>
> $ time ./cksum --debug /tmp/testfil
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m22.923s
> user 0m3.956s
> sys 0m18.867s
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: avx512 support not detected
> cksum_chorba: avx2 support not detected
> cksum_chorba: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m23.962s
> user 0m3.768s
> sys 0m20.165s
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: avx512 support not detected
> cksum_chorba: avx2 support not detected
> cksum_chorba: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m25.021s
> user 0m3.776s
> sys 0m21.235s
>
> $ time ./cksum --debug /tmp/testfil
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m23.961s
> user 0m4.160s
> sys 0m19.798s
>
>
> on older AMD server hardware it's closer; just as on the intel hardware
> there's a decrease in user time and an increase in system time, but the
> results are close enough that it's a wash with sometimes one being
> faster and sometimes the other:
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: avx512 support not detected
> cksum_chorba: avx2 support not detected
> cksum_chorba: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m14.509s
> user 0m5.083s
> sys 0m9.410s
>
> $ time ./cksum --debug /tmp/testfil
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m14.220s
> user 0m5.626s
> sys 0m8.578s
>
>
> cf same binaries on zen4 (EPYC 9354P) where the new code is a clear
> overall improvement:
>
> $ time ./cksum --debug /tmp/testfil
> cksum: using avx512 hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m9.396s
> user 0m1.720s
> sys 0m7.676s
>
> $ time ./cksum_chorba --debug /tmp/testfil
> cksum_chorba: using avx512 hardware support
> 3018728591 68719476736 /tmp/testfil
>
> real 0m8.769s
> user 0m1.284s
> sys 0m7.485s
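
To make the quoted timings easier to compare across machines, here is a rough sketch that turns them into percentage changes of cksum_chorba relative to cksum. The numbers are copied from the runs quoted above; the names RUNS and pct_change are mine and purely illustrative, and averaging the two E3-1240 v5 runs per binary is my own assumption about how to combine them. With one or two runs per machine this only summarises what was posted and says nothing about run-to-run noise.

```python
from statistics import mean

FIELDS = ("real", "user", "sys")

# (real, user, sys) in seconds, one tuple per quoted run.
# Machine labels follow the descriptions in the quoted message.
RUNS = {
    "i3-6100": {
        "cksum": [(27.717, 4.519, 23.173)],
        "cksum_chorba": [(31.288, 6.863, 24.404)],
    },
    "E3-1240 v5": {
        "cksum": [(22.923, 3.956, 18.867), (23.961, 4.160, 19.798)],
        "cksum_chorba": [(23.962, 3.768, 20.165), (25.021, 3.776, 21.235)],
    },
    "older AMD server": {
        "cksum": [(14.220, 5.626, 8.578)],
        "cksum_chorba": [(14.509, 5.083, 9.410)],
    },
    "EPYC 9354P": {
        "cksum": [(9.396, 1.720, 7.676)],
        "cksum_chorba": [(8.769, 1.284, 7.485)],
    },
}


def pct_change(machine, field):
    """Percent change of cksum_chorba vs cksum (negative means chorba is faster).

    Assumption: multiple runs of the same binary are combined by a plain mean.
    """
    idx = FIELDS.index(field)
    old = mean(run[idx] for run in RUNS[machine]["cksum"])
    new = mean(run[idx] for run in RUNS[machine]["cksum_chorba"])
    return 100.0 * (new - old) / old


for machine in RUNS:
    changes = ", ".join(f"{f} {pct_change(machine, f):+.1f}%" for f in FIELDS)
    print(f"{machine}: {changes}")
```

Splitting the change into real, user and sys keeps the pattern Michael describes visible: user time can drop while sys time rises, so the overall runtime does not improve.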