I agree. Looking over CPU specs, it also looks like this is actually going to be a regression on older hardware, as a lot of 5-10 year old CPUs have 32-64kB of L1 cache and not much more L2 (whereas AMD is doing 3MB L2 caches, which explains the boost there).
I have some old laptops at home I can play around with so I'll tune on there and submit again when I have some more confidence on the speed boost.

On Wed, Dec 25, 2024, 19:57 Pádraig Brady <p...@draigbrady.com> wrote:

> On 25/12/2024 16:55, Sam Russell wrote:
> > Thanks for the results, looks like I'll need to get access to some older
> > hardware and try some different combinations. There's a few things I can
> > tune (loading all 8 values at the start vs loading one per fold, different
> > BUFSIZE values), I'd be interested in finding a setup that definitely
> > offers an improvement across the board.
> >
> > Did you test this with the first patch or the second patch? At a minimum
> > cutting out the final table-based fold should be a consistent ~5%
> > improvement on any platform.
>
> It would be good to test chorba without also increasing the buffer size
> so we're comparing just the algorithms.
>
> We can tweak the buffer sizes after,
> though note ioblksize.h is currently set to 256KiB
> so it would be good to be <= that.
>
> cheers,
> Pádraig