On Tue, 5 Nov 2024 at 00:23, David Rowley <dgrowle...@gmail.com> wrote:
> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier...@gmail.com> wrote:
> > I think we can add a small optimization to this last patch [1].
> > The variable *aligned_end* is only needed in the second loop (for).
> > So, only before the for loop do we actually declare it.
> >
> > Result before this change:
> > check zeros using BERTRAND 1 0.000031s
> >
> > Result after this change:
> > check zeros using BERTRAND 1 0.000018s
> >
> > + const unsigned char *aligned_end;
> >
> > + /* Multiple bytes comparison(s) at once */
> > + aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));
> > + for (; p < aligned_end; p += sizeof(size_t))
>
> I think we all need to stop using Godbolt's servers to run benchmarks
> on. These servers are likely to be running various other workloads in
> highly virtualised environments and are not going to be stable servers
> that would give consistent benchmark results.
>
> I tried your optimisation in the attached allzeros.c and here are my
> results:
>
> # My version
> $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
> char: done in 1566400 nanoseconds
> size_t: done in 195400 nanoseconds (8.01638 times faster than char)
> char: done in 1537500 nanoseconds
> size_t: done in 196300 nanoseconds (7.8324 times faster than char)
> char: done in 1543600 nanoseconds
> size_t: done in 196300 nanoseconds (7.86347 times faster than char)
>
> # Ranier's optimization
> $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in {1..3}; do ./allzeros; done
> char: done in 1943100 nanoseconds
> size_t: done in 531700 nanoseconds (3.6545 times faster than char)
> char: done in 1957200 nanoseconds
> size_t: done in 458400 nanoseconds (4.26963 times faster than char)
> char: done in 1949500 nanoseconds
> size_t: done in 469000 nanoseconds (4.15672 times faster than char)
>
> Seems to be about half as fast with gcc on -O2
>
Thanks for test coding.
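
For reference, here is a minimal sketch of the kind of check being benchmarked: bytes are compared one at a time up to a size_t boundary, then one size_t at a time up to aligned_end, then the tail byte by byte. This is not the attached allzeros.c or the patch in [1], neither of which is reproduced here; the function name all_zeros_sketch and the handling of the unaligned head and tail are assumptions made for illustration. Only the aligned_end computation and the for loop header follow the quoted fragment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if the len bytes starting at ptr are all zero. */
static bool
all_zeros_sketch(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;
	const unsigned char *end;
	const unsigned char *aligned_end;

	assert(ptr != NULL || len == 0);
	end = p + len;

	/* Head: compare byte by byte until p is size_t-aligned or we hit end */
	while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
	{
		if (p == end)
			return true;
		if (*p++ != 0)
			return false;
	}

	/* Round end down to a size_t boundary, as in the quoted fragment */
	aligned_end = (const unsigned char *) ((uintptr_t) end &
										   (~(sizeof(size_t) - 1)));

	/* Multiple bytes compared at once */
	for (; p < aligned_end; p += sizeof(size_t))
	{
		size_t		word;

		/* memcpy avoids aliasing problems; compilers emit a plain load */
		memcpy(&word, p, sizeof(word));
		if (word != 0)
			return false;
	}

	/* Tail: whatever is left after the last full size_t */
	while (p < end)
	{
		if (*p++ != 0)
			return false;
	}

	return true;
}
```

In this sketch aligned_end is assigned immediately before the loop that uses it, which is the placement my change proposed. Whether that placement helps or hurts depends on the compiler, as the numbers in this thread show.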
I tried it with MSVC 2022 (32-bit). Here are the results:

C:\usr\src\tests\allzeros>allzeros
char: done in 71431900 nanoseconds
size_t: done in 18010900 nanoseconds (3.96604 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 71070100 nanoseconds
size_t: done in 19654300 nanoseconds (3.61601 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 68682400 nanoseconds
size_t: done in 19841100 nanoseconds (3.46162 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 63215100 nanoseconds
size_t: done in 17920200 nanoseconds (3.52759 times faster than char)

C:\usr\src\tests\allzeros>c /DRANIERS_OPTIMIZATION
Microsoft (R) Program Maintenance Utility Version 14.40.33813.0
Copyright (C) Microsoft Corporation. All rights reserved.

C:\usr\src\tests\allzeros>allzeros
char: done in 67213800 nanoseconds
size_t: done in 15049200 nanoseconds (4.46627 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 51505900 nanoseconds
size_t: done in 13645700 nanoseconds (3.77452 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 62852600 nanoseconds
size_t: done in 17863800 nanoseconds (3.51843 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 51877200 nanoseconds
size_t: done in 13759900 nanoseconds (3.77017 times faster than char)

The function used to replace clock_gettime is:
timespec_get(ts, TIME_UTC)

best regards,
Ranier Vilela
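
For anyone who wants to repeat the MSVC run, the timing substitution can be sketched roughly as below. It uses the C11 timespec_get() call in place of POSIX clock_gettime(), which the MSVC runtime does not provide. The helper names now_ns_sketch and speedup_sketch are hypothetical and are not taken from allzeros.c. Note that TIME_UTC is a wall-clock base rather than a monotonic one, and I have not checked its resolution under MSVC, so short runs may be noisy.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current wall-clock time in nanoseconds (C11). */
static int64_t
now_ns_sketch(void)
{
	struct timespec ts;
	int			rc;

	/* timespec_get() returns the base (TIME_UTC) on success, 0 on failure */
	rc = timespec_get(&ts, TIME_UTC);
	assert(rc == TIME_UTC);
	(void) rc;					/* silence unused warning under NDEBUG */

	return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Hypothetical helper: how many times faster the size_t loop was. */
static double
speedup_sketch(int64_t char_ns, int64_t size_t_ns)
{
	return (double) char_ns / (double) size_t_ns;
}
```

A run would take now_ns_sketch() before and after each loop and report the difference. For example, speedup_sketch(1566400, 195400) gives roughly 8.016, which matches the first gcc result quoted above.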