> I have one use-case (although not published so less
> important) where the table size increase is a problem, and would prefer
> to use the current smaller crc.c (or even better, a smaller one that
> generate the tables on the fly, I forgot that it is possible).
There's another approach for table-less lookups that is slower than
slice-by-4 but faster than the Sarwate algorithm we currently use,
described in "Chi, M., He, D., & Liu, J. (2018). Fast Software-based
Table-less Algorithm for CRC Generation." You can do the polynomial
multiplication approach but manually execute it with 27 shifts and XORs:
uint32_t
crc32_update_no_xor_slice_by_4 (uint32_t crc, const char *buf)
{
uint32_t local_buf;
memcpy(&local_buf, buf, 4);
local_buf = le64toh(local_buf) ^ crc;
crc = local_buf ^ \
(local_buf << 6) ^ \
(local_buf << 9) ^ \
(local_buf << 10) ^ \
(local_buf << 12) ^ \
(local_buf << 16) ^ \
(local_buf << 24) ^ \
(local_buf << 25) ^ \
(local_buf << 26) ^ \
(local_buf << 28) ^ \
(local_buf << 29) ^ \
(local_buf << 30) ^ \
(local_buf << 31);
crc = crc ^ \
(crc >> 1) ^ \
(crc >> 2) ^ \
(crc >> 4) ^ \
(crc >> 5) ^ \
(crc >> 7) ^ \
(crc >> 8) ^ \
(crc >> 10) ^ \
(crc >> 11) ^ \
(crc >> 12) ^ \
(crc >> 16) ^ \
(crc >> 22) ^ \
(crc >> 23) ^ \
(crc >> 26);
return crc;
}
Is there a use case where we would need this? Either way it is faster than
the current implementation so it might be worth considering for platforms
that don't want slice-by-8.
Performance:
# single byte lookup (current)
$ gltests/bench-crc 1000000
real 7.502692
user 7.503
sys 0.000
# manual polynomial multiplication
$ gltests/bench-crc 1000000
real 6.627648
user 6.627
sys 0.000
So a 12% speedup on my machine, and would be a larger improvement with a
greater penalty for memory lookups vs. XOR/shift instructions
On Thu, 17 Oct 2024 at 09:33, Bruno Haible <[email protected]> wrote:
> Simon Josefsson wrote:
> > If we are splitting this module up into several, maybe it even make more
> > sense to have a 'crc-slice-by-8' module for that particular portable
> > optimization. I have one use-case (although not published so less
> > important) where the table size increase is a problem, and would prefer
> > to use the current smaller crc.c (or even better, a smaller one that
> > generate the tables on the fly, I forgot that it is possible).
>
> We use different modules for different functionality. (So, for example,
> CRC with a different polynomial could be a different module. Or it
> could be in the same module if most of the tables and code are the same.)
>
> For optimizations, we have other techniques available:
> - For optimizations that the package developer can enable:
> That package can invoke a particular Autoconf macro in its
> configure.ac.
> The effect of this macro is typically to initialize a configure-time
> variable.
> - For optimizations that the installer / distributor can enable:
> An --enable/--disable option, through AC_ARG_ENABLE.
> - For optimizations that the user can enable:
> An environment variable.
> - For optimizations that are enabled by default, depending on CPU
> features: A feature check at runtime, whose result is cached in a
> static variable. (/proc/cpuinfo, getauxval(AT_HWCAP), cpuid
> instruction,
> __x86_get_cpuid_feature_leaf, `ld.so --list-diagnostics`)
>
> Bruno
>
>
>
>