Hi James,

On 03.10.2017 08:38, Marcin Nowakowski wrote:


The need for 64-bit signed length is unfortunate. Do you get decent
assembly and comparable/better performance on 32-bit if you just use len
and only decrement it in the loops? i.e.

-       while ((length -= sizeof(uXX)) >= 0) {
+       while (len >= sizeof(uXX)) {
                 register uXX value = get_unaligned_leXX(p);

                 CRC32(crc, value, XX);
                 p += sizeof(uXX);
+               len -= sizeof(uXX);
         }

That would be more readable too IMHO.

or maybe just do some pointer arithmetic like

   const u8 *end = p + len;

   while ((end - p) >= sizeof(uXX)) {
            register uXX value = get_unaligned_leXX(p);

            CRC32(crc, value, XX);
            p += sizeof(uXX);
   }

Thank you both for these suggestions. All solutions are very similar in terms of the assembly produced, although the original code is the smallest of all:

original vs James':
function                                     old     new   delta
crc32_mips_le_hw                             104     132     +28
vermagic                                      72      78      +6
chksumc_finup                                 40      44      +4
chksumc_digest                                44      48      +4
chksum_finup                                  92      96      +4
chksum_digest                                100     104      +4

original vs Jonas':
add/remove: 0/0 grow/shrink: 7/0 up/down: 90/0 (90)
function                                     old     new   delta
crc32_mips_le_hw                             104     148     +44
vermagic                                      72      78      +6
chksumc_finup                                 40      44      +4
chksumc_digest                                44      48      +4
chksum_finup                                  92      96      +4
chksum_digest                                100     104      +4


However, the key part, the processing loop, is 6 instructions long in all variants. Only the pre- and post-loop processing adds extra instructions, so all of these solutions should perform roughly the same. I find James' code a bit more readable, so I'll go with it and post an updated patch.
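
As a rough illustration of why the original loop form needs a signed 64-bit counter while James' form can keep the unsigned len, here is a minimal user-space sketch. The names crc_signed_length, crc_unsigned_len and the software crc32_step_* helpers are hypothetical stand-ins for the templated uXX loops and the CRC32 macro; none of them come from the patch, and the tail handling is simplified to a byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software stand-in for one byte-wide hardware CRC32 step
 * (reflected polynomial 0xEDB88320). The driver uses the CRC32 macro. */
static uint32_t crc32_step_u8(uint32_t crc, uint8_t value)
{
	crc ^= value;
	for (int i = 0; i < 8; i++)
		crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	return crc;
}

/* Hypothetical stand-in for CRC32(crc, value, 32) on a little-endian word. */
static uint32_t crc32_step_u32(uint32_t crc, const uint8_t *p)
{
	for (size_t i = 0; i < sizeof(uint32_t); i++)
		crc = crc32_step_u8(crc, p[i]);
	return crc;
}

/* Original form: the counter is decremented first and then compared with 0,
 * so it has to be signed and wide enough to hold any len. */
uint32_t crc_signed_length(uint32_t crc, const uint8_t *p, size_t len)
{
	int64_t length = (int64_t)len;

	assert(p != NULL || len == 0);
	while ((length -= (int64_t)sizeof(uint32_t)) >= 0) {
		crc = crc32_step_u32(crc, p);
		p += sizeof(uint32_t);
	}
	/* length is negative here; its low two bits still give the tail size */
	for (size_t tail = (size_t)length & 3; tail > 0; tail--)
		crc = crc32_step_u8(crc, *p++);
	return crc;
}

/* James' form: compare before subtracting, so the unsigned len never wraps. */
uint32_t crc_unsigned_len(uint32_t crc, const uint8_t *p, size_t len)
{
	assert(p != NULL || len == 0);
	while (len >= sizeof(uint32_t)) {
		crc = crc32_step_u32(crc, p);
		p += sizeof(uint32_t);
		len -= sizeof(uint32_t);
	}
	for (; len > 0; len--)
		crc = crc32_step_u8(crc, *p++);
	return crc;
}
```

Both functions should return the same value for any buffer; the only difference is the type of the loop counter and where the subtraction happens relative to the comparison.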


The comparisons above were for 64-bit builds, where the size difference is negligible. On 32-bit builds, however, the difference is more significant:

original vs James':

function                                     old     new   delta
vermagic                                      80      86      +6
crc32c_mips_le_hw                            144     104     -40
crc32_mips_le_hw                             144     104     -40

The main CRC loop also shrinks from 9 to 5 instructions, which is a significant reduction in loop size.

Marcin
