From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> +     /* Main loop using 64byte blocks */
> +     for (; len > 64; len -= 64, buff += 64) {
> +             asm("addq 0*8(%[src]),%[res]\n\t"
> +                 "adcq 1*8(%[src]),%[res]\n\t"
> +                 "adcq 2*8(%[src]),%[res]\n\t"
> +                 "adcq 3*8(%[src]),%[res]\n\t"
> +                 "adcq 4*8(%[src]),%[res]\n\t"
> +                 "adcq 5*8(%[src]),%[res]\n\t"
> +                 "adcq 6*8(%[src]),%[res]\n\t"
> +                 "adcq 7*8(%[src]),%[res]\n\t"
> +                 "adcq $0,%[res]"
> +                 : [res] "=r" (result)
> +                 : [src] "r" (buff),
> +                 "[res]" (result));

Did you try the asm loop that used 'lea %rcx..., jcxz..., jmp...'
without any unrolling?

...
> +     /* Sum over any remaining bytes (< 8 of them) */
> +     if (len & 0x7) {
> +             unsigned long val;
> +             /*
> +              * Since "len" is > 8 here we backtrack in the buffer to load
> +              * the outstanding bytes into the low order bytes of a quad and
> +              * then shift to extract the relevant bytes. By doing this we
> +              * avoid additional calls to load_unaligned_zeropad.

That comment is wrong. Maybe:
                 * Read the last 8 bytes of the buffer then shift to extract
                 * the required bytes.
                 * This is safe because the original length was > 8 and avoids
                 * any problems reading beyond the end of the valid data.
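A hedged C sketch of the tail read that suggested comment describes: instead of a narrow load for the last len & 7 bytes, load the final 8 bytes of the buffer (safe because the original length was > 8) and shift out the low-order bytes that earlier iterations already consumed. Little-endian is assumed; the function name is illustrative, not from the patch.

```c
#include <string.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch: return the last (len & 7) bytes of 'buf' as the low bytes of
 * a value, using one aligned-width load.  Reading the 8 bytes ending
 * exactly at buf + len never touches memory beyond the valid data, so
 * no load_unaligned_zeropad-style fixup is needed.  Illustrative only.
 */
static unsigned long tail_bytes(const unsigned char *buf, size_t len)
{
	size_t rem = len & 7;
	uint64_t val;

	if (!rem)
		return 0;

	memcpy(&val, buf + len - 8, 8);	/* last 8 bytes, never past the end */
	return val >> (8 * (8 - rem));	/* drop bytes already summed */
}
```

In the checksum itself the extracted bytes would still need to land in their original lane positions before folding; this sketch only shows the safe-read-and-shift part the comment is about.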

        David
