Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

Doug Ledford Wed, 30 Oct 2013 06:23:32 -0700

On 10/30/2013 08:18 AM, David Laight wrote:

>> /me wonders if rearranging the instructions into this order:
>> adcq 0*8(src), res1
>> adcq 1*8(src), res2
>> adcq 2*8(src), res1
>
> Those have to be sequenced.
>
> Using a 64bit lea to add 32bit quantities should avoid the
> dependencies on the flags register.
> However you'd need to get 3 of those active to beat a 64bit adc.
>
>         David

Already done (well, something similar to what you mention above anyway),
doesn't help (although it doesn't hurt either, even though it doubles the
number of adds needed to complete the same work). This is the code I
tested:


#define ADDL_64                                         \
        asm("xorq  %%r8,%%r8\n\t"                       \
            "xorq  %%r9,%%r9\n\t"                       \
            "xorq  %%r10,%%r10\n\t"                     \
            "xorq  %%r11,%%r11\n\t"                     \
            "movl  0*4(%[src]),%%r8d\n\t"               \
            "movl  1*4(%[src]),%%r9d\n\t"               \
            "movl  2*4(%[src]),%%r10d\n\t"              \
            "movl  3*4(%[src]),%%r11d\n\t"              \
            "addq  %%r8,%[res1]\n\t"                    \
            "addq  %%r9,%[res2]\n\t"                    \
            "addq  %%r10,%[res3]\n\t"                   \
            "addq  %%r11,%[res4]\n\t"                   \
            "movl  4*4(%[src]),%%r8d\n\t"               \
            "movl  5*4(%[src]),%%r9d\n\t"               \
            "movl  6*4(%[src]),%%r10d\n\t"              \
            "movl  7*4(%[src]),%%r11d\n\t"              \
            "addq  %%r8,%[res1]\n\t"                    \
            "addq  %%r9,%[res2]\n\t"                    \
            "addq  %%r10,%[res3]\n\t"                   \
            "addq  %%r11,%[res4]\n\t"                   \
            "movl  8*4(%[src]),%%r8d\n\t"               \
            "movl  9*4(%[src]),%%r9d\n\t"               \
            "movl  10*4(%[src]),%%r10d\n\t"             \
            "movl  11*4(%[src]),%%r11d\n\t"             \
            "addq  %%r8,%[res1]\n\t"                    \
            "addq  %%r9,%[res2]\n\t"                    \
            "addq  %%r10,%[res3]\n\t"                   \
            "addq  %%r11,%[res4]\n\t"                   \
            "movl  12*4(%[src]),%%r8d\n\t"              \
            "movl  13*4(%[src]),%%r9d\n\t"              \
            "movl  14*4(%[src]),%%r10d\n\t"             \
            "movl  15*4(%[src]),%%r11d\n\t"             \
            "addq  %%r8,%[res1]\n\t"                    \
            "addq  %%r9,%[res2]\n\t"                    \
            "addq  %%r10,%[res3]\n\t"                   \
            "addq  %%r11,%[res4]"                       \
            : [res1] "=r" (result1),                    \
              [res2] "=r" (result2),                    \
              [res3] "=r" (result3),                    \
              [res4] "=r" (result4)                     \
            : [src] "r" (buff),                         \
              "[res1]" (result1), "[res2]" (result2),   \
              "[res3]" (result3), "[res4]" (result4)    \
            : "r8", "r9", "r10", "r11" )
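
For anyone who wants to follow the idea without reading the asm, here is
a rough portable C sketch of the same technique: 32-bit loads
zero-extended and added into four independent 64-bit accumulators, so no
add depends on the carry flag of another. The names sum32_4way() and
fold64() are made up for illustration and are not kernel functions. The
fold step is not part of the macro above; it is only there to show how
the accumulators could be reduced to a 16-bit ones' complement sum. The
sketch assumes the length is a multiple of 16 bytes and that buffers are
far too small for a 64-bit accumulator to overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the kernel): add 32-bit words into four
 * independent 64-bit accumulators. One loop pass corresponds to one
 * movl/addq group of four in ADDL_64; the macro unrolls four such groups.
 */
static uint64_t sum32_4way(const unsigned char *buff, size_t len)
{
	uint64_t res1 = 0, res2 = 0, res3 = 0, res4 = 0;
	uint32_t w[4];

	assert(len % 16 == 0);	/* assumption: no tail handling in this sketch */

	while (len >= 16) {
		memcpy(w, buff, sizeof(w));	/* alignment-safe 32-bit loads */
		res1 += w[0];	/* plain adds: carries pile up in the */
		res2 += w[1];	/* upper 32 bits, so no adc is needed */
		res3 += w[2];
		res4 += w[3];
		buff += 16;
		len -= 16;
	}
	/* Assumes the buffer is small enough that this cannot overflow. */
	return res1 + res2 + res3 + res4;
}

/* Hypothetical helper: fold a 64-bit sum to 16 bits with end-around carry. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Since 2^32 is congruent to 1 modulo 2^16 - 1, summing 32-bit words and
folding afterwards gives the same ones' complement result as summing
16-bit words directly. The cost is the one already mentioned: twice as
many adds as the 64-bit adc version for the same number of bytes.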
