From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +0000, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling the '4' comes from length of 'adcq
> > > 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> > > 'adcq    0*8(%rdi),%rax' is using 3 bytes instead.
> > >
> > > We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4 bytes
> > > instruction and remove the nop.
> > >
> > >
> > > +       lea     20f(, %rcx, 4), %r11
> > > +       clc
> > > +       jmp     *%r11
> > > +
> > > +.align 8
> > > +       adcq    6*8(%rdi),%rax
> > > +       adcq    5*8(%rdi),%rax
> > > +       adcq    4*8(%rdi),%rax
> > > +       adcq    3*8(%rdi),%rax
> > > +       adcq    2*8(%rdi),%rax
> > > +       adcq    1*8(%rdi),%rax
> > > +       adcq    0*8(%rdi),%rax // could force a 4 byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> > > +       nop
> > > +20:    /* #quads % 8 jump table base */
> >
> > Or move label at the top (after the .align) and adjust the maths.
> > You could add a second label after the first adcq and use the
> > difference between them for the '4'.
> 
> Not really.
> 
> We could not use the trick if the length was 5.
> 
> Only 1, 2, 4 and 8 are supported.

Indeed, and 'lea  20f(, %rcx, 5), %r11' will generate an error from the
assembler.
So it seems appropriate to get the assembler to verify this for you.
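
To illustrate why only 1, 2, 4 and 8 work: the scale lives in a 2-bit
field of the SIB byte, so there is simply no encoding for 5. A minimal
Python sketch of that check follows (the function names are made up for
illustration and are not part of any assembler):

```python
# Hypothetical helpers modelling the 2-bit scale field of the x86 SIB byte.
# The field stores log2(scale), so only 1, 2, 4 and 8 can be represented.
SIB_SCALE_BITS = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}


def sib_scale_bits(scale: int) -> int:
    """Return the 2-bit scale field for an index scale, or raise ValueError."""
    if scale not in SIB_SCALE_BITS:
        raise ValueError(f"scale {scale} cannot be encoded (only 1, 2, 4, 8)")
    return SIB_SCALE_BITS[scale]


def can_encode(scale: int) -> bool:
    """True if an index scale of this size fits in the SIB byte."""
    try:
        sib_scale_bits(scale)
        return True
    except ValueError:
        return False
```

Here can_encode(4) holds and can_encode(5) does not, which mirrors the
assembler rejecting the scale-5 'lea'.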

Assuming this code block is completely skipped for aligned lengths,
the nop isn't needed provided the '20:' label is in the right place.
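
The offset arithmetic behind that can be checked with a small model. This
is only a sketch under stated assumptions: %rcx is taken to hold the
negated count n, so the jump target is label - 4*n, and "the right place"
is read as one byte past the end of the final adcq (my interpretation, not
something the patch spells out). The instruction lengths are the ones
quoted above; entries_ok and its parameters are hypothetical names.

```python
# Lengths from the thread: 'adcq N*8(%rdi),%rax' is 4 bytes for N = 6..1,
# while 'adcq 0*8(%rdi),%rax' is 3 bytes (no displacement byte).
ADCQ_LENGTHS = [4, 4, 4, 4, 4, 4, 3]
SCALE = 4  # scale factor in 'lea 20f(, %rcx, 4), %r11'


def entries_ok(adcq_lengths, pad, pad_is_nop, counts):
    """Check that label - SCALE*n starts exactly the last n adcq's.

    pad is the number of bytes between the end of the final adcq and the
    label; pad_is_nop says whether those bytes are a real nop instruction.
    Assumption: the index register holds -n, so the target is label - SCALE*n.
    """
    starts, offset = [], 0
    for length in adcq_lengths:
        starts.append(offset)
        offset += length
    label = offset + pad
    for n in counts:
        target = label - SCALE * n
        if n == 0:
            # Jumping to the label itself is only safe on an instruction boundary.
            if pad != 0 and not pad_is_nop:
                return False
        elif target != starts[len(adcq_lengths) - n]:
            return False
    return True


with_nop = entries_ok(ADCQ_LENGTHS, 1, True, range(0, 8))
label_at_end = entries_ok(ADCQ_LENGTHS, 0, False, range(0, 8))
forced_4_byte = entries_ok([4] * 7, 0, False, range(0, 8))
shifted_label_skip_zero = entries_ok(ADCQ_LENGTHS, 1, False, range(1, 8))
shifted_label_with_zero = entries_ok(ADCQ_LENGTHS, 1, False, range(0, 8))
```

In this model the nop version and the forced 4-byte encoding both work for
every count, dropping the nop with the label directly after the last adcq
breaks, and the shifted label works only if the n = 0 case never reaches
the jump.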

Someone also pointed out that the code is memory limited (dual add
chains making no difference), so why is it unrolled at all?

OTOH I'm sure I remember something about false dependencies on the
x86 flags register because of instructions only changing some bits.
So it might be that you can't (or couldn't) get concurrency between
instructions that update different parts of the flags register.

        David

