On Sunday 23 November 2008 22:49:21 Jason Martin wrote:
> > You assume OOO works perfectly.
> >
> > mov $0,%r11
> > mul %rcx
> > add %rax,%r10
> > mov 24(%rsi,%rbx,8),%rax
> > adc %rdx,%r11
> > mov %r10,16(%rdi,%rbx,8)
> > mul %rcx
> > here mov $0,%r8
> > add %rax,%r11
> > mov 32(%rsi,%rbx,8),%rax
> > adc %rdx,%r8
> > mov %r11,24(%rdi,%rbx,8)
> >
> > Moving the line marked "here" up one line, to before the mul, slows things
> > down from 2.78 to 3.03 c/l, whereas if OOO were perfect it should not have
> > any effect. This may be due to a CPU scheduler bug, or perhaps the scheduler
> > is simply not perfect: mul being long latency, two macro-ops, two pipes,
> > only pipe 0_1, etc.
> > If it's a bug then perhaps K10 is better?
>
> I've seen similar wackiness with the core 2 out-of-order engine. It's
> strange enough that sometimes sticking in a nop actually saves a
> cycle!
Another oddity:
loop:
mov (%rdi),%rcx
adc %rcx,%rcx
mov %rcx,(%rdi)
... 8 way unrolled lshift by 1
mov 56(%rdi),%r9
adc %r9,%r9
mov %r9,56(%rdi)
lea 64(%rdi),%rdi
dec %rsi
jnz loop
runs at 1.11 c/l,
whereas the rshift by 1 (i.e. with rcr instead of adc) does not; you have to
bunch the instructions up into fours to get to 1.11 c/l:
mov (%rdi),%rcx
mov -8(%rdi),%r8
mov -16(%rdi),%r9
mov -24(%rdi),%r10
rcr $1,%rcx
rcr $1,%r8
rcr $1,%r9
rcr $1,%r10
mov %rcx,(%rdi)
mov %r8,-8(%rdi)
mov %r9,-16(%rdi)
mov %r10,-24(%rdi)
mov -32(%rdi),%rcx
mov -40(%rdi),%r8
mov -48(%rdi),%r9
mov -56(%rdi),%r10
rcr $1,%rcx
rcr $1,%r8
rcr $1,%r9
rcr $1,%r10
mov %rcx,-32(%rdi)
mov %r8,-40(%rdi)
mov %r9,-48(%rdi)
mov %r10,-56(%rdi)
lea -64(%rdi),%rdi
dec %rsi
jnz loop
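For reference, here is a plain C sketch of what the two loops above compute: a one-bit left shift walking up from the lowest limb (the adc version) and a one-bit right shift walking down from the highest limb (the rcr version). The names lshift1 and rshift1 and the carry_in parameter are made up for illustration; this is not the MPIR or GMP code, and it says nothing about timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift the n-limb number p left by one bit, in place.
 * carry_in plays the role of the incoming carry flag; the return value
 * is the bit shifted out of the top limb. Mirrors "adc %reg,%reg". */
static uint64_t lshift1(uint64_t *p, size_t n, uint64_t carry_in)
{
    uint64_t carry = carry_in & 1;
    for (size_t i = 0; i < n; i++) {
        uint64_t limb = p[i];
        p[i] = (limb << 1) | carry;   /* 2*limb + carry */
        carry = limb >> 63;           /* old top bit becomes the next carry */
    }
    return carry;
}

/* Shift the n-limb number p right by one bit, in place, starting at the
 * highest limb and moving down. Mirrors "rcr $1,%reg". */
static uint64_t rshift1(uint64_t *p, size_t n, uint64_t carry_in)
{
    uint64_t carry = carry_in & 1;
    for (size_t i = n; i-- > 0; ) {
        uint64_t limb = p[i];
        p[i] = (limb >> 1) | (carry << 63);  /* carry enters at the top bit */
        carry = limb & 1;                    /* old low bit becomes the next carry */
    }
    return carry;
}
```

In both directions the only thing passed from one limb to the next is the single carry bit, which is the same dependency the adc and rcr chains have through the carry flag.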
Again, it looks like the OOO is broken.
But if you look at the gmp-4.2.4 mpn_mul_1, which runs at 3 c/l, the OOO has
to get work from three separate iterations to fill out the slots.
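To make the dependency structure concrete, here is a minimal C sketch of the operation a mul_1 routine performs (multiply an n-limb number by a single limb). The name mul_1_ref is hypothetical and this is not the gmp-4.2.4 assembly; it also assumes a compiler with the unsigned __int128 extension (GCC or Clang).

```c
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] = sp[0..n-1] * v, returning the high limb that falls off the top.
 * Reference sketch only; relies on the unsigned __int128 compiler extension. */
static uint64_t mul_1_ref(uint64_t *rp, const uint64_t *sp, size_t n, uint64_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128 bit product, plus the high half of the previous one */
        unsigned __int128 t = (unsigned __int128)sp[i] * v + carry;
        rp[i] = (uint64_t)t;            /* low half is the result limb */
        carry = (uint64_t)(t >> 64);    /* high half feeds the next limb */
    }
    return carry;
}
```

The multiplies of different limbs do not depend on each other; only the carry links one iteration to the next. That is why a loop like this can only fill the execution slots if the hardware overlaps work from several iterations.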
While I'm at it, I have some more complaints :)
Timing mpn_add/sub_n with the gmp speed program, the results stay fairly
consistent. You may get, say, 24.5 cycles in one run and 24.6 in another. OK,
occasionally you get 200 cycles, but I assume that's an interrupt or some such
thing. But for my mpn_com_n, which is mind-numbingly simple
(mov, not, mov), sometimes I get 20 cycles, sometimes 40, sometimes 30...
What's going on there, I don't know.
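One common way to make noisy timings like these easier to compare is to collect many samples and report the minimum or the median rather than a single run. The sketch below only does the summarising; the names min_cycles and median_cycles are made up, and this is not how the gmp speed program works internally (I am not claiming anything about that). It would not explain the 20/30/40 spread, only keep the occasional interrupt-sized outlier from dominating.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint64_t, written to avoid overflow from subtraction */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Smallest sample: the run least disturbed by interrupts and the like. */
static uint64_t min_cycles(const uint64_t *samples, size_t n)
{
    assert(n > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Median sample (lower middle for even n). Sorts the array in place. */
static uint64_t median_cycles(uint64_t *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[(n - 1) / 2];
}
```

With samples of 20, 40, 30, 200 and 20 cycles, the minimum is 20 and the median is 30, so the single 200-cycle outlier has no effect on either figure.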
Confused.