On Sunday 23 November 2008 21:38:07 Bill Hart wrote:
> It seems like unrolling our block2 by 2 could be made optimal in
> theory. You need 2 slots for the loop control. There are 14 slots in
> your block2.
>
> 2*14 + 2 = 30.
>
> That would give 10/4 = 2.5c/l.
>
> By the way, you suggest that perhaps moving the loop control up might
> help. If the processor has out-of-order capability, why would this
> help? Is there something else that prevents that from executing
> earlier regardless?
You assume OOO works perfectly.
mov $0,%r11
mul %rcx
add %rax,%r10
mov 24(%rsi,%rbx,8),%rax
adc %rdx,%r11
mov %r10,16(%rdi,%rbx,8)
mul %rcx
mov $0,%r8        # <-- "here"
add %rax,%r11
mov 32(%rsi,%rbx,8),%rax
adc %rdx,%r8
mov %r11,24(%rdi,%rbx,8)
Moving the line marked "here" up one, to before the mul, slows things down from
2.78 to 3.03 c/l, whereas if OOO were perfect it should have no effect.
This may be due to a CPU scheduler bug, or perhaps the scheduler is just not
perfect: mul is long latency, two macro ops, two pipes, only pipe 0_1,
etc.
If it's a bug then perhaps K10 is better?
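
For anyone following along without reading the assembly closely, here is a rough C sketch of what the fragment above appears to compute: a mul_1-style loop, where each mul handles one limb. The function name and signature are made up for illustration (this is not the MPIR source), and it assumes a compiler that provides unsigned __int128, such as GCC or Clang on x86-64.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Computes dst[0..n-1] = src[0..n-1] * m and returns the final carry limb.
 * Register names in the comments refer to the assembly fragment above. */
uint64_t mul_1_sketch(uint64_t *dst, const uint64_t *src, size_t n, uint64_t m)
{
    uint64_t carry = 0;                      /* plays the role of %r10 / %r11 / %r8 */
    for (size_t i = 0; i < n; i++) {
        /* mul %rcx: 64x64 -> 128-bit product in %rdx:%rax */
        unsigned __int128 p = (unsigned __int128)src[i] * m;
        uint64_t lo = (uint64_t)p;
        uint64_t hi = (uint64_t)(p >> 64);

        lo += carry;                         /* add %rax,%r10 */
        hi += (lo < carry);                  /* adc %rdx,%r11 (new high word starts at 0) */

        dst[i] = lo;                         /* mov %r10,16(%rdi,%rbx,8) */
        carry = hi;                          /* becomes the low-word accumulator next time */
    }
    return carry;
}
```

The "mov $0" lines in the assembly correspond to starting each new high word from zero before the adc. That dependency is why the position of the mov relative to the mul should not matter to a perfect OOO scheduler.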