FWIW, modern processors are all superscalar and pipelined at the
microarchitecture level. Comparisons, loop increments/decrements, and
branches may all be executed concurrently, as they typically involve
different functional units.

In the trivial case of a null loop body, a compiler can eliminate the loop
altogether. Nevertheless, here it is coded in assembler (I'm using SPARC
assembler because it's what I'm familiar with):

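    ba    test      ! Jump to the loop test
    clr   %o0       ! a = 0 (executes in the branch delay slot)
test:
    cmp   %o0,%i0   ! Compare a with the limit in %i0
    blt,a test      ! Branch back while a < limit
    add   %o0,1,%o0 ! a++ in the delay slot; annulled if branch not taken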

This loop can execute in 1 clock cycle per trip. [In a slightly more complex
loop with 7 instructions per trip, I got performance ranging from 3 to 11
cycles due to alignment of instructions in the instruction cache.]
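
For anyone who doesn't read SPARC, the assembler loop above corresponds
roughly to the C below. This is only a sketch: the function name count_up
and the return value are mine, added so the result can be observed; the
assembler version simply leaves a in %o0.

```c
/* Hypothetical C rendering of the assembler loop: count a up from 0
   while a < n (signed compare, matching blt). The name count_up and
   the return value are illustrative assumptions, not from the post. */
int count_up(int n)
{
    int a;
    for (a = 0; a < n; a++) {
        /* empty body: only compare, branch and increment remain */
    }
    return a;
}
```

With optimisation turned on, a compiler is free to replace the whole loop
with the final value of a, which is the "eliminate the loop altogether"
case mentioned earlier.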

Real code with a significant loop body would have many opportunities to
schedule loop overhead code along with the body.

As has been pointed out, memory access patterns are far more important.  One
other class of optimisation not discussed so far is modulo-scheduling, which can
further reduce the loop overhead relative to the loop body.  Avoiding inhibitors
to modulo-scheduling can be extremely important in HPC codes.
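
As a rough illustration of what an inhibitor looks like, here are two C99
loops. The function names are made up for this sketch, and whether a
particular compiler actually modulo-schedules either loop depends on the
compiler, target and flags, so treat this as an assumption rather than a
guarantee.

```c
#include <stddef.h>

/* Iterations are independent, and restrict promises the arrays do not
   overlap, so a scheduler is free to overlap iteration i+1 with i. */
void scale_add(size_t n, double a,
               const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Each iteration needs the previous result (y[i-1]). This loop-carried
   dependency limits how closely successive iterations can be overlapped. */
void running_sum(size_t n, const double *restrict x, double *restrict y)
{
    if (n == 0)
        return;
    y[0] = x[0];
    for (size_t i = 1; i < n; i++)
        y[i] = y[i - 1] + x[i];
}
```

Other things commonly cited as inhibitors are function calls in the loop
body, early exits, and pointers the compiler cannot prove are unaliased.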

============================================================================
   ,-_|\   Richard Smith - SE Melbourne
  /     \  Sun Microsystems Australia
[EMAIL PROTECTED]                     Phone : +61 3 9869 6200
  \_,-._/  Sun Microsystems House            Direct : +61 3 9869 6224
       v   476 St Kilda Road                    Fax : +61 3 9869 6290
           Melbourne Vic 3004 Australia
===========================================================================
