I apologise if you see this twice.

Hi Paul and Justin,

Dipl. Ing. Paul Szawlowski wrote:

> Hi Justin,
>
>> A simple example: Why is this loop
>>
>> for(int i = count; --i >= 0 ; )
>>
>> faster than this loop
>>
>> for(int i = 0; i < count - 1; i++ )
>>
>> given exactly the same loop content?
>
>
I would have to disagree with you on that Justin. But only because
the first loop runs count times and the second runs count-1 times.
So I'll assume you meant:

for (int i = 0; i < count; i++)

Old hackers make mistakes too, eh [:-)]

>
> Ok, here is my coming out. I do programming now but don't have a solid
> programmer's education background. So I can't tell you why the first
> loop is
> faster. But I really would like to know more about what's going on
> under the
> hood. Could you please explain to me



The reason for this is the way your program is translated into simple
instructions for your CPU. I will illustrate this with some
simplified machine code examples. No CPU has exactly these
instructions, but they all look a bit like this. It should be
quite simple to follow, even if you have never programmed at
machine level.

Let's first look at the "slow" example:

for (int i = 0; i < count; i++)

A direct translation into a simplified CPU code would look like this:

LD  A,0     - load 0 into register A
:::loop     - label marking the start of the loop (we jump to here)
CMP A,count - compare A to count
BGE done    - branch (on greater or equal) to label done
.
. [code inside the loop]
.
INC A       - increment register A by 1
JMP loop    - jump to label loop (round again)
:::done     - label marking the code after the loop

Each time round the loop we do:

compare, branch, ... increment, jump

The key to the trick is the way the compare (CMP) and branch (BGE)
instructions are tied together. The branch looks at some flags inside
the CPU to know if it should jump. The CMP instruction sets these
flags depending on the result of the comparison.

Most instructions set the CPU flags when "something special happens",
like a result becoming zero or a value changing sign (+/-).

Each time around the loop we increment the value of A, but
nothing special happens when we reach count, so we have to compare
A to count to see if we are done.

The trick is to arrange it so that "something special happens" when
we are done. To do this we count down to zero instead:

for (int i = count; --i >= 0; )

Now when the value of "i" (register A) changes sign to become -1
we should branch out of the loop. Changing sign is
"something special" and sets a flag in the CPU. How nice!
We don't need the compare instruction to set the flags anymore.

LD  A,count - load count into register A
:::loop     - label marking the start of the loop
DEC A       - decrement register A by 1 (also sets flags)
BMI done    - branch (if the result was negative) to label done
.
. [code inside the loop]
.
JMP loop    - jump to label loop (round again)
:::done     - label marking the code after the loop

Each time round the loop we do:

decrement, branch, ... jump

We have saved a compare each time round the loop.

>
> The only thing what I can see is the count - 1 statement. Is it
> calculated
> after every loop step ? Will the compiler only optimize this if count is
> declared const(final) ?
>
> Is this equivalent to the first loop ?
> const int tempCount = count - 1;
> for( int i = 0; i < tempCount; i ++)



You can trust your compiler to do that for you.

Any modern compiler will find instructions that can be moved
outside the loop and will generally do so "behind your back".

In fact, almost all of the old hacker tricks are performed by
optimizing compilers now, and you will see no change in
performance (or even in the machine code) if you use them.

Justin's example is still valid because changing the direction
of the count is often not a simple optimization. The compiler
must be sure it does not change the result of the program.
Some optimizing compilers **will** actually change the code
to count down instead of up if "i" is not used in the loop.

In the old days of simpler CPUs the direction of loops could
have a big impact on a program's running time. This is still
true for signal processors. For the PC class systems of today,
it is not important unless you are writing low level libraries.
If you do that kind of optimization, run tests!
Branch prediction and the like can hold a few surprises.

My advice is to:

* Write clear and readable code
* Use a good optimizing compiler
* Do code optimizations by hand in your spare time (for fun)

The kind of optimization that you really **must** consider
is how you access memory. In a PC class system you have
a very fast cache and some very slow memory. There can be
orders of magnitude difference in running time between
"the right way" and "the wrong way" of using the cache.

/Bo Bai

