At 11:03 2003-1-9 +0100, you wrote:
I recently compared the speed of an identical algorithm (loading a JPEG) on
several platforms: PC (1.6 GHz), Win CE (about 400 MHz; I have forgotten the
device type, but it should not matter) and Palm m515. The results really
surprised me.

The m515 ran the program approximately 1000x slower than the PC, whereas the
processor speeds differ by a factor of only 50.
The CE machine was only 2 times slower than the processor speed difference
would suggest.
Clock speed isn't the only factor. You also need to look at how the CPU uses its cycles.

The Palm m515 uses a Dragonball VZ processor at 33 MHz. This uses a version of the Motorola 68000 core, first designed in the late 1970s. On this core, a single instruction takes an average of 6 cycles, there is no real instruction pipelining, and there is no memory cache. The Dragonball VZ is rated at 5.4 MIPS at 33 MHz.
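As a rough sanity check of these Dragonball figures, MIPS is approximately the clock rate in MHz divided by the average cycles per instruction (CPI). This is a minimal sketch that assumes a flat average CPI of 6, as quoted above; real CPI varies with the instruction mix and memory wait states.

```python
def mips(clock_mhz: float, cycles_per_instruction: float) -> float:
    """Millions of instructions per second from clock (MHz) and average CPI."""
    return clock_mhz / cycles_per_instruction

# Dragonball VZ in the m515: 33 MHz, roughly 6 cycles per instruction on average.
dragonball_mips = mips(33.0, 6.0)

# Working backwards: the average CPI implied by the rated 5.4 MIPS at 33 MHz.
implied_cpi = 33.0 / 5.4

print(f"Estimated Dragonball VZ throughput: {dragonball_mips:.1f} MIPS")
print(f"CPI implied by the 5.4 MIPS rating: {implied_cpi:.1f}")
```

The estimate of 5.5 MIPS is close to the rated 5.4 MIPS, which corresponds to an average of about 6.1 cycles per instruction.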

The Pentium 4 at 2 GHz has a rated MIPS score of around 4900. This means that, on average, the Pentium 4 executes about two and a half instructions per clock cycle. If you scale that back, your 1.6 GHz chip can do about 3900 MIPS. That is roughly a 720x speed difference, which is in the same order of magnitude as the 1000x difference you saw. The rest of the difference could be due to things like memory caching and the dedicated floating-point unit in the P4.
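The scaling argument above can be written out explicitly. This is a sketch that assumes MIPS scales linearly with clock speed within the same processor family, which ignores differences in memory latency, cache size and similar effects; the constant names are illustrative.

```python
P4_RATED_MIPS = 4900.0      # Pentium 4 at 2 GHz, as quoted above
P4_CLOCK_MHZ = 2000.0
USER_PC_CLOCK_MHZ = 1600.0  # the 1.6 GHz PC from the original benchmark
DRAGONBALL_MIPS = 5.4       # rated figure for the Dragonball VZ at 33 MHz

# Average instructions completed per clock cycle on the Pentium 4.
p4_ipc = P4_RATED_MIPS / P4_CLOCK_MHZ

# Scale the rated MIPS down linearly to the 1.6 GHz clock.
pc_mips = P4_RATED_MIPS * USER_PC_CLOCK_MHZ / P4_CLOCK_MHZ

# Ratio of instruction throughput between the PC and the m515.
speed_ratio = pc_mips / DRAGONBALL_MIPS

print(f"P4 instructions per clock: {p4_ipc:.2f}")
print(f"Estimated 1.6 GHz MIPS:    {pc_mips:.0f}")
print(f"PC / m515 throughput:      {speed_ratio:.0f}x")
```

With these inputs the sketch gives about 2.45 instructions per clock, about 3920 MIPS for the 1.6 GHz chip, and a throughput ratio of roughly 726x.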

In other words, after linear compensation for the different processor speeds,
the CE machine seems to be 10 times "faster" than the Palm.
To me this is an incredibly bad result for Palm. Naturally I tried to find
some excuses.
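The clock-normalized comparison in the quoted paragraphs can be reproduced from the approximate ratios given there: a 1000x slowdown against a roughly 50x clock difference for the m515 (1600 / 33 is about 48), and a 2x per-clock penalty for the CE device. This is a sketch; the variable names are illustrative.

```python
# Approximate figures from the original benchmark report.
palm_slowdown_vs_pc = 1000.0   # m515 ran the JPEG loader ~1000x slower than the PC
palm_clock_ratio = 50.0        # PC clock / m515 clock (about 1.6 GHz vs 33 MHz)
ce_per_clock_penalty = 2.0     # CE was ~2x slower than its clock ratio alone predicts

# How much slower the m515 is per clock cycle than the PC.
palm_per_clock_penalty = palm_slowdown_vs_pc / palm_clock_ratio

# Per-clock advantage of the CE device over the m515.
ce_vs_palm_per_clock = palm_per_clock_penalty / ce_per_clock_penalty

print(f"m515 per-clock penalty vs PC:     {palm_per_clock_penalty:.0f}x")
print(f"CE per-clock advantage over m515: {ce_vs_palm_per_clock:.0f}x")
```

The m515 comes out about 20x slower per clock than the PC, and the CE device about 2x slower, which gives the factor of 10 between the two handhelds.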
Your reasons may be valid, but I think it just comes down to modern processor design. JPEG decompression is very processor-intensive, but many things we do with these devices don't require CPU horsepower, and those tasks are where the Dragonballs have done very well, with very low power consumption.

The Dragonball isn't the chip of the future. The move to ARM shows that, as those chips can deliver considerably more instructions in a given time while using less energy.

I am using CodeWarrior 8 and, to the best of my knowledge, I used the optimal
compiler settings. Curiously, CW 9 (besides other problems) seems to produce
slower code than CW 8, at least in this particular case.

I tried to look at how good the CW optimizer is. It seems to lag behind its
MS counterpart (an improvement by a factor of 2-3 should often be possible),
but working in assembly is the last resort for me (especially for JPEG,
where the key places are rather complicated algorithms). Anyway, I would be
curious whether the gcc compiler could deliver faster code.
The CW optimizer has a few major issues on the 68K. The biggest one is that it does a poor job of register allocation. We're not investing any resources in improving 68K performance, however; the compiler is too old, and only Palm OS uses it. I'd expect that changes to the IR optimizer that help performance on memory-rich RISC systems actually hurt the simpler execution patterns of the 68K. I'd expect GCC to do better here, but I don't know by how much.

When I was working on the CW compiler for x86 a few years ago, we were quite competitive with VC++ most of the time. On plain integer performance, we would often beat VC++, while VC++ generally did better on floating point. It was pretty amazing how tweaks in the compiler would affect things: you could have a really phenomenal optimizer reworking the loop structures and algorithms, and then lose all the speed you gained due to poor instruction selection.

--
Ben Combee <[EMAIL PROTECTED]>
CodeWarrior for Palm OS technical lead
Palm OS programming help @ www.palmoswerks.com

--
For information on using the Palm Developer Forums, or to unsubscribe, please see http://www.palmos.com/dev/support/forums/
