On Wed, Sep 15, 2010 at 07:41:59PM +0200, Thomas Pornin wrote:
> First thanks to all who responded.

Thanks again, and sorry for all the noise.

I had external confirmation (from someone with access to another
Linux-powered PS3) that gcc38 runs as well as a PS3 should.

A few details:

** I read the wrong documentation earlier: I was using the 603e
manual, and not the G3 manual. The G3 is more powerful than the
603e, and can schedule more instructions per clock cycle.

** As for instruction timing, the Cell PPU is somewhat similar to a 603e
core, except for the microcoded instructions and the multiply/divide
opcodes (which are slower in the PPU). For all the integer operations
used in a SHA-256 core (additions, rotations, shifts, bitwise
operators), the PPU can execute, under ideal conditions, one operation
per clock cycle, with a result latency of 2 cycles (if the next
instruction uses the result, one cycle is lost). Also, the PPU can
perform one load
or one store per cycle, concurrently with the integer operations
(provided that there are no dependencies).
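
To illustrate the cost of that lost cycle, here is a toy C model of a
single-issue, in-order pipeline. It is only a sketch of the rule stated
above (one operation per cycle, plus one stall cycle when an operation
consumes the result of the operation just before it). The function name
and the flag array are made up for this illustration, and the model
ignores loads, stores, microcode and everything else a real PPU does.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, NOT a PPU simulator (hypothetical helper for illustration).
 * uses_prev[i] is nonzero when operation i consumes the result of
 * operation i-1. Each operation takes one issue cycle; a dependency on
 * the immediately preceding operation costs one extra stall cycle.
 */
static unsigned long model_cycles(const int *uses_prev, size_t n)
{
    unsigned long cycles = 0;
    size_t i;

    assert(uses_prev != NULL || n == 0);
    for (i = 0; i < n; i++) {
        cycles += 1;                  /* issue slot */
        if (i > 0 && uses_prev[i])
            cycles += 1;              /* stall waiting for the result */
    }
    return cycles;
}
```

In this model, four operations forming a strict dependency chain cost 7
cycles, while four operations that never use the previous result cost 4.
That is why interleaving independent computations matters so much for
this kind of core.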

** A SHA-256 core processes 64 data bytes and uses 2168 integer
operations. Assuming that it is possible to execute them all without
being stalled due to dependencies between opcodes, this implies a
theoretical minimum of 33.9 cycles per byte. My C code runs at 52 cycles
per byte; the difference is shared among the approximations made
above (stall-free execution is not necessarily possible), the
shortcomings of the GCC code scheduler, and the use of inline constants
(there are 64 constants in SHA-256; my C code inlines them, and the
produced code uses integer opcodes to load them, two opcodes per
constant; that's an extra 2 cycles per byte). My C code on the G3 runs
at 21.4 cycles per byte (actually measured), which is substantially
better than the theoretical limit of the PPU.
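
The arithmetic behind those figures can be spelled out in a few lines of
C. This is only a sketch: the constant and function names are invented
for the illustration, and the numbers are the ones quoted above (2168
operations and 64 constants per 64-byte block, two opcodes per inlined
constant, one operation per cycle).

```c
#include <assert.h>

/* Figures quoted in this message; the names are illustrative only. */
enum {
    SHA256_BLOCK_BYTES      = 64,   /* bytes processed per core call   */
    SHA256_OPS_PER_BLOCK    = 2168, /* integer operations per block    */
    SHA256_CONSTANTS        = 64,   /* round constants                 */
    OPS_PER_INLINE_CONSTANT = 2     /* opcodes to load one constant    */
};

/* Cycles per byte for `ops` operations over `bytes` bytes, assuming a
 * sustained rate of `ops_per_cycle` with no stalls at all. */
static double cycles_per_byte(unsigned ops, unsigned bytes,
                              double ops_per_cycle)
{
    assert(bytes > 0 && ops_per_cycle > 0.0);
    return (double)ops / ops_per_cycle / (double)bytes;
}
```

With one operation per cycle, cycles_per_byte(SHA256_OPS_PER_BLOCK,
SHA256_BLOCK_BYTES, 1.0) gives 33.875, which rounds to the 33.9 above.
The inlined constants add cycles_per_byte(SHA256_CONSTANTS *
OPS_PER_INLINE_CONSTANT, SHA256_BLOCK_BYTES, 1.0), i.e. 2.0 cycles per
byte. That leaves about 16 of the measured 52 cycles per byte to stalls
and scheduling.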

** Using '-mcpu=cell -mtune=cell' with GCC-4.4.4 changes the emitted
opcodes (so this has an effect) but does not notably alter the
performance of my SHA-256 implementation. These options are not
documented (i.e. the part of the GCC manual which describes -mcpu and
-mtune for PowerPC systems does not list 'cell' as a valid value).


In brief, the PPU is not good at integer calculations, at least when not
using AltiVec. The AltiVec unit _can_ perform more 32-bit integer
operations per cycle than a G3, but there are constraints (parallelism
inherent to the SIMD nature of AltiVec, and higher latency) which limit
its applicability for some algorithms (e.g. SHA-256), and the C compiler
does not naturally use it (automatic vectorization is hard).

Does anyone know how the Xenon fares? It contains three cores which are
rumoured to be "very similar to the Cell PPU", and since the Xenon has
no SPU, its performance relies entirely on those cores.


        --Thomas Pornin

_______________________________________________
Gcc-cfarm-users mailing list
Gcc-cfarm-users@gna.org
https://mail.gna.org/listinfo/gcc-cfarm-users
