On Wed, Sep 15, 2010 at 07:41:59PM +0200, Thomas Pornin wrote:
> First thanks to all who responded.

Thanks again, and sorry for all the noise. I had external confirmation
(from someone with access to other Linux-powered PS3s) that gcc38 runs
as well as a PS3 should. A few details:

** I read the wrong documentation earlier: I was using the 603e manual,
and not the G3 manual. The G3 is more powerful than the 603e, and can
schedule more instructions per clock cycle.

** As for instruction timing, the Cell PPU is somewhat similar to a 603e
core, except for the microcoded instructions and the multiply/divide
opcodes (which are slower in the PPU). For all the integer operations
used in a SHA-256 core (additions, rotations, shifts, bitwise
operators), the PPU can execute, under ideal conditions, one operation
per clock cycle, with a latency of 2 cycles (if the next instruction
uses the result, one cycle is lost). Also, the PPU can perform one load
or one store per cycle, concurrently with the integer operations
(provided that there are no dependencies).

** One invocation of the SHA-256 core processes 64 data bytes and uses
2168 integer operations. Assuming that it is possible to execute them
all without being stalled due to dependencies between opcodes, this
implies a theoretical minimum of 2168/64 = 33.9 cycles per byte. My C
code runs at 52 cycles per byte; the difference is shared between the
approximations made above (the stall-less execution is not necessarily
possible), the shortcomings of the GCC code scheduler, and the use of
inline constants (there are 64 constants in SHA-256; my C code inlines
them, and the produced code uses integer opcodes to load them, two
opcodes per constant; that's an extra 2 cycles per byte). My C code on
the G3 runs at 21.4 cycles per byte (really measured), which is
substantially better than the theoretical limit of the PPU.

** Using '-mcpu=cell -mtune=cell' with GCC-4.4.4 changes the emitted
opcodes (so the options do have an effect) but does not notably alter
the performance of my SHA-256 implementation. These options are not
documented (i.e. the part of the GCC manual which describes -mcpu and
-mtune for PowerPC systems does not list 'cell' as a valid value).

In brief, the PPU is not good at integer calculations, at least when not
using AltiVec. The AltiVec unit _can_ perform more 32-bit integer
operations per cycle than a G3, but there are constraints (the
parallelism inherent to the SIMD nature of AltiVec, and higher latency)
which limit its applicability to some algorithms (e.g. SHA-256), and the
C compiler does not use AltiVec instructions on its own (automatic
vectorization is hard).

Does anyone know how the Xenon fares? It contains three cores which are
rumoured to be "very similar to the Cell PPU", and since the Xenon has
no SPU, its performance relies entirely on those cores.


	--Thomas Pornin
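
For anyone who wants to redo the cycles-per-byte arithmetic from the
message above, here is a small Python sketch. The operation count, block
size, constant count and the two measured speeds are the figures quoted
in the message. The function name and the "implied G3 throughput" line
are my own illustration: that last figure assumes the G3 build executes
the same 2168 operations per block, which was not measured.

```python
# Figures quoted in the message.
OPS_PER_BLOCK = 2168     # integer operations per SHA-256 core invocation
BYTES_PER_BLOCK = 64     # data bytes processed per invocation
CONSTANTS = 64           # SHA-256 round constants
OPS_PER_CONSTANT = 2     # integer opcodes emitted to load one inlined constant
MEASURED_PPU = 52.0      # cycles per byte, C code on the Cell PPU
MEASURED_G3 = 21.4       # cycles per byte, C code on the G3


def cycles_per_byte(ops, ops_per_cycle=1.0, block_bytes=BYTES_PER_BLOCK):
    """Stall-free lower bound: every operation issues back to back."""
    return ops / ops_per_cycle / block_bytes


# PPU bound at one integer operation per cycle.
ppu_bound = cycles_per_byte(OPS_PER_BLOCK)

# Cost attributed to loading the 64 inlined constants.
constant_cost = cycles_per_byte(CONSTANTS * OPS_PER_CONSTANT)

# Assumption: same operation count on the G3 as in the PPU estimate.
implied_g3_ops_per_cycle = OPS_PER_BLOCK / (MEASURED_G3 * BYTES_PER_BLOCK)

print(f"PPU stall-free bound : {ppu_bound:.3f} cycles/byte")
print(f"constant loading     : {constant_cost:.1f} cycles/byte")
print(f"measured / bound     : {MEASURED_PPU / ppu_bound:.2f}x")
print(f"implied G3 throughput: {implied_g3_ops_per_cycle:.2f} ops/cycle")
```

The bound comes out at 33.875, which rounds to the 33.9 quoted above,
and the constant loading at exactly 2 cycles per byte.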