Go Fast: <[email protected]>:
>Kato, have you compared the speed of the simulations on PS3 SPE to the speed
>of the simulations on PC, given that the program is optimized for the cpu on
>both sides.

I published the comparison in "A Study on Implementing Parallel MC/UCT
Algorithm" (in Japanese) at GPW 2007:
http://www.geocities.jp/hideki_katoh/publications/gpw2007/gpw07-private.pdf

The following is part of Table 1 on page 4 (modified).

CPU     Time      kpps    Ratio
Cell    830 us    1.2     1
x86     163 us    6.1     5.1

Cell runs about five times slower than x86 at almost the same clock
(3.18 vs. 3.0 GHz). That is much slower than expected, because my code is
not optimized for the SPU, i.e., the same C code was used on both.

If I remember correctly, byte access on Cell is 3 to 7 times slower than
on x86 because the SPU has only 16-byte load/store instructions. Looking at
the generated code, loading a byte is simulated as follows: mask the lower
4 bits of the address, load 16 bytes, then shift and mask the data to place
the target byte in the rightmost byte of the register. Thus, 4 instructions
are needed for every byte fetch. Storing a byte is more complex: mask,
shift, and mask the address; load the 16 bytes that include the target byte
into another register; mask; merge; and store back the whole 16 bytes.

I've implemented a bitboard representation of the 9 x 9 board for both
processors, which is thought to best match the SPU's 128 128-bit-wide
general registers. For lack of time, I have not compared the simulation
speed, only the execution time of the final_score() function, which uses a
flood-fill algorithm with the general registers on the SPU and the SSE
(128-bit wide) registers on x86. The results were almost the same (Cell was
faster by 5% or less).

I will rewrite the MC simulator using bitboards on the SPU, but I have no
time right now... :(

Hideki

>On Mon, Dec 15, 2008 at 6:40 AM, Hideki Kato <[email protected]> wrote:
>
>>
>> Darren Cook: <[email protected]>:
>> >> Advertisement: Fudo Go used a desktop pc (Intel Q9550) and _eight_
>> >> Playstation 3 consoles on a private Gigabit Ethernet LAN.
>> >
>> >Hello Kato-sensei,
>>
>> Hello Darren,
>>
>> BTW, I'm not a sensei (Professor) but just a doctor course student of
>> 55 years old :).
>>
>> >Are you able to use all 8 cores of the playstation? So, with the 4 of
>> >the Q9550, 68 cores altogether? Do you, or your students, have any
>> >papers on the hardware challenges/solutions?
>>
>> Usual applications can use not 8 but 7 cores in fact because one SPU
>> is used exclusively to protect the secured contents by firmware. PPU
>> is not used for MC simulations but the communications over
>> network etc.
>>
>> I used one core of Intel for the client (UCT tree searcher) and other
>> three for internal MC simulators and 8 times 6 SPU's external.
>> Thus, 51 cores are used for MC simulations in total. The eight PS3
>> consoles boosted Fudo Go by, perhaps, 2 or 3 stones (ranks) on 19
>> x 19. The difference of the performance between 4 and 8 PS3's is
>> clear but I'm not sure all 6 SPU's are working in full duty, though
>> I'll study it soon.
>>
>> My last paper on parallel MCTS has no description about the
>> implementation for Cell BE. I'll submit longer paper in this
>> month but if you want to know the detail of my implementation now, you
>> can have the source code of Fudo-Go-2nd-UEC-Cup version, which is
>> exactly what I used for the tournament.
>> http://www.geocities.jp/hideki_katoh/release/fudo-go-2nd-uec-cup.tar.gz
>>
>> Hideki
>> --
>> [email protected] (Kato)
>> _______________________________________________
>> computer-go mailing list
>> [email protected]
>> http://www.computer-go.org/mailman/listinfo/computer-go/

--
[email protected] (Kato)
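
The byte load and byte store sequences described in the reply above (mask the address, load 16 bytes, shift and mask; and for stores, load, merge, store back) can be modelled in portable C. This is only a rough sketch and not SPU code or the compiler's actual output: `quad_t`, `mem`, `load_quad`, `store_quad`, `load_byte` and `store_byte` are hypothetical names, and the byte array stands in for a memory that can only be accessed in aligned 16-byte units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 128-bit register. */
typedef struct { uint8_t b[16]; } quad_t;

/* Hypothetical memory that only supports aligned 16-byte transfers. */
static uint8_t mem[64];

/* The only "hardware" load: fetch the aligned quadword containing addr. */
static quad_t load_quad(size_t addr) {
    quad_t q;
    assert(addr < sizeof mem);
    memcpy(q.b, &mem[addr & ~(size_t)15], 16);
    return q;
}

/* The only "hardware" store: write back a whole aligned quadword. */
static void store_quad(size_t addr, quad_t q) {
    assert(addr < sizeof mem);
    memcpy(&mem[addr & ~(size_t)15], q.b, 16);
}

/* Byte load: mask the address, load 16 bytes, shift, mask. */
static uint8_t load_byte(size_t addr) {
    size_t offset = addr & 15;          /* lower 4 bits of the address */
    quad_t q = load_quad(addr);         /* load the enclosing 16 bytes */
    quad_t shifted = {{0}};
    size_t shift = 15 - offset;
    /* Shift so the target byte lands in the rightmost position. */
    for (size_t i = 0; i + shift < 16; i++)
        shifted.b[i + shift] = q.b[i];
    return shifted.b[15];               /* keep only the rightmost byte */
}

/* Byte store: load the enclosing 16 bytes, merge one byte, store back. */
static void store_byte(size_t addr, uint8_t value) {
    size_t offset = addr & 15;
    quad_t q = load_quad(addr);         /* read the whole quadword */
    q.b[offset] = value;                /* replace just the target byte */
    store_quad(addr, q);                /* write all 16 bytes back */
}
```

For example, after `store_byte(21, 0xAB)`, `load_byte(21)` returns 0xAB while the neighbouring bytes at 20 and 22 are unchanged. The extra work per access in this model is what makes byte-oriented board code costly on hardware of this kind.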

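
The bitboard flood fill mentioned in the reply can also be illustrated with a small sketch. This is not the Fudo Go `final_score()` code. It assumes a hypothetical layout (bit index = row * 9 + column) and uses the GCC/Clang extension `unsigned __int128` as a stand-in for one 128-bit register; `bb_t`, `bit`, `column_mask`, `flood_fill` and `popcount` are illustrative names.

```c
#include <assert.h>

/* Assumption: GCC/Clang 128-bit integer as a stand-in for a 128-bit register. */
typedef unsigned __int128 bb_t;

/* Assumed layout: point (row, col) on the 9 x 9 board is bit row * 9 + col. */
static bb_t bit(int row, int col) {
    assert(row >= 0 && row < 9 && col >= 0 && col < 9);
    return (bb_t)1 << (row * 9 + col);
}

/* All 81 valid board points. */
static bb_t board_mask(void) {
    return ((bb_t)1 << 81) - 1;
}

/* All nine points of one column. */
static bb_t column_mask(int col) {
    bb_t m = 0;
    for (int r = 0; r < 9; r++)
        m |= bit(r, col);
    return m;
}

/* Grow seed through 4-connected neighbours, staying inside allowed. */
static bb_t flood_fill(bb_t seed, bb_t allowed) {
    const bb_t not_col0 = board_mask() & ~column_mask(0);
    const bb_t not_col8 = board_mask() & ~column_mask(8);
    bb_t region = seed & allowed;
    for (;;) {
        bb_t grown = region
                   | (region << 9)                /* next row */
                   | (region >> 9)                /* previous row */
                   | ((region & not_col8) << 1)   /* east, no wrap from col 8 */
                   | ((region & not_col0) >> 1);  /* west, no wrap from col 0 */
        grown &= allowed & board_mask();
        if (grown == region)
            return region;                        /* fixed point reached */
        region = grown;
    }
}

/* Count set bits by clearing the lowest one each iteration. */
static int popcount(bb_t b) {
    int n = 0;
    while (b) {
        b &= b - 1;
        n++;
    }
    return n;
}
```

Each iteration expands the whole region at once with a handful of shifts and masks, which is why a board that fits in a single 128-bit register suits both the SPU general registers and SSE registers. The column masks are needed because, in this layout, the last point of one row and the first point of the next row are adjacent bits but not adjacent points.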