I have benchmarked gnubg on two server machines, with particular focus on multithreading. Both machines are headless and run Debian 5.x (Lenny), kernel 2.6.26-2-amd64 #1 SMP x86_64 GNU/Linux. The hardware is:

box_A: 2x Xeon 5130 @ 2 GHz (4 physical cores in 2 chips)
box_B: 2x Xeon Nocona @ 3 GHz (2 physical cores plus 2 HT "cores" in 2 chips)

I found two issues with current gnubg (latest CVS version as of August 1st 2009, compiled with gcc 4.3.2.1 with -march=native and SSE2 support):

1) The "calibrate" command output is off by a factor of 1000, i.e. it reports eval/s values 1000 times too high. This also holds for the figure reported by the official Debian binary installed via apt-get.

2) The limit of 16 threads is too low. I found that 8 threads per core are needed to utilize the CPU power to 100%. Interestingly, this holds for the virtual HT cores as well.

@1: Please check the timer code; the problem seems to be in timer.c. Obviously the #ifdef part for Windows is fine, but all other machines use a faulty version of the timer. I can't really suggest a solution, but here is some background info from Wikipedia: http://en.wikipedia.org/wiki/Rdtsc
I would be happy to help fix this one by testing on the aforementioned machines under 64-bit Linux.

@2: I've tested with a custom gnubg binary in which the bug from @1 is fixed the hard way, with a hardcoded division by 1000, and the thread limit is raised to 256. While calibrate was running I monitored CPU utilization using the mpstat command.

box_A peaks at about 202K eval/s with 8 threads per core (32 total). From there on the number stays flat until it starts decreasing again when you use hundreds of threads. Between 1 and 3 threads I see the expected gain of almost 100% per thread added. Using 4 threads lowers the throughput compared to 3 threads. Between 5 and 32 threads throughput rises, at first linearly and then flattening out as we get closer to 32 threads. Below 32 threads, mpstat reports significant idle time for each CPU; at 32, each CPU is using 100% of its cycles. box_B shows very similar behavior, despite the fact that 2 of its "cores" are virtual HT cores.

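Regarding issue 1, here is a minimal sketch of how an eval/s figure can be computed on Linux with every unit conversion spelled out. It is not taken from timer.c and makes no claim about what that file actually does. The helper names now_monotonic, elapsed_ms and evals_per_second are made up for illustration, and the use of clock_gettime(CLOCK_MONOTONIC) is an assumption about what a portable non-Windows timer might use. The point is only that if the elapsed time is in one unit (say, microseconds) while the rate formula expects another (milliseconds), the result comes out wrong by exactly a factor of 1000.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical helper: read a monotonic wall-clock timestamp.
 * Assumes a POSIX system that provides CLOCK_MONOTONIC. */
static int now_monotonic(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Hypothetical helper: elapsed time between two timestamps,
 * in MILLISECONDS. tv_sec is in seconds (x 1000), tv_nsec is
 * in nanoseconds (/ 1e6). */
static double elapsed_ms(const struct timespec *start,
                         const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1000.0
         + (double)(end->tv_nsec - start->tv_nsec) / 1.0e6;
}

/* Hypothetical helper: evaluations per second, given a count and
 * an elapsed time in MILLISECONDS. Passing microseconds here by
 * mistake would skew the result by a factor of 1000. */
static double evals_per_second(unsigned long evals, double ms)
{
    assert(ms >= 0.0);            /* a negative duration is a caller bug */
    if (ms == 0.0)
        return 0.0;               /* avoid division by zero */
    return (double)evals * 1000.0 / ms;
}
```

A caller would take one timestamp with now_monotonic() before the evaluation loop and one after, then pass the evaluation count and elapsed_ms() of the two timestamps to evals_per_second().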
Extrapolating the results suggests gnubg should raise the limit on the maximum number of threads to 64, maybe even 128 or 256. Rationale: recent server hardware with dual quad-cores has 8 cores, which at 8 threads per core should be fully utilizable only with 64 threads. The suggested 128 anticipates future improvements. As there seems to be little to no cost to a higher thread limit, this seems like a cheap way to speed up gnubg on server-class machines and quad-cores.

Cheers,
Ingo

_______________________________________________
Bug-gnubg mailing list
http://lists.gnu.org/mailman/listinfo/bug-gnubg
