Hi, I have come across something odd while comparing the relative performance of semaphores and mutex locks when 2 threads sum into a shared variable. As you'd expect, the mutex lock performs better on the O3 CPU with the X86 ISA, but the semaphore surprisingly does much better on the TimingSimpleCPU.
I decided to try the same thing with the ARM ISA to see whether it behaved the same way. However, ARM performs as expected: mutex locks beat semaphores on both CPU models.

These are the steps I took:

1. scons -j 3 build/X86/gem5.opt PROTOCOL=MOESI_hammer

2. build/X86/gem5.opt configs/example/fs.py -n 4 --kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img --script=configs/boot/hack_back_ckpt.rcS

3. build/X86/gem5.opt configs/example/fs.py -r1 -n4 --kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img --script=mutex.rcS --ruby --l1d_size=32kB --l1i_size=32kB --l2_size=256kB --l1d_assoc=8 --l1i_assoc=8 --l2_assoc=4 --cpu-type=TimingSimpleCPU --restore-with-cpu=TimingSimpleCPU

mutex.rcS:

    m5 resetstats
    ./mutex-test 1000000
    m5 dumpstats
    m5 resetstats
    m5 exit

I did the same for sem-test and for the DerivO3CPU.

The program I am testing with is quite simple:

1. The main thread spawns 2 pthreads.

2. This is the thread function:

    // the spawned thread
    // increments the 'count' variable 'rounds' times
    void* threadFn(void* arg)
    {
        // ensure that each thread runs on a single CPU by pinning it
        int cpuId = (int)(long) arg;
        pinThread(cpuId);

        int i;
        // 'rounds' has been set to 100,000 for this test
        for (i = 0; i < rounds; i++) {
            // use locks to atomically increment 'count'
            pthread_mutex_lock(&mutexLock);    // sem_wait(&semLock); in the sem-test variant
            count++;
            pthread_mutex_unlock(&mutexLock);  // sem_post(&semLock); in the sem-test variant
        }
        return NULL;
    }

(A sketch of the surrounding harness is in the P.S. below.)

Here are the results (sim_ticks) after running the scripts above:

+------------------------+---------------------------+
| For 100000 rounds      |    sim_ticks per round    |
| per thread             |          (in Ks)          |
|                        +------------+--------------+
|                        |    mutex   |      sem     |
+-------------+----------+------------+--------------+
| X86         | O3       |    117 K   |     240 K    |
|             | Timing   |    582 K   |     449 K    |
+-------------+----------+------------+--------------+
| ARM         | O3       |    149 K   |     184 K    |
|             | Timing   |    370 K   |     680 K    |
+-------------+----------+------------+--------------+

(The ARM runs used "--caches" instead of "--ruby", but were otherwise identical to the X86 runs.)

These trends repeat for larger numbers of 'rounds' as well.

Could someone please help me understand why this might be the case?

Thanks!
Sumanth Sridhar
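P.S. For reference, here is a minimal sketch of the kind of harness around threadFn (thread creation, CPU pinning via Linux's pthread_setaffinity_np, and a binary semaphore for the sem-test variant). The details below, including the pinThread implementation and the semaphore initialisation, are illustrative assumptions rather than my exact code:

    // minimal sketch of the surrounding harness (assumptions as noted above)
    // e.g. built with: gcc -pthread mutex-test.c -o mutex-test
    #define _GNU_SOURCE             // for CPU_ZERO/CPU_SET and pthread_setaffinity_np
    #include <pthread.h>
    #include <semaphore.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    long rounds;                    // increments per thread, taken from argv[1]
    long count;                     // the shared counter
    pthread_mutex_t mutexLock = PTHREAD_MUTEX_INITIALIZER;
    sem_t semLock;                  // used instead of mutexLock in the sem-test variant

    // pin the calling thread to a single CPU so it cannot migrate during the test
    void pinThread(int cpuId)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpuId, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    void* threadFn(void* arg);      // as shown in the message above

    int main(int argc, char** argv)
    {
        pthread_t threads[2];
        long i;

        rounds = (argc > 1) ? atol(argv[1]) : 100000;
        sem_init(&semLock, 0, 1);   // initial value 1, so it acts as a binary lock

        // spawn 2 threads; each pins itself to CPU 0 or 1 inside threadFn
        for (i = 0; i < 2; i++)
            pthread_create(&threads[i], NULL, threadFn, (void*) i);
        for (i = 0; i < 2; i++)
            pthread_join(threads[i], NULL);

        printf("count = %ld\n", count);
        return 0;
    }

With the semaphore initialised to 1 it behaves as a binary lock, so the mutex-test and sem-test variants differ only in the locking primitive used inside the loop.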