Hi,
I have come across something odd while comparing the relative
performance of semaphores and mutex locks when two threads sum into a
shared variable. As you'd expect, the mutex lock performs better than
the semaphore when simulating on the O3 CPU with the X86 ISA, but the
semaphore surprisingly does much better on the TimingSimpleCPU.
I tried the same experiment with the ARM ISA to see whether it behaved
the same way. However, ARM performs as expected: mutex locks beat
semaphores on both CPU models.
These are the steps I took:
1. scons -j 3 build/X86/gem5.opt PROTOCOL=MOESI_hammer
2. build/X86/gem5.opt configs/example/fs.py -n 4
--kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img
--script=configs/boot/hack_back_ckpt.rcS
3. build/X86/gem5.opt configs/example/fs.py -r1 -n4
--kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img
--script=mutex.rcS --ruby --l1d_size=32kB --l1i_size=32kB
--l2_size=256kB --l1d_assoc=8 --l1i_assoc=8 --l2_assoc=4
--cpu-type=TimingSimpleCPU --restore-with-cpu=TimingSimpleCPU
(step 2 boots Linux and takes a checkpoint via hack_back_ckpt.rcS;
step 3 restores from that checkpoint with the Ruby memory system and
the CPU under test)
mutex.rcS:
# reset stats just before the region of interest
m5 resetstats
./mutex-test 1000000
# dump stats for the measured region, then reset again before exiting
m5 dumpstats
m5 resetstats
m5 exit
Similarly for sem-test and the DerivO3CPU.
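For example, the O3 semaphore run restores from the same checkpoint
with only the script and the CPU type changed (assuming the semaphore
script is named sem.rcS; all other flags are as in step 3):

build/X86/gem5.opt configs/example/fs.py -r1 -n4
--kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img
--script=sem.rcS --ruby --l1d_size=32kB --l1i_size=32kB
--l2_size=256kB --l1d_assoc=8 --l1i_assoc=8 --l2_assoc=4
--cpu-type=DerivO3CPU --restore-with-cpu=DerivO3CPU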
The program I am testing with is quite simple:
1. The main thread spawns two pthreads, each pinned to its own CPU.
2. Each thread runs this function:
// shared state implied by the test (exact types assumed); the mutex
// build is shown, and the semaphore build swaps in the sem_* calls
#include <pthread.h>
#include <semaphore.h>

static pthread_mutex_t mutexLock = PTHREAD_MUTEX_INITIALIZER;
static sem_t semLock;            // sem build: initialized to 1
static volatile long count = 0;  // the shared sum
static long rounds;              // set to 100,000 for this test

void pinThread(int cpuId);       // helper, sketched below

// the spawned thread increments the 'count' variable 'rounds' times
void* threadFn(void* arg)
{
    // ensure that each thread runs on a single CPU by pinning it
    int cpuId = (int)(long) arg;
    pinThread(cpuId);

    int i;
    for (i = 0; i < rounds; i++)
    {
        // use the lock to atomically increment 'count'
        pthread_mutex_lock(&mutexLock);    // sem build: sem_wait(&semLock);
        count++;
        pthread_mutex_unlock(&mutexLock);  // sem build: sem_post(&semLock);
    }
    return NULL;
}
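For completeness, here is a minimal sketch of the rest of the harness
(pinThread is shown as one plausible implementation on top of
pthread_setaffinity_np; my actual helper and argument handling may
differ slightly, and the sem build additionally calls sem_init):

#define _GNU_SOURCE   // for pthread_setaffinity_np; must precede headers
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

// pin the calling thread to the given CPU
void pinThread(int cpuId)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpuId, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(int argc, char** argv)
{
    pthread_t t[2];
    long i;

    // assuming the command-line argument is the round count
    rounds = atol(argv[1]);
    // sem build only: sem_init(&semLock, 0, 1);

    // spawn the two workers, passing the CPU id as the argument
    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, threadFn, (void*) i);
    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}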
Here are the results (sim_ticks) after running the scripts above:
+--------+--------+-----------+-----------+
| 100,000 rounds  |  sim_ticks per round  |
| per thread      +-----------+-----------+
|                 |   mutex   |    sem    |
+--------+--------+-----------+-----------+
| X86    | O3     |   117 K   |   240 K   |
|        | Timing |   582 K   |   449 K   |
+--------+--------+-----------+-----------+
| ARM    | O3     |   149 K   |   184 K   |
|        | Timing |   370 K   |   680 K   |
+--------+--------+-----------+-----------+
(the ARM runs used "--caches" instead of "--ruby", but were otherwise
identical to the X86 method)
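The per-round figures above are the total sim_ticks for the measured
region divided by the round count; the total is read from the stats
dump, e.g. (default gem5 output directory assumed):

grep sim_ticks m5out/stats.txt   # take the dump for the measured region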
The same trends hold for larger values of 'rounds' as well.
Could someone please help me understand why this might be the case?
Thanks!
Sumanth Sridhar