Hi,

I have come across something odd while comparing the relative
performance of semaphores and mutex locks when two threads sum into a
shared variable. As you'd expect, the mutex lock performs better when
simulating on the O3 CPU with the X86 ISA, but semaphores surprisingly
do much better on the TimingSimpleCPU.

I decided to try the same experiment with the ARM ISA to see if it
behaved the same way. However, ARM performs as expected: mutex locks
beat semaphores on both CPU models.

These are the steps I took (step 2 boots Linux and takes a checkpoint
via hack_back_ckpt.rcS; step 3 restores from that checkpoint):

1. scons -j 3 build/X86/gem5.opt PROTOCOL=MOESI_hammer
2. build/X86/gem5.opt configs/example/fs.py -n 4
--kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img
--script=configs/boot/hack_back_ckpt.rcS
3. build/X86/gem5.opt configs/example/fs.py -r1 -n4
--kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img
--script=mutex.rcS --ruby --l1d_size=32kB --l1i_size=32kB
--l2_size=256kB --l1d_assoc=8 --l1i_assoc=8 --l2_assoc=4
--cpu-type=TimingSimpleCPU --restore-with-cpu=TimingSimpleCPU

mutex.rcS:
m5 resetstats
./mutex-test 1000000
m5 dumpstats

m5 resetstats
m5 exit

Similarly for sem-test and the DerivO3CPU.
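
For reference, the semaphore script (a hypothetical sem.rcS, assuming
it simply mirrors mutex.rcS above) would just swap in the other
binary, and the DerivO3CPU runs would presumably swap --cpu-type and
--restore-with-cpu in step 3:

sem.rcS:
m5 resetstats
./sem-test 1000000
m5 dumpstats

m5 resetstats
m5 exit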

The program I am testing with is quite simple:
1. The main thread spawns two pthreads.
2. Each thread runs the following function:

// the spawned thread
// increments the 'count' variable 'rounds' times
void* threadFn(void* arg)
{
    // ensure that each thread runs on a single CPU by pinning
    int cpuId = (int)(long) arg;
    pinThread(cpuId);

    int i;
    // 'rounds' has been set to 100,000 for this test
    for (i = 0; i < rounds; i++)
    {
        // serialize the increment of 'count'; for the semaphore
        // test, the sem_* calls replace the mutex calls
        pthread_mutex_lock(&mutexLock);
        // sem_wait(&semLock);
        count++;
        pthread_mutex_unlock(&mutexLock);
        // sem_post(&semLock);
    }
    return NULL;
}
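
For completeness, here is a minimal sketch of the rest of the program
(the sched_setaffinity-based pinning and the main-thread setup are
simplified; error checking omitted):

#define _GNU_SOURCE   // for CPU_ZERO/CPU_SET and sched_setaffinity
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

// shared state referenced by threadFn
// (in a single file, these declarations precede threadFn)
pthread_mutex_t mutexLock = PTHREAD_MUTEX_INITIALIZER;
sem_t semLock;
long count = 0;
int rounds;

void* threadFn(void* arg);  // defined above

// pin the calling thread to the given CPU so each worker
// stays on its own core for the whole run
void pinThread(int cpuId)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpuId, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

int main(int argc, char** argv)
{
    pthread_t t[2];
    long i;

    rounds = atoi(argv[1]);    // e.g. ./mutex-test 1000000
    sem_init(&semLock, 0, 1);  // binary semaphore for the sem variant

    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, threadFn, (void*) i);
    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("count = %ld\n", count);
    return 0;
}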

Here are the results after running the above scripts:

For 100,000 rounds per thread:

+------+--------+------------------------+
| ISA  | CPU    | sim_ticks per round    |
|      |        | (in thousands)         |
|      |        +-----------+------------+
|      |        |   mutex   |    sem     |
+------+--------+-----------+------------+
| X86  | O3     |    117    |    240     |
|      | Timing |    582    |    449     |
+------+--------+-----------+------------+
| ARM  | O3     |    149    |    184     |
|      | Timing |    370    |    680     |
+------+--------+-----------+------------+
(The ARM runs used "--caches" instead of "--ruby", but were otherwise
identical to the method used for X86.)
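
(For a sense of scale: at gem5's default tick of 1 ps, and taking the
per-round figure as total sim_ticks divided by the 100,000 rounds per
thread, the X86/O3 mutex run works out to 117,000 x 100,000 = 1.17e10
ticks, i.e. about 11.7 ms of simulated time, or ~117 ns per
lock/increment/unlock round.)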

The same trends hold for larger values of 'rounds' as well.

Could someone please help me understand why this might be the case?

Thanks!
Sumanth Sridhar