Hi, I am seeing strange performance behavior with pthread mutex operations on my 4-core Solaris system running S10U5.

The issue is that if the program is run under truss, I get about 3 times the throughput of a run without truss. Strange, right?

The C program does the following:
- accepts 2 parameters: the number of concurrent threads (N) and the run time in seconds
- creates and runs N threads simultaneously for <run-time> seconds
- each thread repeatedly:
  - acquires the pthread mutex
  - increments its _own_ long variable by 1
  - releases the pthread mutex
- after the run time expires, each thread reports the value of its long variable

Note that I did not use any special flags to compile it. Reply back to this email if you would like to take a look at the full program source code.
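In the meantime, here is a minimal sketch of what the benchmark does. This is not the actual source -- the identifiers, the MAX_THREADS limit, and the hard-coded option values are illustrative stand-ins for the real program's -w and -r argument handling:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_THREADS 64

    /* One global mutex shared by all threads -- the contention point. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int running = 1;    /* cleared after the run time expires;
                                           volatile is good enough for a sketch */
    static long counters[MAX_THREADS];  /* one counter per thread */

    static void *
    worker(void *arg)
    {
        int id = (int)(long)arg;

        while (running) {
            pthread_mutex_lock(&lock);
            counters[id]++;             /* each thread bumps only its own slot */
            pthread_mutex_unlock(&lock);
        }
        return (NULL);
    }

    int
    main(void)
    {
        int nthreads = 4, runtime = 10; /* stand-ins for the -w and -r options */
        pthread_t tids[MAX_THREADS];
        int i;

        for (i = 0; i < nthreads; i++)
            pthread_create(&tids[i], NULL, worker, (void *)(long)i);

        sleep(runtime);                 /* let the workers run for the duration */
        running = 0;

        for (i = 0; i < nthreads; i++) {
            pthread_join(tids[i], NULL);
            printf("Thread %d: total ops: %ld, ops/second: %ld\n",
                i, counters[i], counters[i] / runtime);
        }
        return (0);
    }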
Let's look at the program without truss enabled:

$ mutex_ops -w4 -r10
number of threads:4, run-time (in seconds)=10
Thread 1: total ops: 4580775, ops/second: 458077
Thread 3: total ops: 3798001, ops/second: 379800
Thread 0: total ops: 4867850, ops/second: 486785
Thread 2: total ops: 4925299, ops/second: 492529

<<vmstat 1>>
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 -- --    in    sy    cs us sy id
 0 0 0 17121968 14961748  22 205  0  0  0  0  0  0  0  0  0  7960 16218 16155 28  2 71
 0 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 19914 41572 41568 81  3 16
 1 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 18711 39086 38937 80  3 17
 0 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 19006 39373 39382 80  3 17
 0 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 18719 39151 39007 80  3 16
 0 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 19560 41243 40963 81  4 16
 0 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 20498 43671 43203 81  4 16
 0 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 20857 44636 44014 80  4 15
 0 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 19149 41239 40583 80  4 16
 0 0 0 17121968 14961748   0   0  0  0  0  0  0  0  0  0  0 21019 44189 43995 81  4 15
 0 0 0 17122372 14962176   7  29  0  0  0  0  0  0  0  0  0 13062 27003 26973 46  2 52
 0 0 0 17122372 14962176   0   0  0  0  0  0  0  0  0  0  0   461   160   274  0  0 100

As you can see, I am running the program with 4 concurrent threads for 10 seconds. The output shows that each thread acquired the mutex roughly 380,000-490,000 times per second. vmstat shows the program using about 80% user time, 3-4% system time, and ~15-17% idle. So far so good. Now, let's look at the run with truss.
$ truss -fc mutex_ops -w4 -r10
number of threads:4, run-time (in seconds)=10
Thread 0: total ops: 13419838, ops/second: 1341983
Thread 2: total ops: 13959115, ops/second: 1395911
Thread 1: total ops: 13414291, ops/second: 1341429
Thread 3: total ops: 13535670, ops/second: 1353567

syscall               seconds   calls  errors
_exit                    .000       1
write                    .000       5
open                     .000       3       1
close                    .000       2
stat                     .000      10       7
getpid                   .000       1
fstat                    .000       1
ioctl                    .000       1
execve                   .000       1
lwp_park                 .905  106423
lwp_unpark               .757  106423
sigaction                .000       1
getcontext               .000       1
setustack                .000       5
mmap                     .000      16
munmap                   .000       2
getrlimit                .000       1
memcntl                  .000       1
sysconfig                .000       2
yield                    .182   26004
lwp_create               .000       4
lwp_exit                 .000       4
lwp_continue             .000       4
lwp_sigmask              .000       4
lwp_private              .000       1
lwp_wait                 .000       4
lwp_mutex_wakeup         .000       9
nanosleep                .000       1
schedctl                 .000       5
resolvepath              .000       4
lwp_mutex_timedlock      .000      32
getcwd                   .000       1
                     --------  ------   ----
sys totals:             1.846  238977      8
usr time:              21.363
elapsed:               10.060

<<vmstat 1>>
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 -- --    in     sy     cs us sy id
 0 0 0 17122372 14962176   0   0  0  0  0  0  0  0  0  0  0   529    262    335  0  0 100
 0 0 0 17121240 14960660 153 632  0  0  0  0  0  0  0  0  0 25259  91754  87992 37 15 48
 0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 62751 178228 179102 57 33 10
 0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 56922 169565 168507 59 30 11
 0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 60935 178967 178535 58 32 10
 0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 61581 180239 179751 58 32 10
 0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 61486 181664 180513 58 33 10
 1 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 67941 186205 190703 55 35 10
 0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 59981 175708 175758 58 32 10
 1 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 60485 174557 175091 58 32 11
 1 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 62799 178328 179880 57 32 10
 0 0 0 17122236 14962040   8  17  0  0  0  0  0  0  0  0  0 19483  58122  58317 19 10 71
 0 0 0 17122236 14962040   0   0  0  0  0  0  0  0  0  0  0   449    160    279  0  0 100

Can you spot the issue? I used the same parameters (4 concurrent threads, 10 seconds run time), but this time *with truss enabled*. The output shows that each thread acquired the mutex roughly 1,340,000-1,400,000 times per second, and vmstat shows the program using a lot more system time than before. What I don't understand is how truss can drive the system time up and at the same time increase the throughput. Am I missing something?

BTW, I confirmed the mutex acquisition counts generated by the program using dtrace, and the values are indeed correct in both cases.

Any inputs/comments are appreciated.

thanks,
senthil
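P.S. For anyone who wants to repeat the dtrace check: a plockstat-provider one-liner along these lines counts per-thread mutex acquisitions (a sketch of the kind of script I mean, not the exact one I ran; "acquires" is just an aggregation name):

$ dtrace -n 'plockstat$target:::mutex-acquire { @acquires[tid] = count(); }' -c "mutex_ops -w4 -r10"

The per-tid counts printed on exit should roughly line up with the per-thread totals the program itself reports (modulo a handful of acquisitions of other libc-internal locks).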