Could you please provide lock statistics from plockstat for each of these cases? I think I know what you are seeing, but I want data to back up the theory.
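
Something along these lines should do it (assuming the binary is called
mutex_ops and takes the same -w/-r options shown in your output; plockstat's
-A collects both contention and hold events, and -e sets how long to sample):

    plockstat -A -e 10 ./mutex_ops -w4 -r10

For the truss case, you could start the truss run as before and attach
plockstat to the process from a second shell, something like:

    truss -fc ./mutex_ops -w4 -r10 &
    plockstat -A -e 10 -p <pid>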
Thanks

Dave Valin

senthil ramanujam wrote:
> Hi,
>
> I am seeing a strange performance behavior with pthread mutex
> operations on my Solaris (4-core) system running S10U5.
>
> The issue is that if the program is run with truss enabled, I get
> about 3 times the performance of the run without truss enabled.
> Strange... right?
>
> The C program does the following:
> - accepts 2 parameters [number of concurrent threads (N) and run-time]
> - creates and runs N threads simultaneously for <run-time> seconds
> - each thread does the following:
>   - acquires the pthread mutex
>   - increases the value of its _own_ long variable by 1
>   - releases the pthread mutex
> - after run-time expires, each thread reports the value of its long
>   variable
>
> Note that I did not use any special flags to compile it. A simplified
> sketch of the loop is included below; reply to this email if you would
> like to take a look at the full program source code.
>
> Let's look at the program without truss enabled.
>
> $ mutex_ops -w4 -r10
> number of threads:4, run-time (in seconds)=10
> Thread 1: total ops: 4580775, ops/second: 458077
> Thread 3: total ops: 3798001, ops/second: 379800
> Thread 0: total ops: 4867850, ops/second: 486785
> Thread 2: total ops: 4925299, ops/second: 492529
> $
>
> <<vmstat 1>>
>  kthr      memory            page            disk          faults      cpu
>  r b w   swap     free   re  mf pi po fr de sr s0 s1 -- --    in    sy    cs us sy  id
>  0 0 0 17121968 14961748 22 205  0  0  0  0  0  0  0  0  0  7960 16218 16155 28  2  71
>  0 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 19914 41572 41568 81  3  16
>  1 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 18711 39086 38937 80  3  17
>  0 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 19006 39373 39382 80  3  17
>  0 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 18719 39151 39007 80  3  16
>  0 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 19560 41243 40963 81  4  16
>  0 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 20498 43671 43203 81  4  16
>  0 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 20857 44636 44014 80  4  15
>  0 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 19149 41239 40583 80  4  16
>  0 0 0 17121968 14961748  0   0  0  0  0  0  0  0  0  0  0 21019 44189 43995 81  4  15
>  0 0 0 17122372 14962176  7  29  0  0  0  0  0  0  0  0  0 13062 27003 26973 46  2  52
>  0 0 0 17122372 14962176  0   0  0  0  0  0  0  0  0  0  0   461   160   274  0  0 100
>
> As you can see above, I am executing the program with 4 concurrent
> threads for 10 seconds. The output shows that each thread acquired the
> mutex lock about 400,000 times a second. vmstat shows that the program
> used about 80% user time, 4% system time, and ~15% idle. So far so
> good.
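>
> Roughly, the core of the program looks like the following (a simplified
> sketch rather than the exact source; option parsing, error checking and
> the exact reporting format are left out, and the names here are only
> illustrative):
>
>   #include <pthread.h>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
>   static volatile int stop = 0;
>
>   /* each worker bumps its own long counter under the shared mutex */
>   static void *worker(void *arg)
>   {
>       long *ops = arg;
>
>       while (!stop) {
>           pthread_mutex_lock(&lock);
>           (*ops)++;
>           pthread_mutex_unlock(&lock);
>       }
>       return NULL;
>   }
>
>   int main(void)
>   {
>       int nthreads = 4, runtime = 10;   /* -w and -r in the real program */
>       pthread_t tid[64];
>       long counts[64] = { 0 };
>       int i;
>
>       for (i = 0; i < nthreads; i++)
>           pthread_create(&tid[i], NULL, worker, &counts[i]);
>
>       sleep(runtime);                   /* run for <run-time> seconds */
>       stop = 1;
>
>       for (i = 0; i < nthreads; i++) {
>           pthread_join(tid[i], NULL);
>           printf("Thread %d: total ops: %ld, ops/second: %ld\n",
>                  i, counts[i], counts[i] / runtime);
>       }
>       return 0;
>   }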
>
> Now, let's look at the program with truss enabled.
>
> $ truss -fc mutex_ops -w4 -r10
> number of threads:4, run-time (in seconds)=10
> Thread 0: total ops: 13419838, ops/second: 1341983
> Thread 2: total ops: 13959115, ops/second: 1395911
> Thread 1: total ops: 13414291, ops/second: 1341429
> Thread 3: total ops: 13535670, ops/second: 1353567
>
> syscall               seconds   calls  errors
> _exit                    .000       1
> write                    .000       5
> open                     .000       3       1
> close                    .000       2
> stat                     .000      10       7
> getpid                   .000       1
> fstat                    .000       1
> ioctl                    .000       1
> execve                   .000       1
> lwp_park                 .905  106423
> lwp_unpark               .757  106423
> sigaction                .000       1
> getcontext               .000       1
> setustack                .000       5
> mmap                     .000      16
> munmap                   .000       2
> getrlimit                .000       1
> memcntl                  .000       1
> sysconfig                .000       2
> yield                    .182   26004
> lwp_create               .000       4
> lwp_exit                 .000       4
> lwp_continue             .000       4
> lwp_sigmask              .000       4
> lwp_private              .000       1
> lwp_wait                 .000       4
> lwp_mutex_wakeup         .000       9
> nanosleep                .000       1
> schedctl                 .000       5
> resolvepath              .000       4
> lwp_mutex_timedlock      .000      32
> getcwd                   .000       1
>                        --------  ------   ----
> sys totals:              1.846  238977       8
> usr time:               21.363
> elapsed:                10.060
> $
>
> <<vmstat 1>>
>  kthr      memory            page            disk          faults      cpu
>  r b w   swap     free    re  mf pi po fr de sr s0 s1 -- --    in     sy     cs us sy  id
>  0 0 0 17122372 14962176   0   0  0  0  0  0  0  0  0  0  0   529    262    335  0  0 100
>  0 0 0 17121240 14960660 153 632  0  0  0  0  0  0  0  0  0 25259  91754  87992 37 15  48
>  0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 62751 178228 179102 57 33  10
>  0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 56922 169565 168507 59 30  11
>  0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 60935 178967 178535 58 32  10
>  0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 61581 180239 179751 58 32  10
>  0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 61486 181664 180513 58 33  10
>  1 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 67941 186205 190703 55 35  10
>  0 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 59981 175708 175758 58 32  10
>  1 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 60485 174557 175091 58 32  11
>  1 0 0 17121104 14960524   0   0  0  0  0  0  0  0  0  0  0 62799 178328 179880 57 32  10
>  0 0 0 17122236 14962040   8  17  0  0  0  0  0  0  0  0  0 19483  58122  58317 19 10  71
>  0 0 0 17122236 14962040   0   0  0  0  0  0  0  0  0  0  0   449    160    279  0  0 100
>
> Can you spot the issue? I used the same parameters (4 concurrent
> threads, 10 seconds run-time), but this time *with truss enabled*. The
> output shows that each thread acquired the mutex lock about 1,300,000
> times a second. vmstat shows that the program is using a lot more
> system time than before. However, I don't understand how truss can
> cause the system time to go up, which ultimately increases the
> performance. Am I missing something?
>
> BTW, I confirmed the mutex lock counts generated by the program using
> dtrace. The values are indeed correct for both cases.
>
> Any inputs/comments are appreciated.
>
> thanks.
>
> senthil

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org