Hi,

I am seeing some strange performance behavior with pthread mutex
operations on my 4-core Solaris system running S10U5.

The issue is that when the program is run under truss, I get about 3
times the throughput of a run without truss. Strange...right?

The C program does the following:
   - accepts two parameters: the number of concurrent threads (N) and the run time in seconds
   - creates and runs N threads simultaneously for <run-time> seconds
   - each thread does the following in a tight loop:
        - acquires the shared pthread mutex
        - increments its _own_ long variable by 1
        - releases the pthread mutex
   - after the run time expires, each thread reports the value of its long variable

Note that I did not use any special compiler flags to build it. Reply
to this email if you'd like to take a look at the full program source.
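
In the meantime, here is a minimal sketch of the shape of the program
(illustrative only, not the actual source; the thread count and run
time are hard-coded instead of being parsed from -w/-r, and the stop
flag is kept deliberately simple):

/*
 * Minimal sketch of the benchmark described above. Illustrative only,
 * not the actual mutex_ops source.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

#define NTHREADS  4     /* corresponds to -w */
#define RUNTIME   10    /* corresponds to -r (seconds) */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int stop;           /* set by main() when time is up */
static long counts[NTHREADS];       /* one private counter per thread */

static void *
worker(void *arg)
{
        int id = (int)(intptr_t)arg;

        while (!stop) {
                pthread_mutex_lock(&lock);    /* acquire the shared mutex */
                counts[id]++;                 /* bump this thread's own counter */
                pthread_mutex_unlock(&lock);  /* release the mutex */
        }
        return (NULL);
}

int
main(void)
{
        pthread_t tids[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, worker, (void *)(intptr_t)i);

        sleep(RUNTIME);               /* let the workers run */
        stop = 1;

        for (i = 0; i < NTHREADS; i++)
                pthread_join(tids[i], NULL);

        for (i = 0; i < NTHREADS; i++)
                printf("Thread %d: total ops: %ld, ops/second: %ld\n",
                    i, counts[i], counts[i] / RUNTIME);
        return (0);
}

The only shared state the threads contend on is the single mutex; each
counter is private to its thread.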

Let's look at the program without truss enabled.

$ mutex_ops -w4 -r10
   number of threads:4,   run-time (in seconds)=10
   Thread    1:   total ops: 4580775,   ops/second:   458077
   Thread    3:   total ops: 3798001,   ops/second:   379800
   Thread    0:   total ops: 4867850,   ops/second:   486785
   Thread    2:   total ops: 4925299,   ops/second:   492529
$
<<vmstat 1>>
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 -- --   in   sy   cs us sy id
 0 0 0 17121968 14961748 22 205 0 0 0 0  0  0  0  0  0 7960 16218 16155 28 2 71
 0 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 19914 41572 41568 81 3 16
 1 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 18711 39086 38937 80 3 17
 0 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 19006 39373 39382 80 3 17
 0 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 18719 39151 39007 80 3 16
 0 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 19560 41243 40963 81 4 16
 0 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 20498 43671 43203 81 4 16
 0 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 20857 44636 44014 80 4 15
 0 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 19149 41239 40583 80 4 16
 0 0 0 17121968 14961748 0 0 0  0  0  0  0  0  0  0  0 21019 44189 43995 81 4 15
 0 0 0 17122372 14962176 7 29 0 0  0  0  0  0  0  0  0 13062 27003 26973 46 2 52
 0 0 0 17122372 14962176 0 0 0  0  0  0  0  0  0  0  0  461  160  274  0  0 100

As you can see above, I am running the program with 4 concurrent
threads for 10 seconds. The output shows that each thread acquired the
mutex roughly 380,000 to 490,000 times a second. vmstat shows about
80% user time, 3-4% system time and ~16% idle while the program is
running. So far so good.

Now let's look at the same program run under truss.

$ truss -fc mutex_ops -w4 -r10
   number of threads:4,   run-time (in seconds)=10
   Thread    0:   total ops: 13419838,   ops/second:  1341983
   Thread    2:   total ops: 13959115,   ops/second:  1395911
   Thread    1:   total ops: 13414291,   ops/second:  1341429
   Thread    3:   total ops: 13535670,   ops/second:  1353567

syscall               seconds   calls  errors
_exit                    .000       1
write                    .000       5
open                     .000       3       1
close                    .000       2
stat                     .000      10       7
getpid                   .000       1
fstat                    .000       1
ioctl                    .000       1
execve                   .000       1
lwp_park                 .905  106423
lwp_unpark               .757  106423
sigaction                .000       1
getcontext               .000       1
setustack                .000       5
mmap                     .000      16
munmap                   .000       2
getrlimit                .000       1
memcntl                  .000       1
sysconfig                .000       2
yield                    .182   26004
lwp_create               .000       4
lwp_exit                 .000       4
lwp_continue             .000       4
lwp_sigmask              .000       4
lwp_private              .000       1
lwp_wait                 .000       4
lwp_mutex_wakeup         .000       9
nanosleep                .000       1
schedctl                 .000       5
resolvepath              .000       4
lwp_mutex_timedlock      .000      32
getcwd                   .000       1
                     --------  ------   ----
sys totals:             1.846  238977      8
usr time:              21.363
elapsed:               10.060
$

<<vmstat 1>>
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 -- --   in   sy   cs us sy id
 0 0 0 17122372 14962176 0 0 0  0  0  0  0  0  0  0  0  529  262  335  0  0 100
 0 0 0 17121240 14960660 153 632 0 0 0 0 0  0  0  0  0 25259 91754 87992 37 15 48
 0 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 62751 178228 179102 57 33 10
 0 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 56922 169565 168507 59 30 11
 0 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 60935 178967 178535 58 32 10
 0 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 61581 180239 179751 58 32 10
 0 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 61486 181664 180513 58 33 10
 1 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 67941 186205 190703 55 35 10
 0 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 59981 175708 175758 58 32 10
 1 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 60485 174557 175091 58 32 11
 1 0 0 17121104 14960524 0 0 0  0  0  0  0  0  0  0  0 62799 178328 179880 57 32 10
 0 0 0 17122236 14962040 8 17 0 0  0  0  0  0  0  0  0 19483 58122 58317 19 10 71
 0 0 0 17122236 14962040 0 0 0  0  0  0  0  0  0  0  0  449  160  279  0  0 100

Can you spot the issue? I used the same parameters (4 concurrent
threads, 10 seconds run time), but this time *with truss enabled*. The
output shows each thread acquiring the mutex over 1,300,000 times a
second, roughly three times the rate of the untrussed run. vmstat also
shows far more system time than before (~30% vs ~4%). What I don't
understand is how running under truss, which adds overhead and drives
system time up, can end up tripling the throughput. Am I missing
something?

BTW, I confirmed the mutex lock counts reported by the program using
DTrace, and the values are indeed correct in both cases.
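
(For anyone who wants to repeat that sanity check: a plockstat-provider
one-liner along the following lines is enough to count acquisitions per
thread. This is a sketch of the idea, not necessarily the exact script
I ran.)

$ dtrace -n 'plockstat$target:::mutex-acquire { @acquires[tid] = count(); }' -c 'mutex_ops -w4 -r10'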

Any inputs/comments are appreciated.

thanks.

senthil
