On Apr 10, 2008, at 9:58 AM, Zeljko Vrba wrote:
> I have a 1:M producer-consumer problem (1 producer, M consumers)  
> that I use for benchmarking.  Every (producer, consumer_i) pair has  
> a dedicated message queue implemented with a linked list protected  
> with one pthread_mutex_t and two pthread_cond_t with default  
> attributes.
>
> A 256MB block of RAM is divided into M = 2^0 .. 2^14 chunks, and for  
> each chunk a new thread is created; the producer also has a  
> dedicated thread (yes, I create up to 16385 threads). For each  
> chunk, a simple message consisting of (pointer,length) pair is sent  
> 128 times to each of the workers and AES-encrypted.  When the worker  
> is finished with a chunk, it sends the reply to the producer. If the  
> message queue is empty, the threads block on their respective  
> condition variables associated with the queues. The producer first  
> sends all 128 chunks to the workers (which are signalled if  
> necessary), then waits for 128 replies from each of them.  This is  
> done serially, and not round-robin (i.e. first 128 messages are sent  
> to worker 0, then to worker 1, etc..; similarly when reading  
> replies). Each communication also includes allocation and freeing of  
> a message; for this the umem allocator is used with a cache.  The  
> threads' stack sizes are set to 16kB.
>
> The benchmark is compiled in 64-bit mode and executed on Solaris 10,  
> dual-core AMD64 (1 socket with two cores) and 2GB of RAM. Now the  
> results: for M=2^1 .. M=2^11 (2 .. 2048) threads, the running time  
> (wall-clock time) is fairly constant around ~10 seconds.  Beyond  
> this number, as M doubles, the running time also roughly doubles:  
> (2^12 threads, 13s), (2^13 threads, 20s), (2^14 threads, 35s).
>
> Running iostat and vmstat in parallel confirms that no swapping  
> occurs. 33% of the time is reported as spent in system (with 9% of  
> CPU time idle?!), with ~150k/sec system calls and ~120k/sec context  
> switches.
>
> Can anybody offer some insight on why this sudden degradation in  
> performance occurs?


Do you have administrator access to this system?  With dtrace(1M), you  
can drill down on this, and get more data on what's going on.  See:

http://www.sun.com/bigadmin/content/dtrace/

for an intro.  As a start, get the output from:

   dtrace -n 'profile-97{@[stack(), ustack()] = count();} END {trunc(@, 10);}' \
      -c "your test command here"

where 'your test command here' is replaced with an invocation with  
2^13 threads.[1]  This will give us more data to start from.

Cheers,
- jonathan

[1]  This invocation collects kernel and userland stack traces from  
all CPUs 97 times a second, while your command is running.  We count  
the number of times an identical stack trace is hit.  When the test  
command finishes, we keep only the 10 "most-hit" traces, and the  
dtrace command prints them out, along with their counts.  On an idle  
system, you'd see something like:

# dtrace -n 'profile-97{@[stack(), ustack()] = count();} END {trunc(@, 10);}' \
    -c 'sleep 2'
dtrace: description 'profile-97' matched 2 probes
dtrace: pid 102979 has exited
CPU     ID                    FUNCTION:NAME
  10      2                             :END


               unix`mach_cpu_idle+0x17
               unix`cpu_idle+0xdd
               unix`idle+0x10e
               unix`thread_start+0x8

              3120



--------------------------------------------------------------------------
Jonathan Adams, Sun Microsystems, ZFS Team    http://blogs.sun.com/jwadams

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
