Re: [Pvfs2-developers] threaded client-core and the device thread

Dean Hildebrand Tue, 17 Oct 2006 08:58:45 -0700

Hi Murali/Phil,

Murali Vilayannur wrote:

Hi Sam,
Dean and I are looking at trying to improve the efficiency of requests from the kernel module up through the device to client-core. I added the --threaded option to the client to allow the client-core to run with multiple threads (one each for bmi, dev, and main -- and also a remount thread, but let's ignore that for now), so the device thread should be able to keep pulling requests off the device without having to wait for bmi operations to complete.
Cool!
This could address some of the performance problems that Phil also had pointed out a while back, where multiple outstanding requests were slower than a single outstanding request.

Just to see if I'm noticing the same issue, what was the exact problem Phil was noticing? Shouldn't multiple requests take longer than a single request?

The workload I was using was multiple rpc.nfsd threads issuing 64 KB requests (through the writev/readv interface) to the PVFS2 kernel module (and then to client-core and so on). To make things easy, I bet using iozone with multiple threads and a random workload would simulate this workload quite well. What I was noticing is that although we haven't reached disk, cpu, or network limits, the I/O throughput is fixed at some low value.

One test Sam and I tried was to increase the number of kernel mmapped buffers. Instead of five 4MB buffers, we used sixty-four 128KB buffers. This reduced performance considerably, especially read performance. Since we are using 64KB requests, this should not be an issue, but it was. One thing we didn't get a chance to try was whether the reduced performance was because of the increase in the number of buffers or the reduction in their size. My guess would be the increase, but why would this be?

Beyond inefficient coding issues, Sam and I talked about where the bottleneck could be from a design standpoint. We came up with the following list:

0) kmapping and copying data is going as fast as possible

1) Sending messages through the pvfs2-req device can only happen at a constant rate.

2) client-core reading messages off the pvfs2-req device (should no longer be an issue with the --threaded option, but maybe reading 5 at a time is still inefficient)

3) A single BMI thread issuing I/O requests. Are multiple threads necessary to issue the multiple I/O requests from the kernel?

Can anyone think of other parts of the I/O path that might be a bottleneck? So far, we have only started to investigate items 1 and 2.


Thanks for everyone's help.
Dean

PINT_dev_test_unexpected takes an incount of 5, so it's only going to read at most 5 requests off the device for each call. Once it returns, each of the unexpected requests is added to the completed jobs array and then we signal the jobs completed condition variable _for each request_. It seems like this will be 5x the number of context switches between the device thread and the main thread that we need.
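To illustrate the signaling point above, here is a minimal sketch in C. It assumes a hypothetical completion queue: struct completion_queue and the cq_* functions are invented for illustration and are not the PVFS2 job code. It compares signaling the condition variable once per request with signaling once per batch, and counts the signals issued so the two can be compared.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 64

/* Hypothetical completion queue; not the actual PVFS2 job structures. */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             items[QUEUE_CAP];
    size_t          count;    /* number of queued items */
    size_t          signals;  /* condition signals issued so far */
};

void cq_init(struct completion_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->ready, NULL);
    q->count = 0;
    q->signals = 0;
}

/* Per-request posting: one lock/signal round trip for every request. */
void cq_post_each(struct completion_queue *q, const int *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&q->lock);
        if (q->count < QUEUE_CAP)
            q->items[q->count++] = reqs[i];
        pthread_cond_signal(&q->ready);
        q->signals++;
        pthread_mutex_unlock(&q->lock);
    }
}

/* Batched posting: queue every request under one lock, then signal once. */
void cq_post_batch(struct completion_queue *q, const int *reqs, size_t n)
{
    size_t added = 0;

    pthread_mutex_lock(&q->lock);
    for (size_t i = 0; i < n && q->count < QUEUE_CAP; i++) {
        q->items[q->count++] = reqs[i];
        added++;
    }
    if (added > 0) {
        pthread_cond_signal(&q->ready);
        q->signals++;
    }
    pthread_mutex_unlock(&q->lock);
}
```

With batched posting, the consumer has to drain every queued item after each wakeup rather than assume one item per signal.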
Also, we poll every time before reading another request off the device. What about trying to read a number of requests off the device at once with one read (or possibly a readv so we can keep separate buffers per request).
Hmm.. both of these are good points. I had dabbled with doing a readv a while back. It might make a difference, although I suspect this might be in the noise region, since if there are requests to be serviced, poll() will only take the time of a syscall, which should be pretty fast these days.. but worth a shot.
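As a rough sketch of the readv idea, the C fragment below reads several fixed-size records into separate per-request buffers with one system call. It rests on assumptions not established in this thread: that requests are fixed-size (REQ_SIZE is a made-up value) and that the file descriptor hands back more than one request per read. Whether the pvfs2-req device does so depends on the kernel module. read_request_batch is a hypothetical helper, not a PVFS2 function.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define REQ_SIZE  16  /* hypothetical fixed request size in bytes */
#define MAX_BATCH 5   /* matches the incount of 5 discussed above */

/*
 * Read up to max_reqs fixed-size requests with a single readv() call,
 * one buffer per request. Returns the number of complete requests read,
 * or -1 on error.
 */
int read_request_batch(int fd, char bufs[][REQ_SIZE], int max_reqs)
{
    struct iovec iov[MAX_BATCH];

    if (max_reqs > MAX_BATCH)
        max_reqs = MAX_BATCH;
    if (max_reqs <= 0)
        return 0;

    for (int i = 0; i < max_reqs; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = REQ_SIZE;
    }

    ssize_t n = readv(fd, iov, max_reqs);
    if (n < 0)
        return -1;

    /* Only whole records count; a trailing partial record is ignored. */
    return (int)(n / REQ_SIZE);
}
```

On a pipe or socket this fills the buffers in order, so three queued records come back as three filled buffers from one call.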
Also, it looks like we do a malloc for each new request buffer, and then a free once we're done with it, and a memset of the info struct. It seems like we could manage the buffers on the stack instead of the heap, and save on a few system calls there.
Now we are definitely in the noise region.. :) just kidding. glibc's malloc implementation should typically amortize overheads in invoking system calls (sbrk etc).
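For reference, one way to avoid a malloc/free pair per request is a small fixed pool of preallocated buffers, sketched below in C. The names (struct req_pool, pool_acquire, pool_release) and sizes are hypothetical and do not come from the PVFS2 source. As noted above, glibc's malloc already amortizes most of the system-call cost, so any gain from this may well be in the noise.

```c
#include <stddef.h>

#define POOL_SLOTS 5     /* one slot per request read in a batch */
#define SLOT_SIZE  4096  /* hypothetical request buffer size */

/* Fixed pool of request buffers; can live on the stack or in static storage. */
struct req_pool {
    char buf[POOL_SLOTS][SLOT_SIZE];
    int  in_use[POOL_SLOTS];
};

void pool_init(struct req_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        p->in_use[i] = 0;
}

/* Return a free buffer, or NULL if every slot is taken. */
char *pool_acquire(struct req_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return p->buf[i];
        }
    }
    return NULL;
}

/* Mark a previously acquired buffer as free again. */
void pool_release(struct req_pool *p, char *b)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (p->buf[i] == b) {
            p->in_use[i] = 0;
            return;
        }
    }
}
```

This version is not thread-safe; sharing the pool between the device thread and the main thread would need a lock around acquire and release.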
For both threaded and nonthreaded, with the workload that Dean is using, he found that the PINT_dev_test_unexpected always returned 5 requests in the outcount. So it looks like there are always requests sitting on the device, waiting to be read by client-core. Are we just not able to process requests fast enough through BMI and the state machines, or is the cost of polling and signaling every time we read a request off the device slowing us down? In other words, does it make sense to rework the code a little bit or will we just get bottlenecked elsewhere?
It is definitely interesting to try all this out, but I am not sure if the bottlenecks are here or elsewhere.
What does this workload do by the way?

thanks,
Murali
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


--
Dean Hildebrand
Ph.D. Candidate
University of Michigan
