Re: [Pvfs2-developers] threaded client-core and the device thread

Sam Lang Fri, 13 Oct 2006 20:13:04 -0700


On Oct 13, 2006, at 10:00 PM, Murali Vilayannur wrote:

Hi Sam,
Dean and I are looking at trying to push the efficiency of requests from the kernel module up through the device to client-core. I added the --threaded option to the client to allow the client-core to run with multiple threads (one each for bmi, dev, and main -- and also a remount thread, but let's ignore that for now), so the device thread should be able to keep pulling requests off the device without having to wait for bmi operations to complete.
Cool!
This could address some of the performance problems that Phil had also pointed out a while back, where multiple outstanding requests were slower than a single outstanding request.

Well, it doesn't seem to make a difference, at least with the workloads that we were trying.

PINT_dev_test_unexpected takes an incount of 5, so it's only going to read at most 5 requests off the device for each call. Once it returns, each of the unexpected requests is added to the completed jobs array and then we signal the jobs completed condition variable _for each request_. It seems like this will be 5x the number of context switches between the device thread and the main thread that we need.
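To make the signaling point concrete, here is a minimal sketch of posting a whole batch of completions under one lock acquisition and signaling the condition variable once. The names `completion_queue`, `cq_post_batch`, and `cq_drain` are hypothetical stand-ins for illustration, not the actual PVFS2 job interface, and the `signals` counter exists only to show how many wakeups were issued.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPLETED 64

/* Hypothetical stand-in for the completed-jobs array (not the PVFS2 API). */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             jobs[MAX_COMPLETED]; /* completed job ids */
    size_t          count;
    unsigned long   signals;             /* instrumentation only */
};

void cq_init(struct completion_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->count = 0;
    q->signals = 0;
}

/* Producer (device thread): queue up to n ids, then signal once.
 * Returns how many ids were actually queued. */
size_t cq_post_batch(struct completion_queue *q, const int *ids, size_t n)
{
    size_t added = 0;

    assert(q != NULL && ids != NULL);
    pthread_mutex_lock(&q->lock);
    while (added < n && q->count < MAX_COMPLETED)
        q->jobs[q->count++] = ids[added++];
    if (added > 0) {
        /* One wakeup for the whole batch instead of one per request. */
        pthread_cond_signal(&q->nonempty);
        q->signals++;
    }
    pthread_mutex_unlock(&q->lock);
    return added;
}

/* Consumer (main thread): block until something is queued, then take
 * up to max entries in one pass. Returns how many were taken. */
size_t cq_drain(struct completion_queue *q, int *out, size_t max)
{
    size_t n = 0;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    while (n < max && n < q->count) {
        out[n] = q->jobs[n];
        n++;
    }
    /* Keep any entries that did not fit in the caller's array. */
    memmove(q->jobs, q->jobs + n, (q->count - n) * sizeof q->jobs[0]);
    q->count -= n;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

With an incount of 5, the device thread would call `cq_post_batch` once per return from the device test, so the main thread is woken at most once per batch. Whether that removes a real bottleneck is still the open question in this thread.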
Also, we poll every time before reading another request off the device. What about trying to read a number of requests off the device at once with one read (or possibly a readv so we can keep separate buffers per request)?
Hmm.. both of these are good points. I had dabbled with doing a readv a while back. It might make a difference, although I suspect this might be in the noise region, since if there are requests to be serviced, poll() will only take the time of a syscall, which should be pretty fast these days.. but worth a shot.
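For reference, a rough sketch of what the readv idea could look like: one call that scatters up to five fixed-size requests into separate buffers. `REQ_SIZE`, `MAX_BATCH`, and `read_request_batch` are made up for illustration, and this assumes the descriptor hands back whole fixed-size records on a vectored read. Whether the pvfs2 device does that depends on the kernel module, so treat it as an experiment rather than a description of current behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define REQ_SIZE  64   /* hypothetical fixed request size */
#define MAX_BATCH 5    /* mirrors the incount of 5 discussed above */

/* Read up to nbufs fixed-size requests with a single readv() call,
 * one buffer per request. Returns the number of complete requests
 * read, or -1 on error or bad arguments. */
int read_request_batch(int fd, char bufs[][REQ_SIZE], int nbufs)
{
    struct iovec iov[MAX_BATCH];
    ssize_t nread;
    int i;

    assert(bufs != NULL);
    if (nbufs < 1 || nbufs > MAX_BATCH)
        return -1;

    for (i = 0; i < nbufs; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = REQ_SIZE;
    }

    /* One system call fills the buffers in order. */
    nread = readv(fd, iov, nbufs);
    if (nread < 0)
        return -1;

    /* Only count requests that arrived in full. */
    return (int)(nread / REQ_SIZE);
}
```

The same function works against a pipe or socket for testing, which makes it easy to compare against the poll-then-read loop without touching the kernel module.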
Also, it looks like we do a malloc for each new request buffer, and then a free once we're done with it, and a memset of the info struct. It seems like we could manage the buffers on the stack instead of the heap, and save on a few system calls there.
Now we are definitely in the noise region.. :) just kidding. glibc's malloc implementation should typically amortize overheads in invoking system calls (sbrk etc).

Dean was seeing memset at the top of the list while running oprofile on pvfs2-client-core. malloc and free were also up there.
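One way to try the buffer idea without a malloc, free, and memset per request is a small fixed pool that the device thread owns and reuses. This is only a sketch: `req_pool`, `REQ_BUF_SIZE`, and the helper names are hypothetical, and the pool has to outlive any request handed on to the state machines, so it would live in the device thread's long-running frame or in static storage rather than in a function that returns.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS   5      /* one slot per request in a batch */
#define REQ_BUF_SIZE 4096   /* hypothetical request buffer size */

/* Fixed pool of request buffers; no heap allocation involved. */
struct req_pool {
    char bufs[POOL_SLOTS][REQ_BUF_SIZE];
    int  free_idx[POOL_SLOTS];   /* stack of free slot indices */
    int  nfree;
};

void pool_init(struct req_pool *p)
{
    int i;

    for (i = 0; i < POOL_SLOTS; i++)
        p->free_idx[i] = i;
    p->nfree = POOL_SLOTS;
}

/* Hand out a free buffer, or NULL if all slots are in use. The buffer
 * is not cleared: the read overwrites it, so a full memset is skipped. */
char *pool_get(struct req_pool *p)
{
    if (p->nfree == 0)
        return NULL;
    return p->bufs[p->free_idx[--p->nfree]];
}

/* Return a buffer previously obtained from pool_get(). */
void pool_put(struct req_pool *p, char *buf)
{
    ptrdiff_t slot = (buf - &p->bufs[0][0]) / REQ_BUF_SIZE;

    assert(slot >= 0 && slot < POOL_SLOTS);
    assert(p->nfree < POOL_SLOTS);
    p->free_idx[p->nfree++] = (int)slot;
}
```

If the pool runs dry, the caller can fall back to malloc or simply stop reading until a request completes, which also bounds how much work is in flight.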

For both threaded and nonthreaded, with the workload that Dean is using, he found that PINT_dev_test_unexpected always returned 5 requests in the outcount. So it looks like there are always requests sitting on the device, waiting to be read by client-core. Are we just not able to process requests fast enough through BMI and the state machines, or is the cost of polling and signaling every time we read a request off the device slowing us down? In other words, does it make sense to rework the code a little bit or will we just get bottlenecked elsewhere?
It is definitely interesting to try all this out, but I am not sure if the bottlenecks are here or elsewhere.
What does this workload do by the way?

If I understand it correctly, there are a number of threads doing simultaneous reads or writes (64K block sizes) on the same file.


-sam


thanks,
Murali


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
