Sam Lang wrote:
Hi All,
Dean and I are looking at trying to push the efficiency of requests
from the kernel module up through the device to client-core. I added
the --threaded option to the client to allow the client-core to run
with multiple threads (one each for bmi, dev, and main -- and also a
remount thread, but let's ignore that for now), so the device thread
should be able to keep pulling requests off the device without having to
wait for bmi operations to complete.
I noticed a couple things with the device thread that I wanted to ask
about.
PINT_dev_test_unexpected takes an incount of 5, so it's only going to
read at most 5 requests off the device for each call. Once it returns,
each of the unexpected requests is added to the completed jobs array
and then we signal the jobs completed condition variable _for each
request_. It seems like this will be 5x the number of context switches
between the device thread and the main thread that we need.
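
To make the signaling point concrete, here is a minimal sketch of adding a whole batch to a completed array under one lock and signaling the condition variable once per batch instead of once per request. The names (completion_queue, cq_push_batch, cq_drain, MAX_COMPLETED) are hypothetical stand-ins, not the actual job code, and the signals counter exists only to show how many wakeups were issued.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPLETED 64 /* hypothetical capacity of the completed array */

/* Hypothetical stand-in for the completed-jobs array and its condition. */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             jobs[MAX_COMPLETED];
    size_t          count;
    unsigned long   signals; /* instrumentation only: signals issued */
};

void cq_init(struct completion_queue *q)
{
    assert(q != NULL);
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cond, NULL);
    q->count = 0;
    q->signals = 0;
}

/* Producer (device thread): append up to n ids, then signal once.
 * Returns how many ids fit into the array. */
size_t cq_push_batch(struct completion_queue *q, const int *ids, size_t n)
{
    size_t added = 0;

    pthread_mutex_lock(&q->lock);
    while (added < n && q->count < MAX_COMPLETED)
        q->jobs[q->count++] = ids[added++];
    if (added > 0) {
        /* One wakeup for the whole batch, not one per request. */
        pthread_cond_signal(&q->cond);
        q->signals++;
    }
    pthread_mutex_unlock(&q->lock);
    return added;
}

/* Consumer (main thread): block until something is queued, then take
 * up to max entries in one pass. Returns how many were copied out. */
size_t cq_drain(struct completion_queue *q, int *out, size_t max)
{
    size_t n;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->cond, &q->lock);
    n = q->count < max ? q->count : max;
    memcpy(out, q->jobs, n * sizeof(int));
    memmove(q->jobs, q->jobs + n, (q->count - n) * sizeof(int));
    q->count -= n;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

With this shape, pushing 5 requests costs one signal, and the consumer picks up everything that is queued each time it wakes. Whether that actually reduces context switches in client-core would still need to be measured.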
Also, we poll every time before reading another request off the
device. What about trying to read a number of requests off the device
at once with one read (or possibly a readv so we can keep separate
buffers per request).
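
As a rough illustration of the readv idea, the sketch below fills several per-request buffers with a single readv() call. It assumes fixed-size requests and a descriptor that hands back more than one request per read. Whether the pvfs2 device actually behaves that way is an assumption here, and the kernel module may need changes to support it. REQ_SIZE, MAX_BATCH, struct req_buf and read_request_batch are hypothetical names.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define REQ_SIZE  64 /* hypothetical fixed size of one request */
#define MAX_BATCH 16 /* hypothetical upper bound on requests per call */

struct req_buf {
    char data[REQ_SIZE];
};

/* Read up to nbufs requests with one system call, one buffer each.
 * Returns the number of complete requests read, or -1 on error. */
int read_request_batch(int fd, struct req_buf *bufs, int nbufs)
{
    struct iovec iov[MAX_BATCH];
    ssize_t got;
    int i;

    if (nbufs > MAX_BATCH)
        nbufs = MAX_BATCH;
    if (nbufs <= 0)
        return 0;

    for (i = 0; i < nbufs; i++) {
        iov[i].iov_base = bufs[i].data;
        iov[i].iov_len  = REQ_SIZE;
    }

    /* readv fills the buffers in order, so request k lands in bufs[k]. */
    got = readv(fd, iov, nbufs);
    if (got < 0)
        return -1;

    /* A trailing partial request is not counted in this sketch. */
    return (int)(got / REQ_SIZE);
}
```

The caller still polls once, but then pays one system call per batch rather than one per request.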
Also, it looks like we do a malloc for each new request buffer, and
then a free once we're done with it, and a memset of the info struct.
It seems like we could manage the buffers on the stack instead of the
heap, and save the allocator overhead (and any system calls it triggers) there.
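
One way to avoid the per-request malloc/free is sketched below. It uses a small preallocated pool rather than literally the stack, on the assumption that in the threaded case a buffer may outlive the function that read it. The pool can itself live on the stack or in static storage. All names (buf_pool, pool_get, pool_put, POOL_SLOTS, SLOT_SIZE) are hypothetical, and the pool is not thread-safe as written, so it would need a lock if buffers are released from another thread.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 8   /* hypothetical number of in-flight buffers */
#define SLOT_SIZE  256 /* hypothetical maximum request size */

struct buf_pool {
    char slots[POOL_SLOTS][SLOT_SIZE];
    int  free_idx[POOL_SLOTS]; /* stack of free slot indices */
    int  nfree;
};

void pool_init(struct buf_pool *p)
{
    int i;

    for (i = 0; i < POOL_SLOTS; i++)
        p->free_idx[i] = i;
    p->nfree = POOL_SLOTS;
}

/* Hand out a free slot, or NULL when every slot is in use. */
void *pool_get(struct buf_pool *p)
{
    if (p->nfree == 0)
        return NULL;
    return p->slots[p->free_idx[--p->nfree]];
}

/* Return a slot previously obtained from pool_get. */
void pool_put(struct buf_pool *p, void *buf)
{
    ptrdiff_t off = (char *)buf - &p->slots[0][0];

    /* The pointer must be the start of one of this pool's slots. */
    assert(off >= 0 && off < (ptrdiff_t)(POOL_SLOTS * SLOT_SIZE));
    assert(off % SLOT_SIZE == 0);
    assert(p->nfree < POOL_SLOTS);
    p->free_idx[p->nfree++] = (int)(off / SLOT_SIZE);
}
```

When pool_get returns NULL the caller could fall back to malloc or simply stop reading until a buffer is released. Which is better depends on how bursty the workload is.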
For both threaded and nonthreaded, with the workload that Dean is
using, he found that the PINT_dev_test_unexpected always returned 5
requests in the outcount. So it looks like there are always requests
sitting on the device, waiting to be read by client-core. Are we just
not able to process requests fast enough through BMI and the state
machines, or is the cost of polling and signaling every time we read a
request off the device slowing us down? In other words, does it make
sense to rework the code a little bit or will we just get bottlenecked
elsewhere?
I am just speculating, but out of the things you list I would guess that
these two things would be most likely to show improvement without much
coding effort:
- increasing the testcount to something higher than 5 (since it sounds
like that is getting maxed out for this workload)
- fixing the "signalling on every request" problem
The need for multiple reads and the mallocs could be a problem, but I am
with Murali in that I think problems in this area are more likely
related to inefficient threading or I/O stalls rather than CPU or memory
overhead.
-Phil