Re: [Pvfs2-developers] threaded client-core and the device thread

Sam Lang Mon, 23 Oct 2006 17:19:44 -0700


On Oct 16, 2006, at 8:37 AM, Phil Carns wrote:

Sam Lang wrote:
Hi All,

Dean and I are looking at trying to push the efficiency of requests from the kernel module up through the device to client-core. I added the --threaded option to the client to allow the client-core to run with multiple threads (one each for bmi, dev, and main -- and also a remount thread, but let's ignore that for now), so the device thread should be able to keep pulling requests off the device without having to wait for bmi operations to complete.

I noticed a couple things with the device thread that I wanted to ask about.

PINT_dev_test_unexpected takes an incount of 5, so it's only going to read at most 5 requests off the device for each call. Once it returns, each of the unexpected requests is added to the completed jobs array and then we signal the jobs completed condition variable _for each request_. It seems like this will be 5x the number of context switches between the device thread and the main thread that we need.

Also, we poll every time before reading another request off the device. What about trying to read a number of requests off the device at once with one read (or possibly a readv so we can keep separate buffers per request)?

Also, it looks like we do a malloc for each new request buffer, and then a free once we're done with it, and a memset of the info struct. It seems like we could manage the buffers on the stack instead of the heap, and save on a few system calls there.

For both threaded and nonthreaded, with the workload that Dean is using, he found that the PINT_dev_test_unexpected always returned 5 requests in the outcount. So it looks like there are always requests sitting on the device, waiting to be read by client-core. Are we just not able to process requests fast enough through BMI and the state machines, or is the cost of polling and signaling every time we read a request off the device slowing us down? In other words, does it make sense to rework the code a little bit or will we just get bottlenecked elsewhere?
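
For illustration, the "one readv, separate stack buffers per request" idea could look roughly like the sketch below. It assumes fixed-size requests and a descriptor that hands back several queued requests in a single read; whether the pvfs2 device behaves that way is not established in this thread. REQ_SIZE, MAX_BATCH, and read_request_batch are hypothetical names, not part of the PVFS2 source.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define REQ_SIZE  16  /* hypothetical fixed request size, in bytes */
#define MAX_BATCH 8   /* hypothetical upper bound on requests per call */

/*
 * Read up to `max` fixed-size requests from `fd` with a single readv()
 * call, scattering them into caller-provided buffers (which may live on
 * the caller's stack, so no malloc/free per request).
 * Returns the number of complete requests read, or -1 on error.
 * Assumption: the descriptor can return several queued requests at once.
 */
static int read_request_batch(int fd, char bufs[][REQ_SIZE], int max)
{
    struct iovec iov[MAX_BATCH];
    ssize_t nbytes;
    int i;

    assert(bufs != NULL);
    if (max <= 0)
        return 0;
    if (max > MAX_BATCH)
        max = MAX_BATCH;

    /* One iovec entry per request buffer. */
    for (i = 0; i < max; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = REQ_SIZE;
    }

    nbytes = readv(fd, iov, max);
    if (nbytes < 0)
        return -1;

    /* Only count requests that arrived in full. */
    return (int)(nbytes / REQ_SIZE);
}
```

A caller would declare `char bufs[MAX_BATCH][REQ_SIZE];` on its stack and pass it in, which replaces both the per-request poll/read pair and the per-request heap allocation.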
I am just speculating, but out of the things you list I would guess that these two things would be most likely to show improvement without much coding effort:
- increasing the testcount to something higher than 5 (since it sounds like that is getting maxed out for this workload)
- fixing the "signalling on every request" problem
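
The signalling fix can be sketched as follows. This is not the client-core job code: struct completion_queue, cq_post_each, cq_post_batch, and cq_drain are hypothetical stand-ins, and signals_sent exists only to make the difference countable. The first function mirrors the pattern described above (one signal per request); the second appends the whole batch under one lock and signals once.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CQ_CAPACITY 64

/* Hypothetical completion queue shared by the device and main threads. */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             items[CQ_CAPACITY];
    size_t          count;
    unsigned long   signals_sent;   /* instrumentation for this sketch only */
};

static void cq_init(struct completion_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->ready, NULL);
    q->count = 0;
    q->signals_sent = 0;
}

/* Pattern described in the thread: lock, append, signal for every request. */
static void cq_post_each(struct completion_queue *q, const int *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&q->lock);
        assert(q->count < CQ_CAPACITY);
        q->items[q->count++] = reqs[i];
        q->signals_sent++;
        pthread_cond_signal(&q->ready);
        pthread_mutex_unlock(&q->lock);
    }
}

/* Batched alternative: append everything, then signal once. */
static void cq_post_batch(struct completion_queue *q, const int *reqs, size_t n)
{
    if (n == 0)
        return;
    pthread_mutex_lock(&q->lock);
    assert(q->count + n <= CQ_CAPACITY);
    for (size_t i = 0; i < n; i++)
        q->items[q->count++] = reqs[i];
    q->signals_sent++;
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: wait until something is queued, then take all of it.
 * `out` must have room for CQ_CAPACITY entries. */
static size_t cq_drain(struct completion_queue *q, int *out)
{
    size_t n;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->ready, &q->lock);
    n = q->count;
    for (size_t i = 0; i < n; i++)
        out[i] = q->items[i];
    q->count = 0;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

Batching only helps if the consumer drains every queued entry per wakeup, as cq_drain does; a consumer that takes one entry per signal would leave requests stranded.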

The need for multiple reads and the mallocs could be a problem, but I am with Murali in that I think problems in this area are more likely related to inefficient threading or I/O stalls rather than CPU or memory overhead.

I ran pvfs2-client-core in valgrind, and then ran Bonnie++ a few times (10) on the mounted pvfs volume, and noticed the following when I stopped the client process:

==20132== malloc/free: 1,298,824 allocs, 1,297,888 frees, 3,462,517,583 bytes allocated.

Allocating and freeing 3.5GB seemed extreme, so I went exploring. It turns out that every time we allocate a PINT_client_sm, we're allocating about 35KB:


(gdb) p sizeof(struct PINT_client_sm)
$4 = 37764

The problem is that we allocate a PINT_client_sm every time a new request is posted. Most of the memory is from the u.lookup field:


(gdb) p sizeof(struct PINT_client_lookup_sm)
$3 = 36196

PINT_client_lookup_sm has a static array of 8 PINT_client_lookup_sm_ctx, which itself has a static array of 40 PINT_client_lookup_sm_segment, which are each about 112 bytes. Anyway, it ends up accumulating.
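
To check that arithmetic, here is a small sketch with stand-in structs. The standin_* types are hypothetical and only reproduce the nesting described above (8 contexts, each holding 40 segments of 112 bytes); the real PVFS2 structs have other fields.

```c
#include <assert.h>
#include <stddef.h>

#define STANDIN_CTX_COUNT     8
#define STANDIN_SEGS_PER_CTX  40
#define STANDIN_SEGMENT_BYTES 112  /* "about 112 bytes" per segment */

/* Stand-ins that mirror only the nesting of the static arrays. */
struct standin_segment   { char bytes[STANDIN_SEGMENT_BYTES]; };
struct standin_ctx       { struct standin_segment segments[STANDIN_SEGS_PER_CTX]; };
struct standin_lookup_sm { struct standin_ctx contexts[STANDIN_CTX_COUNT]; };

/* 8 * 40 * 112 = 35840 bytes held by the nested arrays alone. */
static_assert(sizeof(struct standin_lookup_sm) == 8 * 40 * 112,
              "nested static arrays account for 35840 bytes");

static size_t standin_lookup_bytes(void)
{
    return sizeof(struct standin_lookup_sm);
}
```

The nested arrays alone come to 35,840 bytes, which is within 356 bytes of the 36,196 that gdb reports for PINT_client_lookup_sm, so the segment arrays account for nearly all of it.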

So I'm convinced at this point that this is beyond the noise range, plus it's just cruft that we don't need. I'd like to swap out those static arrays for dynamic allocation when we get to the start of the lookup state machine. Any thoughts or suggestions?
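
A minimal sketch of what that swap might look like, assuming the context and segment counts are known when the lookup state machine starts. All sketch_* names are hypothetical and this is not a patch against the real structs: the embedded arrays become pointers, so the per-request struct carries a pointer and a count, and only lookups pay for the segment storage.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real PVFS2 structs carry more fields. */
struct sketch_segment { char opaque[112]; };

struct sketch_lookup_ctx {
    struct sketch_segment *segments;  /* was: a static array of 40 */
    size_t segment_count;
};

struct sketch_lookup_sm {
    struct sketch_lookup_ctx *contexts;  /* was: a static array of 8 */
    size_t context_count;
};

/* Free everything owned by the lookup state; safe after a partial init. */
static void sketch_lookup_release(struct sketch_lookup_sm *sm)
{
    if (sm->contexts != NULL) {
        for (size_t i = 0; i < sm->context_count; i++)
            free(sm->contexts[i].segments);  /* free(NULL) is a no-op */
        free(sm->contexts);
    }
    sm->contexts = NULL;
    sm->context_count = 0;
}

/* Would run on entry to the lookup state machine.
 * Returns 0 on success, -1 if an allocation fails. */
static int sketch_lookup_init(struct sketch_lookup_sm *sm,
                              size_t context_count,
                              size_t segments_per_ctx)
{
    assert(sm != NULL);
    sm->contexts = calloc(context_count, sizeof(*sm->contexts));
    sm->context_count = 0;
    if (sm->contexts == NULL)
        return -1;
    sm->context_count = context_count;

    for (size_t i = 0; i < context_count; i++) {
        sm->contexts[i].segments =
            calloc(segments_per_ctx, sizeof(struct sketch_segment));
        if (sm->contexts[i].segments == NULL) {
            sketch_lookup_release(sm);
            return -1;
        }
        sm->contexts[i].segment_count = segments_per_ctx;
    }
    return 0;
}
```

The trade-off is extra allocations on the lookup path and a release call that has to run on every exit from the state machine, in exchange for a much smaller allocation on every other operation.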


-sam

-Phil


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
