On 9/2/15 4:59 PM, Malahal Naineni wrote:
> Matt Benjamin [mbenja...@redhat.com] wrote:
>> Hi Bill,
>>
>> There has not been griping. Malahal has done performance measurement.
>>
>> IIUC (per IRC) Malahal:
>>
>> 1. has empirical evidence that moving the current Ganesha -dispatch queue-
>> bands into lanes measurably improves throughput when the number of worker
>> threads is large (apparently, 64 and (!) 256)
>
> Matt, correct! "perf record" showed time spent in the pthread_mutex_lock
> and unlock routines there, and having 256 lanes did help with some
> performance.
>
Looking forward to the patch. Can probably merge it with mine.
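
For illustration, the per-lane locking could look roughly like this (untested
sketch, made-up names, not the actual Ganesha req_q code):

/* Untested sketch: one mutex per lane instead of one global dispatch-queue
 * mutex, so 256 workers contend on many locks instead of one. */
#include <pthread.h>
#include <stdint.h>
#include <sys/queue.h>

#define NUM_LANES 256                   /* e.g. sized to the worker count */

struct req_entry {
        TAILQ_ENTRY(req_entry) link;
        /* decoded request payload would live here */
};

struct req_lane {
        pthread_mutex_t mtx;
        TAILQ_HEAD(, req_entry) q;
};

static struct req_lane lanes[NUM_LANES];
static uint32_t lane_cursor;            /* round-robin lane selector */

static void lanes_init(void)
{
        for (int i = 0; i < NUM_LANES; i++) {
                pthread_mutex_init(&lanes[i].mtx, NULL);
                TAILQ_INIT(&lanes[i].q);
        }
}

static void enqueue_req(struct req_entry *e)
{
        /* round-robin; a hash of the transport fd would also do */
        uint32_t n = __atomic_fetch_add(&lane_cursor, 1, __ATOMIC_RELAXED);
        struct req_lane *lane = &lanes[n % NUM_LANES];

        pthread_mutex_lock(&lane->mtx);
        TAILQ_INSERT_TAIL(&lane->q, e, link);
        pthread_mutex_unlock(&lane->mtx);
}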

>> That is expected, so I'm looking to Malahal to send a change (and some other
>> tuning changes) for review.
>>
>> 2. Malahal indicated he found a bug of some kind in the thread fridge,
>> provoked by the shutdown check in the current dispatch queue code, which he
>> says he fixed, so if he hasn't already sent a change, I'm expecting to see
>> one soon which addresses this.
>
> I didn't find any bug, but each worker thread calling
> fridgethr_you_should_break() repeatedly prior to taking each request
> showed too much time spent in pthread_mutex_lock. I just removed the
> lock/unlock around the transitioning field check.
>
Yeah, assuming you changed to atomic? Or was it idempotent? There are too
many locks around setting 1 variable.

>> 3. Malahal described in IRC an additional change he made to split the ntirpc
>> output ioq into lanes, and believed he saw improvement (as of ~2 weeks ago),
>> but was still benchmarking in order to split out the impact of this change
>> relative to others.
>
> I did this as well, but the gains seemed marginal at best!
>
As I'd suspected. My patch there will eventually eliminate the lock entirely.

> We were desperate and made all kinds of changes here and there. The
> system is busy now. Once I get the system back, I will validate the
> individual patches and post them upstream.
>
> perf record shows that too much time is spent in malloc/free functions.
> The reported functions are alloc_nfs_request, alloc_nfs_res, and a few
> objects in the src/xdr_ioq.c file. alloc_nfs_res seems thread specific, so
> it could be allocated once per thread. If we can make the other pools
> lockless (instead of malloc), that would be great!
>
Oh, yeah! We know about this! (It's not really a pool, just a spot where a
pool would be good, so Adam wrapped some pool logic around it.) Those 3 can
be combined into 1. But there's a spot of code that changes the pointer to
the msg; that needs to be worked out.

I've already reduced the number of malloc/frees per request by at least 1/3
earlier in this dev cycle. Various auth arrays are pre-allocated, appended
onto the request message. Worker stubs are pre-allocated into the contexts
they'll be queuing.
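
Back on the transitioning check, for illustration, "changed to atomic" would
look roughly like this (untested sketch, made-up names, not the actual
fridgethr code):

/* Untested sketch: the flag is polled by every worker before each request,
 * so read it with an atomic load instead of taking the fridge mutex; the
 * state-changing (shutdown) path can keep its lock. */
#include <stdbool.h>
#include <stdint.h>

static uint32_t fr_transitioning;       /* written rarely, read per request */

static inline bool should_break(void)
{
        /* GCC/Clang builtin; relaxed is enough for a flag that is only
         * polled */
        return __atomic_load_n(&fr_transitioning, __ATOMIC_RELAXED) != 0;
}

static void begin_transition(void)
{
        __atomic_store_n(&fr_transitioning, 1, __ATOMIC_RELEASE);
}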
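
And for illustration of combining those 3 into 1, the shape could be roughly
this (untested sketch, placeholder type names standing in for the real
request/result/ioq structures):

/* Untested sketch with placeholder types: fold the request, the result,
 * and the ioq bookkeeping into one allocation, so each RPC costs one
 * malloc/free instead of three. */
#include <stdlib.h>

struct req_blob { char pad[256]; };     /* stands in for the request object */
struct res_blob { char pad[128]; };     /* stands in for the nfs_res object */
struct ioq_blob { char pad[64]; };      /* stands in for the xdr_ioq bits */

struct nfs_bundle {
        struct req_blob req;
        struct res_blob res;
        struct ioq_blob ioq;    /* the code that swaps the msg pointer
                                 * still has to be sorted out first */
};

static struct nfs_bundle *alloc_nfs_bundle(void)
{
        return calloc(1, sizeof(struct nfs_bundle));    /* caller checks NULL */
}

static void free_nfs_bundle(struct nfs_bundle *b)
{
        free(b);                /* one free instead of three */
}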