When I ported the old lock API to the new one, I brought over the code for handling nested locks. While profiling the routines, I noticed that the nested-lock handling adds huge overhead for something that is seldom used, necessary, or even desired. Here are the numbers I'm seeing from the testlockperf test:
With nested-lock support (now): 1252330 usec
Without (#ifdef'd out):          595473 usec

Granted, this is an artificial test, but it does give a good measurement of the amount of _overhead_ in the mutex calls. If we can remove more than half of the overhead from the thread_mutex calls, we can potentially reduce lock contention dramatically for heavily-loaded servers and hot critical paths in the code. I suspect this will have a huge impact on multiprocessor servers, where lock contention can effectively starve the other CPUs.

In many cases the underlying library can already do nested locking if requested. I propose we simply require that this capability be requested through an attribute flag in the lock initialization routine, so that code not needing the nested capability can benefit from a faster lock/unlock cycle.

-aaron
