Re: b_freelist TAILQ/SLIST

Alexander Motin Fri, 28 Jun 2013 15:15:34 -0700

On 28.06.2013 09:57, Konstantin Belousov wrote:

On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:

While doing some profiles of GEOM/CAM IOPS scalability, on some test
patterns I've noticed serious congestion with spinning on global
pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is
already very simple, I've tried to optimize probably the only thing
possible there: switch bswlist from TAILQ to SLIST. As I can see,
b_freelist field of struct buf is really used as TAILQ in some other
places, so I've just added another SLIST_ENTRY field. And result
appeared to be surprising -- I can no longer reproduce the issue at all.
Maybe it was just unlucky synchronization of a specific test, but I've
seen it on two different systems and rechecked results with/without
patch three times.

This is too unbelievable.  Could it be, e.g., some cache line conflicts
which cause the thrashing, in fact?

I think it may indeed be cache thrashing. I've done some profiling of
getpbuf()/relpbuf() and found interesting results. With the patched
kernel using SLIST, profiling shows mostly one point of
RESOURCE_STALLS.ANY in relpbuf() -- the first lock acquisition causes
78% of them. Later memory accesses, including the lock release, hit the
same cache line and are almost free. With the "clean" kernel using TAILQ
I see RESOURCE_STALLS.ANY spread almost equally between lock
acquisition, bswlist access and lock release. It looks like the cache
line is constantly being invalidated by something.

My guess was that the patch somehow changed cache line sharing. But
several checks with nm showed that, while memory allocation did change
slightly, in both cases the content of the cache line in question is
exactly the same, just shifted in memory by 128 bytes.
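The nm check above comes down to simple address arithmetic. A minimal sketch, assuming a power-of-two cache line size; the helper names are hypothetical, and any addresses passed in would be values read from nm output, not anything taken from this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the cache line that contains addr (line_size: power of two). */
static uint64_t line_index(uint64_t addr, uint64_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    return addr / line_size;
}

/* Do two symbol addresses fall into the same cache line? */
static bool same_line(uint64_t a, uint64_t b, uint64_t line_size)
{
    return line_index(a, line_size) == line_index(b, line_size);
}

/*
 * If every symbol moves by the same delta and delta is a multiple of the
 * line size, the grouping of symbols into cache lines cannot change.
 */
static bool shift_preserves_sharing(uint64_t delta, uint64_t line_size)
{
    return delta % line_size == 0;
}
```

A uniform 128-byte shift is a multiple of both 64 and 128, so for either line size it leaves the set of symbols sharing a line with the lock unchanged, which is consistent with the nm observation.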

I guess the cache line could be thrashed by threads doing adaptive
spinning on the lock after a collision has happened. That thrashing
increases lock hold time and increases the chance of additional
collisions even more. Maybe the switch from TAILQ to SLIST slightly
reduces lock hold time, reducing the chance of a cumulative effect. The
difference is not big, but in this test this global lock is acquired
1.5M times per second by 256 threads on 24 CPUs (12xL2 and 2xL3 caches).
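To illustrate why the list type can affect lock hold time, here is a userland sketch of a mutex-protected free list in both variants, using the <sys/queue.h> macros. It is not the kernel code: struct pbuf, pool_mtx and the tq_*/sl_* functions are hypothetical stand-ins for struct buf, pbuf_mtx and getpbuf()/relpbuf().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct buf; only the list linkage matters. */
struct pbuf {
    int id;
    TAILQ_ENTRY(pbuf) tq_link;  /* two pointers: next and prev */
    SLIST_ENTRY(pbuf) sl_link;  /* one pointer: next */
};

TAILQ_HEAD(pbuf_tailq, pbuf);
SLIST_HEAD(pbuf_slist, pbuf);

static pthread_mutex_t pool_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct pbuf_tailq tq_free = TAILQ_HEAD_INITIALIZER(tq_free);
static struct pbuf_slist sl_free = SLIST_HEAD_INITIALIZER(sl_free);

/* TAILQ: inserting also writes the old first element's prev pointer. */
static void tq_put(struct pbuf *bp)
{
    assert(bp != NULL);
    pthread_mutex_lock(&pool_mtx);
    TAILQ_INSERT_HEAD(&tq_free, bp, tq_link);
    pthread_mutex_unlock(&pool_mtx);
}

/* TAILQ: removing also writes the next element's prev pointer. */
static struct pbuf *tq_get(void)
{
    struct pbuf *bp;

    pthread_mutex_lock(&pool_mtx);
    bp = TAILQ_FIRST(&tq_free);
    if (bp != NULL)
        TAILQ_REMOVE(&tq_free, bp, tq_link);
    pthread_mutex_unlock(&pool_mtx);
    return bp;
}

/* SLIST: inserting writes only the head and the new element. */
static void sl_put(struct pbuf *bp)
{
    assert(bp != NULL);
    pthread_mutex_lock(&pool_mtx);
    SLIST_INSERT_HEAD(&sl_free, bp, sl_link);
    pthread_mutex_unlock(&pool_mtx);
}

/* SLIST: removing reads the first element and writes only the head. */
static struct pbuf *sl_get(void)
{
    struct pbuf *bp;

    pthread_mutex_lock(&pool_mtx);
    bp = SLIST_FIRST(&sl_free);
    if (bp != NULL)
        SLIST_REMOVE_HEAD(&sl_free, sl_link);
    pthread_mutex_unlock(&pool_mtx);
    return bp;
}
```

Both variants behave as a LIFO free list. The SLIST version writes to fewer memory locations while the lock is held, since it never updates a neighbouring element, which is one plausible reason the hold time could shrink slightly.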

Another guess was that we have some bad case of false cache line
sharing, but I don't know how that can be either checked or avoided.

At the last moment, mostly for luck, I tried switching pbuf_mtx from
mtx to mtx_padalign on the "clean" kernel. To my surprise, that also
seems to have fixed the congestion problem, but I can't explain why.
RESOURCE_STALLS.ANY still shows there is cache thrashing, but the lock
spinning has gone.
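The idea behind a padded, aligned lock can be sketched in userland C11. This is only an illustration, not the kernel's mtx_padalign definition: CACHE_LINE is an assumed 64 bytes, and the struct and field names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption; the real line size is CPU-specific */

/* Unpadded layout: the list head may land in the same line as the lock. */
struct plain_globals {
    pthread_mutex_t mtx;
    void *free_head;
};

/* Lock aligned to a line boundary; sizeof is rounded up to whole lines. */
struct padded_lock {
    alignas(CACHE_LINE) pthread_mutex_t mtx;
};

/* Padded layout: lock and list head start on separate cache lines. */
struct padded_globals {
    struct padded_lock lock;
    alignas(CACHE_LINE) void *free_head;
};

/* Returns 1 if the lock and the list head of *g sit in different lines. */
static int lock_has_own_line(const struct padded_globals *g)
{
    uintptr_t lock_line = (uintptr_t)&g->lock.mtx / CACHE_LINE;
    uintptr_t head_line = (uintptr_t)&g->free_head / CACHE_LINE;

    assert(g != NULL);
    return lock_line != head_line;
}
```

With this layout, threads spinning on the lock word no longer contend for the line that holds the data the lock owner is modifying. Whether that is what happens in the kernel case above is not established by this sketch.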


Any ideas about what is going on there?

--
Alexander Motin
_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[email protected]"
