Re: b_freelist TAILQ/SLIST
On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
> Hi.
>
> While doing some profiles of GEOM/CAM IOPS scalability, on some test patterns I've noticed serious congestion with spinning on the global pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is already very simple, I've tried to optimize probably the only thing possible there: switch bswlist from TAILQ to SLIST. As I can see, the b_freelist field of struct buf really is used as a TAILQ in some other places, so I've just added another SLIST_ENTRY field. And the result appeared to be surprising -- I can no longer reproduce the issue at all. Maybe it was just unlucky synchronization of the specific test, but I've seen it on two different systems and rechecked the results with/without the patch three times.

This is too unbelievable. Could it be, e.g., some cache line conflicts which cause the thrashing, in fact? Does it help if you add a void *b_pad before b_freelist instead of adding b_freeslist?

> The present patch is here: http://people.freebsd.org/~mav/buf_slist.patch
>
> The question is how to do it better? What is the KPI/KBI policy for struct buf? I could replace b_freelist with a union and keep the KBI, but partially break the KPI. Or I could add another field, probably breaking the KBI, but keeping the KPI. Or I could do something handmade with no breakage. Or is this change just a bad idea?

The same question applies to using a union for b_freelist/b_freeslist: is the effect of magically fixing the contention still there if b_freeslist is at the same offset as b_freelist?

There is no K{B,P}I policy for struct buf in HEAD, just change it as it fits.
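For readers without the patch at hand, here is a minimal, self-contained queue(3) sketch of the two free-list disciplines being compared. It is an illustration of the approach only, not the actual buf_slist.patch; the pbuf/link/slink names are made up to mirror the discussion.

	#include <sys/queue.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct pbuf {
		int			 id;
		TAILQ_ENTRY(pbuf)	 link;	/* doubly-linked hook (current style) */
		SLIST_ENTRY(pbuf)	 slink;	/* singly-linked hook (proposed style) */
	};

	static TAILQ_HEAD(, pbuf) tq_free = TAILQ_HEAD_INITIALIZER(tq_free);
	static SLIST_HEAD(, pbuf) sl_free = SLIST_HEAD_INITIALIZER(sl_free);

	int
	main(void)
	{
		struct pbuf *bp = malloc(sizeof(*bp));

		if (bp == NULL)
			return (1);
		bp->id = 1;

		/* TAILQ head insert also writes into the old first element (or tqh_last). */
		TAILQ_INSERT_HEAD(&tq_free, bp, link);
		TAILQ_REMOVE(&tq_free, bp, link);

		/* SLIST head insert writes only the new element and the list head. */
		SLIST_INSERT_HEAD(&sl_free, bp, slink);
		SLIST_REMOVE_HEAD(&sl_free, slink);

		printf("pbuf %d cycled through both free lists\n", bp->id);
		free(bp);
		return (0);
	}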
Re: expanding amd64 past the 1TB limit
On Thu, Jun 27, 2013 at 03:23:36PM -0600, Chris Torek wrote:
> OK, I wasted :-) way too much time, but here's a text file that can be comment-ified or stored somewhere alongside the code or whatever...

I think it would be a nice addition to the VM article in the doc collection. The content is correct and useful, IMO. See some notes below.

> The reason for having those empty (all-zero) PTE pages is that whenever new processes are created, in pmap_pinit(), they have their (new) L4 PTE page set up to point to the *same* physical pages that the kernel is using. Thus, if the kernel creates or destroys any level-3-or-below mapping by writing into any of the above four pages, that mapping is also created/destroyed in all processes. Similarly, the NDMPML4 pages starting at DMPDPphys are mapped identically in all processes. The kernel can therefore borrow a user pmap at any time, i.e., there's no need to adjust the CPU's CR4 on entry to the kernel.

It is %cr3.

> (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E entries into the new pmap, the L3 pages would not have to be physically contiguous, but the KVA ones would still all have to exist. It's free to allocate physically contiguous pages here anyway though.)

I do not see how the physical contiguity of the allocated page table pages is relevant there. Copying the L4 or L3 PTEs would cause serious complications. In fact, this is how i386 operates: due to the limited VA, its pmap does not have the luxury of allocating entire L(top-1) page table pages to the kernel. As a result, pmap_kenter_pde() has to iterate over the whole pmap list, which is maintained just to be able to update KVA mappings.

> So, the last NKPML4E slots in KPML4phys point to the following page tables, which use all of L3, L2, and L1 style PTEs. (Note that we did not need any L1 PTEs for the direct map, which always uses 2MB or 1GB super-pages.)

This is not quite true. In the initial state, indeed, all PTEs for the direct map are superpages, either 1G or 2M. But Intel states that a situation where a physical page has mappings with different caching modes causes undefined behaviour. As a result, if a page is remapped with a non-write-back caching attribute, the direct map has to demote the superpage and adjust the mapping attribute of the page frame for the page. As a result, it is possible to have small mappings in the direct map. See pmap_page_set_memattr(), which delegates the work to pmap_change_attr(_locked).
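To make the demotion path concrete, here is a small sketch of the kind of request that triggers it; the helper function and values are illustrative, not taken from the thread. pmap_page_set_memattr() is the per-page entry point named above; a driver mapping memory uncacheable via pmap_mapdev_attr() ends up on the same pmap_change_attr() path.

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <vm/vm.h>
	#include <vm/pmap.h>

	/* Hypothetical helper: map one page of registers uncacheable. */
	static void *
	map_regs_uncacheable(vm_paddr_t pa)
	{
		/*
		 * If pa is ordinary RAM also reachable through the direct map,
		 * the 2MB/1GB direct-map superpage covering it must be demoted
		 * so that just this 4KB frame can get the new attribute.
		 */
		return (pmap_mapdev_attr(pa, PAGE_SIZE, VM_MEMATTR_UNCACHEABLE));
	}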
Re: b_freelist TAILQ/SLIST
On 28.06.2013 09:57, Konstantin Belousov wrote:
> On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
>> While doing some profiles of GEOM/CAM IOPS scalability, on some test patterns I've noticed serious congestion with spinning on the global pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is already very simple, I've tried to optimize probably the only thing possible there: switch bswlist from TAILQ to SLIST. As I can see, the b_freelist field of struct buf really is used as a TAILQ in some other places, so I've just added another SLIST_ENTRY field. And the result appeared to be surprising -- I can no longer reproduce the issue at all. Maybe it was just unlucky synchronization of the specific test, but I've seen it on two different systems and rechecked the results with/without the patch three times.
>
> This is too unbelievable.

I understand that it looks like magic. I was very surprised to see contention there at all, but `pmcstat -n 1000 -TS unhalted-cycles` shows it too often and too repeatably:

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 28052 (100.0%) , 12 unresolved

%SAMP IMAGE  FUNCTION             CALLERS
 46.4 kernel __mtx_lock_sleep     relpbuf:22.3 getpbuf:22.0 xpt_run_devq:0.8
 13.3 kernel _mtx_lock_spin_cooki turnstile_trywait
  4.3 kernel cpu_search_lowest    cpu_search_lowest
  2.3 kernel getpbuf              physio

, and the benchmark results confirm it.

> Could it be, e.g., some cache line conflicts which cause the thrashing, in fact? Does it help if you add a void *b_pad before b_freelist instead of adding b_freeslist?

No, this doesn't help. And previously I've tested it also with b_freeslist in place but without the other changes -- it didn't help either.

>> The present patch is here: http://people.freebsd.org/~mav/buf_slist.patch
>>
>> The question is how to do it better? What is the KPI/KBI policy for struct buf? I could replace b_freelist with a union and keep the KBI, but partially break the KPI. Or I could add another field, probably breaking the KBI, but keeping the KPI. Or I could do something handmade with no breakage. Or is this change just a bad idea?
>
> The same question applies to using a union for b_freelist/b_freeslist: is the effect of magically fixing the contention still there if b_freeslist is at the same offset as b_freelist?

Yes, it is.

> There is no K{B,P}I policy for struct buf in HEAD, just change it as it fits.

Which one would you prefer, the original or http://people.freebsd.org/~mav/buf_slist2.patch ?

Thank you.

--
Alexander Motin
Re: bsd boot sequence
On 6/27/13, Brian Kim briansa...@gmail.com wrote:
> howdy all,
>
> As a junior computer engineering major who has dreams of developing an operating system more ubiquitous than ms windows, I have come to appreciate the complexity and elegance of both the freebsd os and community. While achieving proficiency in C programming yet lacking in both UNIX utility know-how and shell scripting, I would rate my hacker savvy to be on an intermediate level. By the time I graduate from college, I hope to have a thorough understanding of the bsd kernel on the lowest level. Don't get me wrong though: while my end-game may be to develop my own operating system, I will forever be a contributor to freebsd. As my understanding of computers develops, you will undoubtedly see my name appear more frequently on the freebsd-hacker emails. Unfortunately, I am just not at the level of understanding just yet.
>
> To start, I was wondering if someone could briefly explain the operations and function calls that occur at boot time. I wish to thoroughly examine the freebsd source code but I'm afraid the sheer volume of code that exists leaves me with no real starting point for my research.
>
> Thank you all for your time.
>
> --
> Best Wishes,
> Brian Kim

Boot's probably not as bad as it seems. There's boot0, boot1, boot2, loader, and kernel. So those would probably be the best places to start digging. A few hundred k of code if I'm not mistaken. Not entirely unmanageable.

Some ideas that might help:

- Scan source for comments and function/method names
- Use doxygen to generate high-level source documentation
- Generate UML diagrams from source (class and sequence diagrams would probably be most useful)
- As Warren suggested, read all the documentation you can find
- Google
- Take notes as you go, publish them on a blog, and share the URL with us!

-David
Re: IPv6 - sobind fails with 49
On 28.06.2013 08:15, Sreenivasa Honnur wrote:
>         if (rv != 0) {
>                 printf("sock create ipv6 %s failed %d.\n", tbuf, rv);
>                 return NULL;
>         }

Can you try inserting this here?

	bzero(&saddr6, sizeof(saddr6));
	saddr6.sin6_family = AF_INET6;

--
WBR, Andrey V. Elsukov
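Error 49 is EADDRNOTAVAIL on FreeBSD, and an uninitialized sockaddr_in6 (wrong sin6_len, a stray scope id, or garbage in sin6_addr) is a common way to get it from an in-kernel sobind(). A rough sketch of a fully initialized wildcard bind, with hypothetical variable and function names since the original code is not shown:

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/proc.h>
	#include <sys/socket.h>
	#include <sys/socketvar.h>
	#include <netinet/in.h>

	/* Hypothetical helper: bind an in-kernel IPv6 socket to the wildcard address. */
	static int
	bind_v6_any(struct socket *so, uint16_t port)
	{
		struct sockaddr_in6 saddr6;

		bzero(&saddr6, sizeof(saddr6));
		saddr6.sin6_len = sizeof(saddr6);	/* FreeBSD sockaddrs carry a length */
		saddr6.sin6_family = AF_INET6;
		saddr6.sin6_port = htons(port);
		saddr6.sin6_addr = in6addr_any;		/* leftover garbage here tends to yield EADDRNOTAVAIL */

		return (sobind(so, (struct sockaddr *)&saddr6, curthread));
	}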
Re: b_freelist TAILQ/SLIST
.. i'd rather you narrow down _why_ it's performing better before committing it. Otherwise it may just creep up again after someone does another change in an unrelated part of the kernel.

You're using instructions-retired; how about using l1/l2 cache loads, stores, etc? There's a lot more CPU counters available.

You have a very cool problem to solve. If I could reproduce it locally I'd give you a hand.

Thanks,

-adrian
Re: b_freelist TAILQ/SLIST
On 28.06.2013 18:14, Adrian Chadd wrote:
> .. i'd rather you narrow down _why_ it's performing better before committing it.

If you have good guesses -- they are welcome. All those functions are so small that it is hard to imagine how congestion may happen there at all. I have a strong feeling that the lock spinning there consumes incomparably more CPU time than the locked region itself could consume.

> Otherwise it may just creep up again after someone does another change in an unrelated part of the kernel.

Big win or small, TAILQ is still heavier than STAILQ, while it is not needed there at all.

> You're using instructions-retired; how about using l1/l2 cache loads, stores, etc? There's a lot more CPU counters available.

I am using unhalted-cycles, which is more reasonable than instructions-retired. As for other counters, there are indeed a lot of them, but it is not always easy to get something useful out of them.

> You have a very cool problem to solve. If I could reproduce it locally I'd give you a hand.

You'd need a lot of hardware and patches to reproduce it in full. But if you'd like to see this with your own eyes, I can give you SSH access to my test machine.

--
Alexander Motin
Re: b_freelist TAILQ/SLIST
On 28 June 2013 08:37, Alexander Motin m...@freebsd.org wrote:
>> Otherwise it may just creep up again after someone does another change in an unrelated part of the kernel.
>
> Big win or small, TAILQ is still heavier than STAILQ, while it is not needed there at all.

You can't make that assumption. I bet that if both pointers are in the _same_ cache line, the overhead of maintaining a double linked list is trivial. You've already bitten the overhead of loading the struct from RAM into L1/L2 cache; once that's done, the second pointer operation should just hit the cache. Same with the store - if they're in the same cache line, the overhead will be very small. The memory controller is likely operating on multiples of L1 cache lines anyway.

The only time I can see that being a benefit (on current hardware) is if you're trying to pack more into cache lines; but then if you're doing that, you're likely better off with an array rather than lists. That way you don't waste 4/8 (or 2x that for the double-linked list) bytes for each pointer for each list element, as well as another 4/8 byte pointer to 'self'. But that doesn't -always- give a boost; it also depends upon memory layout and whether you're using all the cache lines or not.

So yes - I think you're on to something and I think it's worth digging deeper into what's going on before you commit something. No, don't commit the SLIST change until you absolutely, positively know why it's actually being helpful. It's way way too voodoo right now.

Thanks,

Adrian
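As an illustration of the array suggestion above, here is a minimal fixed-size free "stack" of buffer pointers. It is a sketch only; the names and the MAXPBUF size are hypothetical, and real pbuf code would hold pbuf_mtx around these operations and sleep instead of returning NULL when the pool is empty.

	#define MAXPBUF	256			/* hypothetical pool size */

	struct buf;				/* opaque here */

	static struct buf *pbuf_stack[MAXPBUF];	/* contiguous array of free pointers */
	static int pbuf_nfree;			/* number of valid entries */

	static struct buf *
	getpbuf_array(void)
	{
		if (pbuf_nfree == 0)
			return (NULL);		/* real code would msleep() here */
		return (pbuf_stack[--pbuf_nfree]);	/* touches only the counter and one slot */
	}

	static void
	relpbuf_array(struct buf *bp)
	{
		pbuf_stack[pbuf_nfree++] = bp;	/* no per-element link fields needed */
	}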
Re: b_freelist TAILQ/SLIST
On Fri, Jun 28, 2013 at 08:14:42AM -0700, Adrian Chadd wrote:
> .. i'd rather you narrow down _why_ it's performing better before committing it. Otherwise it may just creep up again after someone does another change in an unrelated part of the kernel.

Or penalize some other set of machines where this is currently not a problem. The cause should be identified before any change is committed.

> You're using instructions-retired; how about using l1/l2 cache loads, stores, etc? There's a lot more CPU counters available.
>
> You have a very cool problem to solve. If I could reproduce it locally I'd give you a hand.
>
> Thanks,
> -adrian
Re: b_freelist TAILQ/SLIST
On Fri, Jun 28, 2013 at 8:56 AM, Adrian Chadd adr...@freebsd.org wrote:
> On 28 June 2013 08:37, Alexander Motin m...@freebsd.org wrote:
>>> Otherwise it may just creep up again after someone does another change in an unrelated part of the kernel.
>>
>> Big win or small, TAILQ is still heavier than STAILQ, while it is not needed there at all.
>
> You can't make that assumption. I bet that if both pointers are in the _same_ cache line, the overhead of maintaining a double linked list is trivial.

No, it's not. A singly-linked SLIST only needs to modify the head of the list and the current element. A doubly-linked LIST needs to modify both the head as well as the old first element, which may not be in cache (and may not be in the same TLB, either). I don't recall exactly what [S]TAILQ touches, but the doubly-linked version still has to modify more entries that are not contiguous.

Thanks,

matthew
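The point is visible directly in the queue(3) macros. Paraphrased from FreeBSD's sys/queue.h (debugging hooks omitted), a head insert looks roughly like this; note that TAILQ_INSERT_HEAD conditionally writes into the *old* first element (or the tail pointer), while SLIST_INSERT_HEAD writes only the new element and the list head:

	#define SLIST_INSERT_HEAD(head, elm, field) do {			\
		SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
		SLIST_FIRST((head)) = (elm);					\
	} while (0)

	#define TAILQ_INSERT_HEAD(head, elm, field) do {			\
		if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
			TAILQ_FIRST((head))->field.tqe_prev =			\
			    &TAILQ_NEXT((elm), field);				\
		else								\
			(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
		TAILQ_FIRST((head)) = (elm);					\
		(elm)->field.tqe_prev = &TAILQ_FIRST((head));			\
	} while (0)

The extra store into the old head element (plus the branch) is exactly what goes away with the singly-linked variant; whether that is what matters here is what the profiling is trying to establish.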
Re: b_freelist TAILQ/SLIST
On 28.06.2013 18:56, Adrian Chadd wrote:
> On 28 June 2013 08:37, Alexander Motin m...@freebsd.org wrote:
>>> Otherwise it may just creep up again after someone does another change in an unrelated part of the kernel.
>>
>> Big win or small, TAILQ is still heavier than STAILQ, while it is not needed there at all.
>
> You can't make that assumption. I bet that if both pointers are in the _same_ cache line, the overhead of maintaining a double linked list is trivial. You've already bitten the overhead of loading the struct from RAM into L1/L2 cache; once that's done, the second pointer operation should just hit the cache. Same with the store - if they're in the same cache line, the overhead will be very small. The memory controller is likely operating on multiples of L1 cache lines anyway.
>
> The only time I can see that being a benefit (on current hardware) is if you're trying to pack more into cache lines; but then if you're doing that, you're likely better off with an array rather than lists. That way you don't waste 4/8 (or 2x that for the double-linked list) bytes for each pointer for each list element, as well as another 4/8 byte pointer to 'self'. But that doesn't -always- give a boost; it also depends upon memory layout and whether you're using all the cache lines or not.

That is true, but TAILQ_INSERT/_REMOVE still modify one more cache line than SLIST and also use branching.

> So yes - I think you're on to something and I think it's worth digging deeper into what's going on before you commit something. No, don't commit the SLIST change until you absolutely, positively know why it's actually being helpful. It's way way too voodoo right now.

OK, I'll dig it more.

--
Alexander Motin
Re: expanding amd64 past the 1TB limit
On Wed, Jun 26, 2013 at 10:11:44AM -0600, Chris Torek wrote:
> diff --git a/conf/options.amd64 b/conf/options.amd64
> index 90348b7..f3ce505 100644
> --- a/conf/options.amd64
> +++ b/conf/options.amd64
> @@ -1,6 +1,7 @@
>  # $FreeBSD$
>  # Options specific to AMD64 platform kernels
> +AMD64_HUGE opt_global.h

Is this option needed? The SMAP is already parsed at the time of pmap_bootstrap() call, so you could determine the amount of physical memory and size the KVA map accordingly?
Re: b_freelist TAILQ/SLIST
On 28 June 2013 09:18, m...@freebsd.org wrote:
>> You can't make that assumption. I bet that if both pointers are in the _same_ cache line, the overhead of maintaining a double linked list is trivial.
>
> No, it's not. A singly-linked SLIST only needs to modify the head of the list and the current element. A doubly-linked LIST needs to modify both the head as well as the old first element, which may not be in cache (and may not be in the same TLB, either). I don't recall exactly what [S]TAILQ touches, but the doubly-linked version still has to modify more entries that are not contiguous.

Good point.

-adrian
Re: expanding amd64 past the 1TB limit
[combining two messages and adding kib and alc to cc per Oliver Pinter]

>> the CPU's CR4 on entry to the kernel.
>
> It is %cr3.

Oops, well, easily fixed. :-)

>> (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E entries into the new pmap, the L3 pages would not have to be physically contiguous, but the KVA ones would still all have to exist. It's free to allocate physically contiguous pages here anyway though.)
>
> I do not see how the physical contiguity of the allocated page table pages is relevant there.

Not in create_pagetables(), no, but later in pmap_pinit(), which has loops to set pmap->pm_pml4[x] for kernel and direct-map. And:

> Copying the L4 or L3 PTEs would cause serious complications.

Perhaps what I wrote was a little fuzzy. Here's the pmap_pinit() code I was referring to, as modified (the original version has only the second loop -- it assumes NKPML4E is always 1, so it just sets pml4[KPML4I]):

	pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pml4pg));
	if ((pml4pg->flags & PG_ZERO) == 0)
		pagezero(pmap->pm_pml4);

	for (i = 0; i < NKPML4E; i++) {
		pmap->pm_pml4[KPML4BASE + i] =
		    (KPDPphys + (i << PAGE_SHIFT)) | PG_RW | PG_V | PG_U;
	}
	for (i = 0; i < NDMPML4E; i++) {
		pmap->pm_pml4[DMPML4I + i] =
		    (DMPDPphys + (i << PAGE_SHIFT)) | PG_RW | PG_V | PG_U;
	}

	/* install self-referential address mapping entry(s) */

These require that KPDPphys and DMPDPphys both point to the first of n physically-contiguous pages.

But suppose we did this (this is deliberately simple for illustration, and furthermore I am assuming here that vmspace0 never acquires any user-level L4 entries):

	pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pml4pg));

	/* Clear any junk and wire in kernel global address entries. */
	bcopy(vmspace0.vm_pmap.pm_pml4, pmap->pm_pml4, NBPG);

	/* install self-referential address mapping entry(s) */

Now whatever we set up in create_pagetables() is simply copied to new (user) pmaps, so we could go totally wild if we wanted. :-)

>> So, the last NKPML4E slots in KPML4phys point to the following page tables, which use all of L3, L2, and L1 style PTEs. (Note that we did not need any L1 PTEs for the direct map, which always uses 2MB or 1GB super-pages.)
>
> This is not quite true. In the initial state, indeed, all PTEs for the direct map are superpages, either 1G or 2M. But Intel states that a situation where a physical page has mappings with different caching modes causes undefined behaviour. As a result, if a page is remapped with a non-write-back caching attribute, the direct map has to demote the superpage and adjust the mapping attribute of the page frame for the page.

Yes, this particular bit of description was restricted to the setup work in create_pagetables(). (Perhaps I should take out "always", or substitute "initially"?) Also, I think I left out a description of the loop where some KPDphys entries are overwritten with 2MB mappings.

>> +AMD64_HUGE opt_global.h
>
> Is this option needed? The SMAP is already parsed at the time of pmap_bootstrap() call, so you could determine the amount of physical memory and size the KVA map accordingly?
Mostly I was afraid of the consequences for VM_MIN_KERNEL_ADDRESS, which is #included so widely, and any complaints people might have about:

 - wasting NKPML4E-2 (i.e., 14) pages on small AMD64 systems (for the new empty L3 pages in KPDphys that will likely not be used);

 - wasting yet another page because dynamic memory will start at the first new L3 page (via KPML4BASE) instead of just using the KPML4I'th one, because VM_MIN_KERNEL_ADDRESS is now at -8TB instead of -.5TB -- with VM_MIN_KERNEL_ADDRESS at -.5TB, all KVAs use the single KPML4I'th slot;

 - wasting 30 more pages because NDMPML4E grew from 2 to 32; and

 - adding a loop to set up NKPML4E entries in every pmap, instead of the single "shove KPDphys into one slot" code that used to be there, and making the pmap_pinit loop run 32 times instead of just 2 for the direct map.

Adding these up, the option chews up 45 pages, or 180 kbytes, when compared to the current setup (1 TB direct map, .5 TB kernel VM). 180 kbytes is pretty trivial if you're planning to have a couple of terabytes of RAM, but on a tiny machine ... of course if it's that tiny you could run as i386, in 32 bit mode. :-)

If we copied the kernel's L4 table to new pmaps -- or even just put in a new ndmpdpphys variable -- we could avoid allocating any pages for DMPDPphys that we know won't actually be used. That would fix the 30 extra pages above, and even regain one page on many amd64 setups (those with <= 512 GB). We'd be down to just 14 extra pages = 56 kbytes, plus the new loop in pmap_pinit().

Here's prototype code for sizing DMPDPphys, for illustration:

old:	DMPDPphys = allocpages(firstaddr,
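The quoted prototype is cut off in the archive right after "old:", so the actual code is not recoverable here. Purely as an illustration of the idea in the preceding paragraph (not Torek's patch), sizing the direct-map L3 allocation from the installed RAM rather than the worst case might look roughly like this, where ndmpdp is the number of direct-map L3 entries needed and NPDPEPG is the number of entries per page:

	/* worst case: always allocate NDMPML4E pages of L3 entries */
	DMPDPphys = allocpages(firstaddr, NDMPML4E);

	/* sketch: allocate only as many L3 pages as the installed RAM needs */
	ndmpdpphys = howmany(ndmpdp, NPDPEPG);
	DMPDPphys = allocpages(firstaddr, ndmpdpphys);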
Re: b_freelist TAILQ/SLIST
On 28.06.2013 09:57, Konstantin Belousov wrote:
> On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
>> While doing some profiles of GEOM/CAM IOPS scalability, on some test patterns I've noticed serious congestion with spinning on the global pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is already very simple, I've tried to optimize probably the only thing possible there: switch bswlist from TAILQ to SLIST. As I can see, the b_freelist field of struct buf really is used as a TAILQ in some other places, so I've just added another SLIST_ENTRY field. And the result appeared to be surprising -- I can no longer reproduce the issue at all. Maybe it was just unlucky synchronization of the specific test, but I've seen it on two different systems and rechecked the results with/without the patch three times.
>
> This is too unbelievable. Could it be, e.g., some cache line conflicts which cause the thrashing, in fact?

I think it indeed may be cache thrashing. I've done some profiling of getpbuf()/relpbuf() and found interesting results. With the patched kernel using SLIST, profiling shows mostly one point of RESOURCE_STALLS.ANY in relpbuf() -- the first lock acquisition causes 78% of them. Later memory accesses, including the lock release, hit the same cache line and are almost free. With the clean kernel using TAILQ, I see RESOURCE_STALLS.ANY spread almost equally between the lock acquisition, the bswlist access and the lock release. It looks like the cache line is constantly erased by something.

My guess was that the patch somehow changed cache line sharing. But several checks with nm showed that, while the memory allocation indeed changed slightly, in both cases the content of the cache line in question is absolutely the same, just shifted in memory by 128 bytes. I guess the cache line could be thrashed by threads doing adaptive spinning on the lock after a collision happened. That thrashing increases the lock hold time and further increases the chance of additional collisions. Maybe the switch from TAILQ to SLIST slightly reduces the lock hold time, reducing the chance of the cumulative effect. The difference is not big, but in this test this global lock is acquired 1.5M times per second by 256 threads on 24 CPUs (12xL2 and 2xL3 caches).

Another guess was that we have some bad case of false cache line sharing, but I don't know how that can be either checked or avoided.

At the last moment, mostly for luck, I've tried to switch pbuf_mtx from mtx to mtx_padalign on the clean kernel. To my surprise that also seems to have fixed the congestion problem, but I can't explain why. RESOURCE_STALLS.ANY still shows there is cache thrashing, but the lock spinning has gone.

Any ideas about what is going on there?

--
Alexander Motin
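For reference, the mtx vs. mtx_padalign experiment mentioned above is essentially a declaration change: struct mtx_padalign is the same mutex, padded and aligned to CACHE_LINE_SIZE so that no neighbouring object (such as a list head) can land in its cache line. A rough sketch with hypothetical variable names, not the actual vm_pager.c source:

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>

	/* Plain mutex: whatever the linker places next to it may share its cache line. */
	static struct mtx		pbuf_mtx_plain;

	/* Padded mutex: sized and aligned to a full cache line, so it has the line to itself. */
	static struct mtx_padalign	pbuf_mtx_padded;

Only the object's size and alignment differ, which is why such a mutex can be swapped in without touching the surrounding locking logic.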
Re: b_freelist TAILQ/SLIST
On 28 June 2013 15:15, Alexander Motin m...@freebsd.org wrote:
> I think it indeed may be cache thrashing. I've done some profiling of getpbuf()/relpbuf() and found interesting results. With the patched kernel using SLIST, profiling shows mostly one point of RESOURCE_STALLS.ANY in relpbuf() -- the first lock acquisition causes 78% of them. Later memory accesses, including the lock release, hit the same cache line and are almost free. With the clean kernel using TAILQ, I see RESOURCE_STALLS.ANY spread almost equally between the lock acquisition, the bswlist access and the lock release. It looks like the cache line is constantly erased by something.

Can you narrow down the resource stall check to each of the sub-types? See which one/ones it is?

-adrian
Re: b_freelist TAILQ/SLIST
On Sat, Jun 29, 2013 at 01:15:19AM +0300, Alexander Motin wrote:
> On 28.06.2013 09:57, Konstantin Belousov wrote:
>> On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
>>> While doing some profiles of GEOM/CAM IOPS scalability, on some test patterns I've noticed serious congestion with spinning on the global pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is already very simple, I've tried to optimize probably the only thing possible there: switch bswlist from TAILQ to SLIST. As I can see, the b_freelist field of struct buf really is used as a TAILQ in some other places, so I've just added another SLIST_ENTRY field. And the result appeared to be surprising -- I can no longer reproduce the issue at all. Maybe it was just unlucky synchronization of the specific test, but I've seen it on two different systems and rechecked the results with/without the patch three times.
>>
>> This is too unbelievable. Could it be, e.g., some cache line conflicts which cause the thrashing, in fact?
>
> I think it indeed may be cache thrashing. I've done some profiling of getpbuf()/relpbuf() and found interesting results. With the patched kernel using SLIST, profiling shows mostly one point of RESOURCE_STALLS.ANY in relpbuf() -- the first lock acquisition causes 78% of them. Later memory accesses, including the lock release, hit the same cache line and are almost free. With the clean kernel using TAILQ, I see RESOURCE_STALLS.ANY spread almost equally between the lock acquisition, the bswlist access and the lock release. It looks like the cache line is constantly erased by something.
>
> My guess was that the patch somehow changed cache line sharing. But several checks with nm showed that, while the memory allocation indeed changed slightly, in both cases the content of the cache line in question is absolutely the same, just shifted in memory by 128 bytes. I guess the cache line could be thrashed by threads doing adaptive spinning on the lock after a collision happened. That thrashing increases the lock hold time and further increases the chance of additional collisions. Maybe the switch from TAILQ to SLIST slightly reduces the lock hold time, reducing the chance of the cumulative effect. The difference is not big, but in this test this global lock is acquired 1.5M times per second by 256 threads on 24 CPUs (12xL2 and 2xL3 caches).
>
> Another guess was that we have some bad case of false cache line sharing, but I don't know how that can be either checked or avoided.
>
> At the last moment, mostly for luck, I've tried to switch pbuf_mtx from mtx to mtx_padalign on the clean kernel. To my surprise that also seems to have fixed the congestion problem, but I can't explain why. RESOURCE_STALLS.ANY still shows there is cache thrashing, but the lock spinning has gone.
>
> Any ideas about what is going on there?

FWIW, Jeff just changed the pbuf_mtx allocation to use padalign; it is a somewhat unrelated change, in r252330.

Are pbuf_mtx and bswlist located next to each other in your kernel? If yes, then I would expect that the explanation is how the MESI protocol and atomics work. When performing a locked op, the CPU takes the whole cache line into exclusive ownership. Since our locks try the cmpset as the first operation, and then the 'adaptive' loop interleaves cmpset with the check for ownership, false cache line sharing between pbuf_mtx and bswlist should result in exactly such effects. Different cores would bounce the ownership of the cache line, slowing down the accesses.

AFAIR Intel exports some MESI events as performance counters, but I was not able to find a reference in the architecturally defined events.
It seems to be present in the model-specific part.