Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Konstantin Belousov
On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
 Hi.
 
 While doing some profiles of GEOM/CAM IOPS scalability, on some test
 patterns I've noticed serious congestion with spinning on the global
 pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is
 already very simple, I've tried to optimize probably the only thing
 possible there: switch bswlist from a TAILQ to an SLIST. As I can see,
 the b_freelist field of struct buf is really used as a TAILQ in some
 other places, so I've just added another SLIST_ENTRY field. And the
 result appeared to be surprising -- I can no longer reproduce the issue
 at all. Maybe it was just unlucky synchronization of a specific test,
 but I've seen it on two different systems and rechecked the results
 with/without the patch three times.
This is too unbelievable.  Could it be, e.g., some cache line conflicts
which cause the thrashing, in fact?  Does it help if you add a void *b_pad
before b_freelist instead of adding b_freeslist?

 
 The present patch is here:
 http://people.freebsd.org/~mav/buf_slist.patch
 
 The question is how to do it better? What is the KPI/KBI policy for 
 struct buf? I could replace b_freelist by a union and keep KBI, but 
 partially break KPI. Or I could add another field, probably breaking 
 KBI, but keeping KPI. Or I could do something handmade with no breakage. 
 Or this change is just a bad idea?
The same question applies to using a union for b_freelist/b_freeslist: is the
effect of magically fixing the contention still there if b_freeslist
is at the same offset as b_freelist?

There is no K{B,P}I policy for struct buf in HEAD; just change it as you
see fit.
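For concreteness, here is a minimal sketch of the switch being described,
using the names from the message (bswlist, pbuf_mtx, b_freeslist). This is
only an illustration of the idea, not the patch at the URL above, and the
real getpbuf()/relpbuf() also handle waiting when no pbuf is available:

/*
 * Sketch only: struct buf is assumed to have gained
 *     SLIST_ENTRY(buf) b_freeslist;
 * next to the existing TAILQ_ENTRY(buf) b_freelist.
 */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/bio.h>
#include <sys/buf.h>

static SLIST_HEAD(, buf) bswlist = SLIST_HEAD_INITIALIZER(bswlist);
static struct mtx pbuf_mtx;

/* Pop one pbuf off the free list, or return NULL if it is empty. */
static struct buf *
getpbuf_sketch(void)
{
        struct buf *bp;

        mtx_lock(&pbuf_mtx);
        bp = SLIST_FIRST(&bswlist);
        if (bp != NULL)
                SLIST_REMOVE_HEAD(&bswlist, b_freeslist);
        mtx_unlock(&pbuf_mtx);
        return (bp);
}

/* Return a pbuf to the free list. */
static void
relpbuf_sketch(struct buf *bp)
{
        mtx_lock(&pbuf_mtx);
        SLIST_INSERT_HEAD(&bswlist, bp, b_freeslist);
        mtx_unlock(&pbuf_mtx);
}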




Re: expanding amd64 past the 1TB limit

2013-06-28 Thread Konstantin Belousov
On Thu, Jun 27, 2013 at 03:23:36PM -0600, Chris Torek wrote:
 OK, I wasted :-) way too much time, but here's a text file that
 can be comment-ified or stored somewhere alongside the code or
 whatever...
I think it would be a nice addition to the VM article in the doc
collection.  The content is correct and useful, IMO.
See some notes below.

 The reason for having those empty (all-zero) PTE pages is that
 whenever new processes are created, in pmap_pinit(), they have
 their (new) L4 PTE page set up to point to the *same* physical
 pages that the kernel is using.  Thus, if the kernel creates or
 destroys any level-3-or-below mapping by writing into any of the
 above four pages, that mapping is also created/destroyed in all
 processes.  Similarly, the NDMPML4 pages starting at DMPDPphys are
 mapped identically in all processes.  The kernel can therefore
 borrow a user pmap at any time, i.e., there's no need to adjust
 the CPU's CR4 on entry to the kernel.
It is %cr3.

 
 (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E
 entries into the new pmap, the L3 pages would not have to be
 physically contiguous, but the KVA ones would still all have to
 exist.  It's free to allocate physically contiguous pages here
 anyway though.)
I do not see how the physical contiguity of the allocated page table
pages is relevant there.

Copying the L4 or L3 PTEs would cause serious complications.  In fact,
this is how i386 operates, since due to limited VA, pmap does not have
the luxury of allocating entire L(top-1) page table pages to the kernel.
As a result, pmap_kenter_pde() has to iterate over the whole pmap
list, which is maintained just to be able to update KVA mappings.

 
 So, the last NKPML4E slots in KPML4phys point to the following
 page tables, which use all of L3, L2, and L1 style PTEs.  (Note
 that we did not need any L1 PTEs for the direct map, which always
 uses 2MB or 1GB super-pages.)
This is not quite true. In the initial state, indeed all PTEs for the direct
map are superpages, either 1G or 2M. But Intel states that a situation where
a physical page has mappings with different caching modes causes undefined
behaviour. As a result, if a page is remapped with a non-write-back caching
attribute, the direct map has to demote the superpage and adjust the mapping
attribute of the page frame for the page.

As a result, it is possible to have small mappings in the direct map.
See pmap_page_set_memattr(), which delegates the work to
pmap_change_attr(_locked).
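As a small illustration of that path (a sketch, not code from the tree): a
caller that needs an uncacheable view of a page frame can go through the
direct map, which is the kind of request that forces the demotion described
above. The function name and the caller-supplied pa are hypothetical:

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/pmap.h>
#include <machine/vmparam.h>

/*
 * Remap one direct-map page as uncacheable.  On amd64 this makes
 * pmap_change_attr() demote the 2M/1G direct-map superpage covering
 * "pa" down to 4K PTEs before changing the caching attribute.
 */
static int
make_page_uncacheable(vm_paddr_t pa)
{
        vm_offset_t va;

        va = PHYS_TO_DMAP(pa);
        return (pmap_change_attr(va, PAGE_SIZE, VM_MEMATTR_UNCACHEABLE));
}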




Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Alexander Motin

On 28.06.2013 09:57, Konstantin Belousov wrote:

On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:

While doing some profiles of GEOM/CAM IOPS scalability, on some test
patterns I've noticed serious congestion with spinning on the global
pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is
already very simple, I've tried to optimize probably the only thing
possible there: switch bswlist from a TAILQ to an SLIST. As I can see,
the b_freelist field of struct buf is really used as a TAILQ in some
other places, so I've just added another SLIST_ENTRY field. And the
result appeared to be surprising -- I can no longer reproduce the issue
at all. Maybe it was just unlucky synchronization of a specific test,
but I've seen it on two different systems and rechecked the results
with/without the patch three times.

This is too unbelievable.


I understand that it looks like magic. I was very surprised to see 
contention there at all, but `pmcstat -n 1000 -TS unhalted-cycles` 
shows it too often and repeatably:


PMC: [CPU_CLK_UNHALTED_CORE] Samples: 28052 (100.0%), 12 unresolved

%SAMP IMAGE  FUNCTION             CALLERS
 46.4 kernel __mtx_lock_sleep     relpbuf:22.3 getpbuf:22.0 xpt_run_devq:0.8
 13.3 kernel _mtx_lock_spin_cooki turnstile_trywait
  4.3 kernel cpu_search_lowest    cpu_search_lowest
  2.3 kernel getpbuf              physio

, and benchmark results confirm it.


Could it be, e.g., some cache line conflicts
which cause the thrashing, in fact?  Does it help if you add a void *b_pad
before b_freelist instead of adding b_freeslist?


No, this doesn't help. And previously I've also tested it with 
b_freeslist in place but without the other changes -- it didn't help either.



The present patch is here:
http://people.freebsd.org/~mav/buf_slist.patch

The question is how to do it better? What is the KPI/KBI policy for
struct buf? I could replace b_freelist by a union and keep KBI, but
partially break KPI. Or I could add another field, probably breaking
KBI, but keeping KPI. Or I could do something handmade with no breakage.
Or this change is just a bad idea?

The same question applies to using a union for b_freelist/b_freeslist: is the
effect of magically fixing the contention still there if b_freeslist
is at the same offset as b_freelist?


Yes, it is.


There is no K{B,P}I policy for struct buf in HEAD; just change it as you
see fit.


Which one would you prefer, the original or 
http://people.freebsd.org/~mav/buf_slist2.patch ?
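One possible shape for the union variant under discussion -- a sketch only,
not necessarily what buf_slist2.patch does: overlaying the two link types
keeps b_freeslist at the same offset as b_freelist and leaves the structure
size unchanged, since SLIST_ENTRY is a single next pointer that aliases the
first pointer of the TAILQ_ENTRY:

struct buf {
        /* ... */
        union {
                TAILQ_ENTRY(buf) u_freelist;    /* existing TAILQ users */
                SLIST_ENTRY(buf) u_freeslist;   /* pbuf free list only */
        } b_freeu;
        /* ... */
};
#define b_freelist      b_freeu.u_freelist
#define b_freeslist     b_freeu.u_freeslist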


Thank you.

--
Alexander Motin


Re: bsd boot sequence

2013-06-28 Thread David Noel
On 6/27/13, Brian Kim briansa...@gmail.com wrote:
 howdy all,

 As a junior computer engineering major who has dreams of developing an
 operating system more ubiquitous than ms windows, I have come to appreciate
 the complexity and elegance of both the freebsd os and community. While
 achieving proficiency in C programming yet lacking in both UNIX utility
 know-how and shell scripting, I would rate my hacker savvy to be on an
 intermediate level. By the time I graduate from college, I hope to have a
 thorough understanding of the bsd kernel on the lowest level.

 Don't get me wrong though: while my end-game may be to develop my own
 operating system, I will forever be a contributor to freebsd. As my
 understanding of computers develops, you will undoubtedly see my name
 appear more frequently on the freebsd-hacker emails. Unfortunately, I am
 just not at the level of understanding just yet.

 To start, I was wondering if someone could briefly explain the operations
 and function calls that occur at boot time. I wish to thoroughly examine
 the freebsd source code but I'm afraid the sheer volume of code that exists
 leaves me with no real starting point for my research.

 Thank you all for your time.

 --
 Best Wishes,
 Brian Kim

Boot's probably not as bad as it seems. There's boot0, boot1, boot2,
loader, and kernel. So those would probably be the best places to
start digging. A few hundred k of code if I'm not mistaken. Not
entirely unmanageable.

Some ideas that might help:

- Scan source for comments and function/method names
- Use doxygen to generate high-level source documentation
- Generate UML diagrams from source (class and sequence diagrams would
probably be most useful)
- As Warren suggested, read all the documentation you can find
- Google
- Take notes as you go, publish them on a blog, and share the URL with us!

-David


Re: IPv6 - sobind fails with 49

2013-06-28 Thread Andrey V. Elsukov
On 28.06.2013 08:15, Sreenivasa Honnur wrote:
 if (rv != 0) {
         printf("sock create ipv6 %s failed %d.\n", tbuf, rv);
         return NULL;
 }
 
 
Can you try inserting this here?

        bzero(&saddr6, sizeof(saddr6));
        saddr6.sin6_family = AF_INET6;
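In case it helps, a self-contained sketch of that suggestion (so, td and
port are assumed to come from the surrounding code, which is not shown in
full): zero the whole sockaddr_in6 and set sin6_len as well as sin6_family
before calling sobind(); stray bytes or a wrong length are a common way to
get EADDRNOTAVAIL (49) back:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <netinet/in.h>

static int
bind_v6_any(struct socket *so, uint16_t port, struct thread *td)
{
        struct sockaddr_in6 saddr6;

        bzero(&saddr6, sizeof(saddr6));
        saddr6.sin6_len = sizeof(saddr6);
        saddr6.sin6_family = AF_INET6;
        saddr6.sin6_port = htons(port);
        saddr6.sin6_addr = in6addr_any;         /* or a specific address */

        return (sobind(so, (struct sockaddr *)&saddr6, td));
}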


-- 
WBR, Andrey V. Elsukov


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Adrian Chadd
.. i'd rather you narrow down _why_ it's performing better before committing it.

Otherwise it may just creep up again after someone does another change
in an unrelated part of the kernel.

You're using instructions-retired; how about using l1/l2 cache loads,
stores, etc? There's a lot more CPU counters available.

You have a very cool problem to solve. If I could reproduce it locally
I'd give you a hand.

Thanks,



-adrian


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Alexander Motin

On 28.06.2013 18:14, Adrian Chadd wrote:

.. i'd rather you narrow down _why_ it's performing better before committing it.


If you have good guesses -- they are welcome. All those functions are so 
small that it is hard to imagine how congestion may happen there at 
all. I have a strong feeling that lock spinning there consumes 
incomparably more CPU time than the locked region itself could consume.



Otherwise it may just creep up again after someone does another change
in an unrelated part of the kernel.


Big win or small, TAILQ is still heavier than STAILQ, while it is not 
needed there at all.



You're using instructions-retired; how about using l1/l2 cache loads,
stores, etc? There's a lot more CPU counters available.


I am using unhalted-cycles, which is more reasonable than 
instructions-retired. As for other counters, there are indeed a 
lot of them, but it is not always easy to get something useful out of them.



You have a very cool problem to solve. If I could reproduce it locally
I'd give you a hand.


You'd need a lot of hardware and patches to reproduce it in full. But if 
you'd like to see this with your own eyes, I can give you SSH access to 
my test machine.


--
Alexander Motin


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Adrian Chadd
On 28 June 2013 08:37, Alexander Motin m...@freebsd.org wrote:
 Otherwise it may just creep up again after someone does another change
 in an unrelated part of the kernel.

 Big win or small, TAILQ is still heavier than STAILQ, while it is not needed
 there at all.

You can't make that assumption. I bet that if both pointers are in the
_same_ cache line, the overhead of maintaining a double linked list is
trivial. You've already paid the overhead of loading the struct from
RAM into L1/L2 cache; once that's done, the second pointer operation
should just hit the cache.

Same with the store - if they're in the same cache line, the overhead
will be very small. The memory controller is likely operating on
multiples of L1 cache lines anyway.

The only time I can see that being a benefit (on current hardware) is
if you're trying to pack more into cache lines; but then if you're
doing that, you're likely better off with an array rather than lists.
That way you don't waste 4/8 (or 2x that for the double-linked list)
bytes for each pointer for each list element, as well as another 4/8
byte pointer to 'self'. But that doesn't -always- give a boost; it
also depends upon memory layout and whether you're using all the cache
lines or not.

So yes - I think you're on to something and I think it's worth digging
deeper into what's going on before you commit something. No, don't
commit the SLIST change until you absolutely, positively know why it's
actually being helpful. It's way way too voodoo right now.

Thanks,


Adrian


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Konstantin Belousov
On Fri, Jun 28, 2013 at 08:14:42AM -0700, Adrian Chadd wrote:
 .. i'd rather you narrow down _why_ it's performing better before committing 
 it.
 
 Otherwise it may just creep up again after someone does another change
 in an unrelated part of the kernel.
Or penalize some other set of machines where this is currently not a problem.

The cause should be identified before any change is committed.
 
 You're using instructions-retired; how about using l1/l2 cache loads,
 stores, etc? There's a lot more CPU counters available.
 
 You have a very cool problem to solve. If I could reproduce it locally
 I'd give you a hand.
 
 Thanks,
 
 
 
 -adrian




Re: b_freelist TAILQ/SLIST

2013-06-28 Thread mdf
On Fri, Jun 28, 2013 at 8:56 AM, Adrian Chadd adr...@freebsd.org wrote:

 On 28 June 2013 08:37, Alexander Motin m...@freebsd.org wrote:
  Otherwise it may just creep up again after someone does another change
  in an unrelated part of the kernel.
 
  Big win or small, TAILQ is still heavier than STAILQ, while it is not
  needed there at all.

 You can't make that assumption. I bet that if both pointers are in the
 _same_ cache line, the overhead of maintaining a double linked list is
 trivial.


No, it's not.  A singly-linked SLIST only needs to modify the head of the
list and the current element.  A doubly-linked LIST needs to modify both
the head as well as the old first element, which may not be in cache (and
may not be in the same TLB, either).  I don't recall exactly what [S]TAILQ
touches, but the doubly-linked version still has to modify more entries
that are not contiguous.
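For reference, the head insertions expand roughly as sketched in the
comments below (simplified, debug checks omitted); the doubly-linked
variant is the one that also has to write the tqe_prev pointer of the old
first element, which may sit in a cold cache line:

#include <sys/queue.h>

struct item {
        SLIST_ENTRY(item)  s_link;
        TAILQ_ENTRY(item)  t_link;
};

SLIST_HEAD(shead, item);
TAILQ_HEAD(thead, item);

/*
 * SLIST_INSERT_HEAD(sh, elm, s_link) is roughly:
 *      elm->s_link.sle_next = sh->slh_first;            store to elm
 *      sh->slh_first = elm;                             store to head
 *
 * TAILQ_INSERT_HEAD(th, elm, t_link) is roughly:
 *      if ((elm->t_link.tqe_next = th->tqh_first) != NULL)
 *              th->tqh_first->t_link.tqe_prev =
 *                  &elm->t_link.tqe_next;               store to OLD first
 *      else
 *              th->tqh_last = &elm->t_link.tqe_next;    store to head
 *      th->tqh_first = elm;                             store to head
 *      elm->t_link.tqe_prev = &th->tqh_first;           store to elm
 */
static void
insert_head(struct shead *sh, struct thead *th, struct item *elm)
{
        SLIST_INSERT_HEAD(sh, elm, s_link);
        TAILQ_INSERT_HEAD(th, elm, t_link);
}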

Thanks,
matthew


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Alexander Motin

On 28.06.2013 18:56, Adrian Chadd wrote:

On 28 June 2013 08:37, Alexander Motin m...@freebsd.org wrote:

Otherwise it may just creep up again after someone does another change
in an unrelated part of the kernel.


Big win or small, TAILQ is still heavier than STAILQ, while it is not needed
there at all.


You can't make that assumption. I bet that if both pointers are in the
_same_ cache line, the overhead of maintaining a double linked list is
trivial. You've already paid the overhead of loading the struct from
RAM into L1/L2 cache; once that's done, the second pointer operation
should just hit the cache.

Same with the store - if they're in the same cache line, the overhead
will be very small. The memory controller is likely operating on
multiples of L1 cache lines anyway.

The only time I can see that being a benefit (on current hardware) is
if you're trying to pack more into cache lines; but then if you're
doing that, you're likely better off with an array rather than lists.
That way you don't waste 4/8 (or 2x that for the double-linked list)
bytes for each pointer for each list element, as well as another 4/8
byte pointer to 'self'. But that doesn't -always- give a boost; it
also depends upon memory layout and whether you're using all the cache
lines or not.


That is true, but still TAILQ_INSERT/_REMOVE modify one more cache line 
than SLIST and also use branching.



So yes - I think you're on to something and I think it's worth digging
deeper into what's going on before you commit something. No, don't
commit the SLIST change until you absolutely, positively know why it's
actually being helpful. It's way way too voodoo right now.


OK, I'll dig it more.

--
Alexander Motin


Re: expanding amd64 past the 1TB limit

2013-06-28 Thread Konstantin Belousov
On Wed, Jun 26, 2013 at 10:11:44AM -0600, Chris Torek wrote:
 diff --git a/conf/options.amd64 b/conf/options.amd64
 index 90348b7..f3ce505 100644
 --- a/conf/options.amd64
 +++ b/conf/options.amd64
 @@ -1,6 +1,7 @@
  # $FreeBSD$
  # Options specific to AMD64 platform kernels
  
 +AMD64_HUGE   opt_global.h

Is this option needed?  The SMAP is already parsed at the time of the
pmap_bootstrap() call, so you could determine the amount of physical
memory and size the KVA map accordingly?




Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Adrian Chadd
On 28 June 2013 09:18,  m...@freebsd.org wrote:

 You can't make that assumption. I bet that if both pointers are in the
 _same_ cache line, the overhead of maintaining a double linked list is
 trivial.


 No, it's not.  A singly-linked SLIST only needs to modify the head of the
 list and the current element.  A doubly-linked LIST needs to modify both the
 head as well as the old first element, which may not be in cache (and may
 not be in the same TLB, either).  I don't recall exactly what [S]TAILQ
 touches, but the doubly-linked version still has to modify more entries that
 are not contiguous.

Good point.



-adrian


Re: expanding amd64 past the 1TB limit

2013-06-28 Thread Chris Torek
[combining two messages and adding kib and alc to cc per Oliver Pinter]

 the CPU's CR4 on entry to the kernel.
It is %cr3.

Oops, well, easily fixed. :-)

 (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E
 entries into the new pmap, the L3 pages would not have to be
 physically contiguous, but the KVA ones would still all have to
 exist.  It's free to allocate physically contiguous pages here
 anyway though.)
I do not see how the physical contiguity of the allocated page table
pages is relevant there.

Not in create_pagetables(), no, but later in pmap_pinit(), which has
loops to set pmap->pm_pml4[x] for the kernel and the direct map.  And:

Copying the L4 or L3 PTEs would cause serious complications.

Perhaps what I wrote was a little fuzzy.  Here's the pmap_pinit()
code I was referring to, as modified (the original version has only
the second loop -- it assumes NKPML4E is always 1 so it just sets
pml4[KPML4I]):

pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pml4pg));

if ((pml4pg->flags & PG_ZERO) == 0)
        pagezero(pmap->pm_pml4);

for (i = 0; i < NKPML4E; i++) {
        pmap->pm_pml4[KPML4BASE + i] = (KPDPphys + (i << PAGE_SHIFT)) |
            PG_RW | PG_V | PG_U;
}
for (i = 0; i < NDMPML4E; i++) {
        pmap->pm_pml4[DMPML4I + i] = (DMPDPphys + (i << PAGE_SHIFT)) |
            PG_RW | PG_V | PG_U;
}

/* install self-referential address mapping entry(s) */

These require that KPDPphys and DMPDPphys both point to the first
of n physically-contiguous pages.  But suppose we did this (this
is deliberately simple for illustration, and furthermore I am
assuming here that vmspace0 never acquires any user-level L4
entries):

pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pml4pg));

/* Clear any junk and wire in kernel global address entries. */
bcopy(vmspace0.vm_pmap.pm_pml4, pmap->pm_pml4, NBPG);

/* install self-referential address mapping entry(s) */

Now whatever we set up in create_pagetables() is simply copied to
new (user) pmaps, so we could go totally wild if we wanted. :-)

 So, the last NKPML4E slots in KPML4phys point to the following
 page tables, which use all of L3, L2, and L1 style PTEs.  (Note
 that we did not need any L1 PTEs for the direct map, which always
 uses 2MB or 1GB super-pages.)
This is not quite true. In the initial state, indeed all PTEs for the direct
map are superpages, either 1G or 2M. But Intel states that a situation where
a physical page has mappings with different caching modes causes undefined
behaviour. As a result, if a page is remapped with a non-write-back caching
attribute, the direct map has to demote the superpage and adjust the mapping
attribute of the page frame for the page.

Yes, this particular bit of description was restricted to the setup
work in create_pagetables().

(Perhaps I should take out "always", or substitute "initially"?)

Also, I think I left out a description of the loop where some KPDphys
entries are overwritten with 2MB mappings.

 +AMD64_HUGE  opt_global.h

Is this option needed?  The SMAP is already parsed at the time of the
pmap_bootstrap() call, so you could determine the amount of physical
memory and size the KVA map accordingly?

Mostly I was afraid of the consequences on VM_MIN_KERNEL_ADDRESS,
which is #included so widely, and any complaints people might have
about:

  - wasting NKPML4E-2 (i.e., 14) pages on small AMD64 systems (for
the new empty L3 pages in KPDphys that will likely not be used);

  - wasting yet another page because dynamic memory will start
at the first new L3 page (via KPML4BASE) instead of just using
the KPML4I'th one because VM_MIN_KERNEL_ADDRESS is now at -8TB
instead of -.5TB -- with VM_MIN_KERNEL_ADDRESS at -.5TB, all
KVAs use the single KPML4I'th slot;

  - wasting 30 more pages because NDMPML4E grew from 2 to 32; and

  - adding a loop to set up NKPML4E entries in every pmap, instead
of the single "shove KPDphys into one slot" code that used to
be there, and making the pmap_pinit loop run 32 times instead
of just 2 for the direct map.

Adding these up, the option chews up 45 pages, or 180 kbytes, when
compared to the current setup (1 TB direct map, .5 TB kernel VM).
180 kbytes is pretty trivial if you're planning to have a couple
of terabytes of RAM, but on a tiny machine ... of course if it's
that tiny you could run as i386, in 32 bit mode. :-)

If we copied the kernel's L4 table to new pmaps -- or even just
put in a new ndmpdpphys variable -- we could avoid allocating
any pages for DMPDPphys that we know won't actually be used.  That
would fix the 30 extra pages above, and even regain one page on
many amd64 setups (those with <= 512 GB).  We'd be down to just 14
extra pages = 56 kbytes, and the new loop in pmap_pinit().  Here's
prototype code for sizing DMPDPphys, for illustration:

old:DMPDPphys = allocpages(firstaddr, 

Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Alexander Motin

On 28.06.2013 09:57, Konstantin Belousov wrote:

On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:

While doing some profiles of GEOM/CAM IOPS scalability, on some test
patterns I've noticed serious congestion with spinning on the global
pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is
already very simple, I've tried to optimize probably the only thing
possible there: switch bswlist from a TAILQ to an SLIST. As I can see,
the b_freelist field of struct buf is really used as a TAILQ in some
other places, so I've just added another SLIST_ENTRY field. And the
result appeared to be surprising -- I can no longer reproduce the issue
at all. Maybe it was just unlucky synchronization of a specific test,
but I've seen it on two different systems and rechecked the results
with/without the patch three times.

This is too unbelievable.  Could it be, e.g., some cache line conflicts
which cause the thrashing, in fact?


I think it indeed may be cache thrashing. I've made some profiling for 
getpbuf()/relpbuf() and found interesting results. With the patched kernel 
using SLIST, profiling shows mostly one point of RESOURCE_STALLS.ANY in 
relpbuf() -- the first lock acquisition causes 78% of them. Later memory 
accesses, including the lock release, hit the same cache line and are 
almost free. With the clean kernel using TAILQ I see RESOURCE_STALLS.ANY 
spread almost equally between the lock acquisition, the bswlist access and 
the lock release. It looks like the cache line is constantly invalidated 
by something.


My guess was that the patch somehow changed cache line sharing. But several 
checks with nm showed that, while the memory allocation indeed changed 
slightly, in both cases the content of the cache line in question is 
absolutely the same, just shifted in memory by 128 bytes.


I guess the cache line could be thrashed by threads doing adaptive 
spinning on the lock after a collision happened. That thrashing increases 
the lock hold time and further increases the chance of additional 
collisions. Maybe the switch from TAILQ to SLIST slightly reduces the lock 
hold time, reducing the chance of a cumulative effect. The difference is 
not big, but in this test this global lock is acquired 1.5M times per 
second by 256 threads on 24 CPUs (12xL2 and 2xL3 caches).


Another guess was that we have some bad case of false cache line 
sharing, but I don't know how that can be either checked or avoided.


At the last moment, mostly for luck, I've tried to switch pbuf_mtx from 
mtx to mtx_padalign on the clean kernel. To my surprise that also seems 
to have fixed the congestion problem, but I can't explain why. 
RESOURCE_STALLS.ANY still shows there is cache thrashing, but the lock 
spinning has gone.
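For reference, that experiment amounts to something like the sketch below.
struct mtx_padalign is the cache-line padded/aligned variant of struct mtx
and is meant to be used through the normal mtx_* KPI; the lock name string
here is only illustrative:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Padded and aligned to CACHE_LINE_SIZE, so the lock word no longer
 * shares its cache line with whatever global is placed next to it
 * (e.g. the bswlist head).
 */
static struct mtx_padalign pbuf_mtx;

static void
pbuf_lock_init(void)
{
        mtx_init(&pbuf_mtx, "pbuf mutex", NULL, MTX_DEF);
}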


Any ideas about what is going on there?

--
Alexander Motin


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Adrian Chadd
On 28 June 2013 15:15, Alexander Motin m...@freebsd.org wrote:

 I think it indeed may be cache thrashing. I've made some profiling for
 getpbuf()/relpbuf() and found interesting results. With the patched kernel
 using SLIST, profiling shows mostly one point of RESOURCE_STALLS.ANY in
 relpbuf() -- the first lock acquisition causes 78% of them. Later memory
 accesses, including the lock release, hit the same cache line and are almost
 free. With the clean kernel using TAILQ I see RESOURCE_STALLS.ANY spread
 almost equally between the lock acquisition, the bswlist access and the lock
 release. It looks like the cache line is constantly invalidated by something.

Can you narrow down the resource stall check to each of the sub-types?
See which one/ones it is?


-adrian


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Konstantin Belousov
On Sat, Jun 29, 2013 at 01:15:19AM +0300, Alexander Motin wrote:
 On 28.06.2013 09:57, Konstantin Belousov wrote:
  On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
  While doing some profiles of GEOM/CAM IOPS scalability, on some test
  patterns I've noticed serious congestion with spinning on the global
  pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is
  already very simple, I've tried to optimize probably the only thing
  possible there: switch bswlist from a TAILQ to an SLIST. As I can see,
  the b_freelist field of struct buf is really used as a TAILQ in some
  other places, so I've just added another SLIST_ENTRY field. And the
  result appeared to be surprising -- I can no longer reproduce the issue
  at all. Maybe it was just unlucky synchronization of a specific test,
  but I've seen it on two different systems and rechecked the results
  with/without the patch three times.
  This is too unbelievable.  Could it be, e.g., some cache line conflicts
  which cause the thrashing, in fact?
 
 I think it indeed may be cache thrashing. I've made some profiling for
 getpbuf()/relpbuf() and found interesting results. With the patched kernel
 using SLIST, profiling shows mostly one point of RESOURCE_STALLS.ANY in
 relpbuf() -- the first lock acquisition causes 78% of them. Later memory
 accesses, including the lock release, hit the same cache line and are almost
 free. With the clean kernel using TAILQ I see RESOURCE_STALLS.ANY spread
 almost equally between the lock acquisition, the bswlist access and the lock
 release. It looks like the cache line is constantly invalidated by something.
 
 My guess was that the patch somehow changed cache line sharing. But several
 checks with nm showed that, while the memory allocation indeed changed
 slightly, in both cases the content of the cache line in question is
 absolutely the same, just shifted in memory by 128 bytes.
 
 I guess the cache line could be thrashed by threads doing adaptive
 spinning on the lock after a collision happened. That thrashing increases
 the lock hold time and further increases the chance of additional
 collisions. Maybe the switch from TAILQ to SLIST slightly reduces the lock
 hold time, reducing the chance of a cumulative effect. The difference is
 not big, but in this test this global lock is acquired 1.5M times per
 second by 256 threads on 24 CPUs (12xL2 and 2xL3 caches).
 
 Another guess was that we have some bad case of false cache line 
 sharing, but I don't know how that can be either checked or avoided.
 
 At the last moment, mostly for luck, I've tried to switch pbuf_mtx from
 mtx to mtx_padalign on the clean kernel. To my surprise that also seems
 to have fixed the congestion problem, but I can't explain why.
 RESOURCE_STALLS.ANY still shows there is cache thrashing, but the lock
 spinning has gone.
 
 Any ideas about what is going on there?

FWIW, Jeff just changed the pbuf_mtx allocation to use padalign; it is a
somewhat unrelated change in r252330.

Are pbuf_mtx and bswlist located next to each other in your kernel?

If yes, then I would expect that the explanation is how the MESI
protocol and atomics work. When performing a locked op, the CPU takes the
whole cache line into exclusive ownership. Since our locks try the
cmpset as the first operation, and then run an 'adaptive' loop interleaving
the cmpset and a check for the ownership, false cache line sharing between
pbuf_mtx and bswlist should result in exactly such effects.  Different
cores would bounce the ownership of the cache line, slowing down
the accesses.
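A grossly simplified picture of that acquire path (an illustration only,
not the kern_mutex.c code): both the initial cmpset and every retry take
the whole cache line exclusive on the spinning core, so a list head that
shares the line keeps bouncing between CPUs:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <machine/atomic.h>
#include <machine/cpu.h>

static void
spin_acquire_sketch(struct mtx *m)
{
        uintptr_t tid = (uintptr_t)curthread;

        for (;;) {
                /* Locked cmpxchg: pulls the cache line in exclusive state. */
                if (atomic_cmpset_acq_ptr(&m->mtx_lock, MTX_UNOWNED, tid))
                        return;
                /*
                 * Spin while owned; even the plain read keeps dragging
                 * the line back to this core in shared state.
                 */
                while (m->mtx_lock != MTX_UNOWNED)
                        cpu_spinwait();
        }
}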

AFAIR Intel exports some MESI events as performance counters, but I was
not able to find a reference in the architecturally defined events. They
seem to be present in the model-specific part.

