Re: New SMP Mbuf Allocator (PATCH and TESTING request)

Bosko Milekic Thu, 14 Jun 2001 09:43:55 -0700

Hi Terry,

On Thu, Jun 14, 2001 at 01:03:13AM -0700, Terry Lambert wrote:
> 
> A general comment, and then a comment on the patch:
> 
> I would like to see this code be optional, until such time
> as benchmarks have proven it to be better than the existing
> code, or have proben it to be worse.  Diking out the existing
> code, and forcing it to be diked out by making unnecessary
> function name changes seems to me to be a problem.

        This is going to be a very difficult thing to do. Hear me out. :-)

        The present code has been designed without any SMP in mind and without
any concept of allowing for the possibility to have pages that were once
wired-down and put to use by the mbuf subsystem to being freed when no longer
used so that memory could be reclaimed. The new code attempts to offer both,
but as with almost anything, sacrifices must be made. I'm hesitant to accept
your `benchmark this' offer merely because it's difficult to accurately
quantify overall performance while it will be easy for the benchmark to detect
that the new code is in some ways slower than the old (because of additional
complexity) and therefore rule: old allocator > new allocator, which is false
given the ignorance of most benchmarks. I'll try my best to circle some of
the Pros and Cons of each allocation technique and try to give you an
overall picture of the tradeoffs and gains made.

 - The old allocator has one lock to protect all the free lists. As the number
   of CPUs grows, so does contention and, consequently, the mbuf allocator
   may become a significant bottleneck when several threads are servicing
   net-related requests. The new allocator has a lock per CPU thus allowing
   for more than one CPU to allocate mbufs and clusters at any one time,
   ultimately splitting contention in most cases. Additionally, the old
   allocator, in an SMP system, will likely have the lock (and list structure,
   etc.) ping-ponging from data cache to data cache whereas the new allocator,
   because of the separate lists and locks will minimize this effect.

 - The old allocator fragments memory and can never reclaim it and never
   easily determine if all the objects in one given page of data are free
   therefore if we were to ever decide that yes, it's now time to have the
   allocator free some pages back to the map, we won't be able to easily
   implement it (if at all). The new allocator has the "framework" for this
   type of thing so that if users continue requesting (as many have already
   done) to have FreeBSD free up resources it allocates for mbufs and clusters,
   we should be able to relatively easily implement freeing from a kproc, for
   example, and only do it, for example, when X number of bytes is in the pool
   but when only Y bytes are in use (where Y <<<< X). (Why from a kproc? Because
   freeing back to the map is relatively expensive for a thread that wants to
   merely free up an mbuf or cluster - this I *have* determined from previous
   profiling - results are probably archived somewhere on the lists).

 - The old allocator's macros MGET, MCLGET, MFREE, etc. could get away with
   making an allocation without performing a single function call. A function
   call was done only by those using the function equivalents m_get(),
   m_gethdr(), etc. The new allocator performs one function call in the
   common case for allocation. I don't foresee this to be a real problem -
   if it's any argument at all: NetBSD && OpenBSD have both been doing this
   for quite a long time (one function call per allocation) and in fact so
   have we before this allocator was introduced to replace calls to malloc()
   _and_ also as a result, the new allocator has shrunk those macros to
   effectively nothing in terms of size thus improving overall performance
   (less code cache pollution) whereas the old allocator has, needless to
   say, pretty large macros.

 - The common case of allocation and freeing in the new allocator is
   larger (although realistically, not significantly) than the common case
   for the old allocator. This is due to the added complexity in the new
   allocator and is expected. The increase in size of the common case is
   basically from `insignificant' to `insignificant + insignificant' when
   taking into account the fact that CPU speed is increasing incredibly
   quickly. Keep in mind that although it may take longer to execute the
   common case in the new allocator, the fact that there is less contention
   in the same common case may make it faster, overall.

 - The old allocator has all mbufs and clusters mixed in the same heap and
   there is no concept of `producer' and `consumer' in that although one
   thread may allocate an mbuf and write to it, and another thread may only
   free the mbuf, the second thread will, if the first thread is on another CPU,
   that CPU's data cache entries for the given mbuf because during freeing,
   the thread will write to the mbuf. Since in the new allocator mbufs are
   not written to when being freed and since they are always returned to the
   list from which they were obtained, cache invalidation is once again
   minimized.

  In conclusion, it is obvious that for strictly UP systems, the present
allocator may turn out to be faster. But, if you look at it that way, then
the mbuf allocator _BEFORE_ the whole SMPng thing started was EVEN better for
UP systems (no lock code, etc.) On the other hand, the elimination of huge
macros may even help improve overall performance even for UP systems.
  In the SMP case, the new allocator scales much much better.
  Both of these statements are obvious from the above points/comparisons.

  I hope this makes it clear that depending on the benchmark, we may see
completely different results. Keep in mind though that the new allocator also
offers the possibility of freeing back to the map, a significant advantage over
the old one, especially for servers with exceptionally high network loads (for
example, an IRC server may consume a lot of memory for mbufs and clusters at
one point, only then to drop to normal amounts later again. However, the fact
that the pages are never freed and are wired-down means that the system may
start swapping simply due to wastage on the mbuf and cluster free lists).

> Overall, I like the idea of moving towards a Dynix allocator
> type model to support many processors (not just 4) well, but
> I'm leery of many of the changes, particularly in light of
> the claim that "It has been established that we do not benefit
> from different locks for different objects, so we use the same
> lock, regardless of object type".

        Well, if you'll note that not long ago, there used to be three
different locks: one for mbufs, one for clusters, and one for counters.
However, with three separate locks and the fact that callers often do:
"get mbuf -> get counter -> get cluster" often resulted in one CPU doing:
"cache first lock -> cache second lock -> cache third lock" and then another
would have to invalidate the first CPUs cache three times (as opposed to
only once with one lock). As a result, ping-ponging from CPU to CPU to
CPU etc. often occured. With one lock, this is minimized. In the new allocator,
having separate locks is not really useful since we have one lock per CPU
anyway so that comment is not really an attempt at optimization but just
a statement of the obvious.
 
> On to the specific comments... I have addressed them in patch
> order, so that they can be read side-by-side:
> 
> --
> 
> The "npg" calculation has always been incorrect for large
> memory systems with non-sparse physical maps, and now it
> is much worse.  I commented on this before in -arch.
> 
> Consider the case of a machine with 4G of physical memory.

        Can you please clarify?
 
> --
> 
> Is the policy two line or 4 line license, these days?
> 
> --

        I don't know. How is this important?

> The MBALLOC_CPU stuff should be more dynamic; it appears
> that the real number of CPUs is known at the time the
> allocations take place.
> 
> --

        Yes, but some structures (notably statistics and the list manager
structures) are statically allocated (the stats are exported via sysctl and
need this) and so we need to know how many CPUs there are before compilation.
It doesn't really matter if we don't, though, because then we'll just end
up allocating slightly more than we need (we use MAXCPU in the default case)
and we'll be on with it. Note that this does *not* mean that we'll also end
up setting up MAXCPU CPU lists, as that setup is done at runtime when we
know really how many CPUs we have.
 
> I think you should spin, and only block if you don't get
> the spin lock after several attempts.  The Solaris tradeoff
> in this regard is 10 attempts.  Direct blocking on the cv
> seems a little wasteful.

        In the common case, I shouldn't have to even block. These are per-CPU
locks and blocking only happens when a different CPU is freeing an mbuf to
a first CPUs list. I don't think I'm doing anything wrong with using the
standard locking primitives provided by the OS. Spinlocks disable local
interrupts and that seems more wasteful than blocking. I could be wrong,
but these things are easy to change.

> I'm not sure why you mutex the per-CPU containers?  Aren't
> they, by definition, per-CPU, and therefore immune from
> contention?

        I mutex them for two reasons:

 1- A thread on another CPU may be freeing to the first CPU if the mbuf it
    is freeing originated from the first CPU (this is a characteristic of
    the allocator model).

 2- Preemption. 

> --
> 
> You realize that there is a dependency on MAXFILES, right?
> 
> Specifically, there is an mbuf allocated per open socket
> for the tcptmpl (even though it only needs 60 bytes).
> 
> ---
> 
> If you fix the tcptmpl allocation, a couple of additional
> issues arise:
> 
> 1)    Since the allocations are in page-units, you will
>       end up with a lot of waste.
> 
> 2)    Allocation in page units will make it hard to give
>       60 byte allocations back to the system.
> 
> 3)    The allocator is not really general enough for this
>       (or for things like TIME_WAIT zombie structs, etc.).
> 
> 4)    Yeah, this sort of also implies that mbuf cluster
>       headers come from a seperate pool, instead of being
>       256 bytes as well...

        Huh? I haven't changed anything in terms of the sizes of allocations.
The allocator only allocates mbufs and clusters, which are of fixed size in
order to ease allocations of the same object in page units (it's much faster
and simpler). I realize that some wastage occurs but that's just a fact of
BSD (and has been forever). The new allocator doesn't make it worse.

> --
> 
> Why do you lock around things in mb_init()?  It's never
> called twice, let alone reentrantly... the code will be
> done before an AP is allowed to run.  Alpha has this same
> restriction, from what I can see from the code.
> 
> --

        Simply because mb_pop_cont() expects it and because it's a consistent
thing to do with respect to the internal interface.
 
> Dropping the lock in mb_pop_cont() seems wrong; I guess
> you are avoiding sleeping with the lock held?
> 
> I don't think you need to worry (per-CPU, again), unless
> you later attempt lazy repopulation; it seems to me that
> the lock that should be held is a global lock, to prevent
> reentrancy on the global memory pool, and the malloc()
> code does that.
> 
> Perhaps I'm missing something here.

        Yeah. There have been some discussions as to why this is presently
necessary. In the future, it may not be required. But, basically, right now,
we can have Giant -> mbuf_related_lock and, if I don't drop the lock, then
also mbuf_related_lock -> Giant, which would be a lock order reversal.

> --
> 
> The ability to specify a "how" that implies an unwillingness
> to block implies interrupt allocation; yet you can't support
> this, since you do not preallocate a kernel map (ala the zone
> allocator).  The zone allocator is actually very misleading,
> in that some of it's code never has the values it implies that
> it might have; in particular, there are functions which are
> passed parameters which can't really be variant, as implied by
> the use of per zone variables as arguments.

        Although I don't quite understand everything you mention in this last
paragraph, the `how' parameter implies more than just willingness to block
in the new allocator. It also determines whether or not, during starvation,
the system will call the protocol drain routines and whether or not it will
attempt to "steal" from other CPU lists. Feel free to clarify, though, if it's
still a concern. 

> --
> 
> Ah.  I see why all the locks: mb_alloc_wait attempts to be
> a primitive reclaimer.  You would do much better with a per-CPU
> high-watermark reclaim at free(), I think, instead of doing
> this.  Really, you are in a starvation condition because of
> your load, and not because someone is "hogging already free
> mbufs", I think.  This is probably premature optimization.
> 
> --
> 
> In the starvation case during a free, I don't think you
> really care, except perhaps to adjust the high/low watermarks.
> 
> An interesting issue here is that the average TCP window size
> will end up being 16k, which comes out to 4 1-page buckets (2,
> on Alpha), where you end up with multiple buckets and many
> mbufs (64) per, so freeing isn't really an option, if you are
> sending lots of data (serving content).  It would be different
> for a desktop system, since the receive interrupt might come
> to any idle CPU.
> 
> --
> 
> I like the cleanup via the use of local macros.

        Thanks. So do I. :-)

> --
> 
> I think the m_getclr -> m_get_clrd change is gratuitous.

        m_getclr() isn't used at many places so it's not that big of a deal.
The reason I changed it is because it's confusing: m_getclr() sounds like
"m_get a cluster."

> --
> 
> I like the "only" comment claifications.
> 
> --
> 
> I dislike the bloating of one line comments into three lines,
> with the first and last blank.

        It happens once or twice, as far as I could see (I could be missing
some). It's to maintain consistency. The one-line comment likely should have
been a three-line comment to begin with.

> --
> 
> You could resolve the statistics problems by keeping them on
> a per-CPU basis, and then agregating them under a general
> system lock, when necessary (when they were requested).
> 
> --
> 
> Most of the header definition changes for function declarations
> are gratuitous.

        It's a question of alignment and style fix (to properly align the
function declarations as per style(9), I'm allowed to insert a space for
functions that do not return pointers).
 
> --
> 
> Don't really care about netstat stuff...
> 
> --
>
> -- Terry


-- 
 Bosko Milekic
 [EMAIL PROTECTED]


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message
Re: New SMP Mbuf Allocator (PATCH and TESTING request)

Reply via email to