I'm working on a fix for this CR:

  6203568 accumulating number of threads behind rw_lock in
          ire_walk_ill_tables IRB_REFRELE

If you have comments, concerns, or ideas, please let me know -- either
directly or here on this mailing list.

The current design is that each object (ire, nce, ill, ipif) has a
hash table, indexed on kthread_t pointer.  The entries in that hash
table are arrays with a rotating set of event records and protected by
ire_lock, nce_lock, and ill_lock (both ill and ipif).

When something happens with an object, you just look in the hash table
based on curthread, allocate a new entry if none is there, and insert
the event.

On thread exit, though, we have to do something ugly.  Since we don't
know where the per-thread records are (they're all off on the
individual objects), we need to do a walk of every ire, nce, ill, and
ipif in the system, lock it, and walk through every event record on
the hash bucket to find the one to remove.  The work involved (in
terms of lock takes/drops and other effort) is something like:

        N_threads * N_objects * N_ipthreads * ( 1 / N_buckets )

... where N_buckets is 64, N_ipthreads is some unknown fraction of
N_threads, and both N_threads and N_objects can be quite large -- tens
or hundreds of thousands.  We're looking at roughly O(N^3) effort, if
we assume that big machines have lots of threads exiting and lots of
objects.  It's good that the optimization is on the frequent
operations, but bad that the "infrequent" ones are so common and so
disasterous.

After doing some testing, I think I've got a better design for this.
First, we set up some thread-specific-data for IP.  This per-thread
entry is a pointer to a modhash.  Since the pointer for each object
(ire, nce, ill, ipif) is unique, we don't need separate hashes.  A
single one (per thread actively using IP) will do the job.

Having a per-thread modhash means that we don't have to worry about a
global lock for the table.  This makes updates easy.

On thread exit, all that we need to do is destroy the modhash, and
this will call back and destroy the elements.  On destruction of the
elements, we verify that all of the implied locks (ref counts) are
gone and free up.

On object destruction, we need to walk the list of IP-using threads to
remove elements.  This is easier than walking all of the objects,
though, because we can link them into a per-IP-stack list, observing
that when an object is being destroyed, no other threads can be
accessing it anymore (it must have been quiesced), and we're in charge
of modhash allocation.

So, the new design involves a th_hash_t that looks like this:

typedef struct th_hash_s {
        list_node_t     thh_link;
        mod_hash_t      *thh_hash;
        ip_stack_t      *thh_ipst;
} th_hash_t;

This is the per-thread storage.  The rest of the structures stay
mostly the same, but the #ifdef XXX_DEBUG stuff and local arrays all
go away from each of the ill_t, ire_t, nce_t, and ipif_t structures.
The data required is kept external (in the hash entries).

Because this uses sys/list.h and modhash, debug becomes much easier.
I'll add an mdb walker that gives you all of the modhash addresses
(for use with the existing modhash debug code in mdb), and a
::th_thread dcmd that works like this:

  If given an address of an ill_t, ipif_t, ire_t, or nce_t, print the
  corresponding th_trace_t structure in detail.  Otherwise, if no address is
  given, then walk all IP stacks and summarize all th_trace_t structures
  with non-zero th_refcnt.
  
  Options:
        -n      display only entries with non-zero th_refcnt

Example usage (on a dump where I intentionally added a panic right
after doing IRE_REFHOLD):

> ::th_trace
         IPSTACK           OBJECT            TRACE   REFCNT           THREAD
     30002c3a000      30002fa1498      30004cd1000        0      2a10066bca0
     30002c3a000      30003d1a908      30004ccd000        0      2a10066bca0
     30002c3a000      300030210e8      30004757000        0      2a10066bca0
     30002c3a000      30003021428      3000473b000        0      2a10066bca0
[...]
> ::th_trace -n
         IPSTACK           OBJECT            TRACE   REFCNT           THREAD
     30002c3a000      30002fa1d68      3000ef83000        1      30003cf2d60
> 30002fa1d68::th_trace -n
Object 30002fa1d68 in IP stack 30002c3a000:
        Thread 30003cf2d60 refcnt 1:
          T+0:
                ire_trace_ref+0x10c
                ip_output_options+0x8e0
                ip_output_v6+0x280
                tcp_send_data+0x1ec
                tcp_rput_other+0x654
                squeue_enter+0x1ec
                putnext+0x3f4
                strput+0x1d8
                kstrputmsg+0x3d0
                sotpi_connect+0x854
                connect+0xc0
          T+0:
                ire_untrace_ref+0x90
                tcp_send_find_ire_ill+0x16c
                tcp_send_data+0x16c
                tcp_rput_other+0x654
                squeue_enter+0x1ec
                putnext+0x3f4
                strput+0x1d8
                kstrputmsg+0x3d0
[...]
> 30002fa1d68::whatis                 
30002fa1d68 is 30002fa1d68+0, bufctl 300029c1b98 allocated from ire_cache
> 

As you can see, this shows the trace entries in reverse time order
with the computed lbolt delta.  That first event at the top is the
holder of the reference that existed at panic time.

More details (including test results, showing that the new code is not
just simpler but faster, and webrevs and diffs) to follow.

-- 
James Carlson, Solaris Networking              <[EMAIL PROTECTED]>
Sun Microsystems / 1 Network Drive         71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
_______________________________________________
networking-discuss mailing list
[email protected]

Reply via email to