Re: [osol-code] sfmmu spinning

Gavin Maltby Fri, 04 May 2007 03:19:22 -0700

Hi Peter,

On 05/02/07 13:37, Peter Tribble wrote:

There's this interesting comment about line 3235 of hat_sfmmu.c


            /*
             * Hblk_hmecnt and hblk_vcnt could be non zero
             * since hblk_unload() does not gurantee that.
             *
             * XXX - this could cause tteload() to spin
             * where sfmmu_shadow_hcleanup() is called.
               */

(Seehttp://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/sfmmu/vm/hat_sfmmu.c#3235)


I have a system (V440, running S10U3) that has an unkillable process
consuming a whole cpu that appears to be doing just this. At least,
dtrace tells me the stack traces are all around:

             unix`sfmmu_free_hblks+0x34
             unix`sfmmu_shadow_hcleanup+0x58
             unix`sfmmu_free_hblks+0x1b0
             unix`sfmmu_shadow_hcleanup+0x58
             unix`sfmmu_tteload_find_hmeblk+0x1f0
             unix`sfmmu_tteload_array+0x40
             unix`hat_memload_array+0x184
             genunix`segvn_fault_vnodepages+0x116c
             genunix`segvn_fault+0x3f0
             genunix`as_fault+0x4c8
             unix`pagefault+0xac
             unix`trap+0xd44
             unix`utl0+0x4c


Were you able to determine whether we're looping within that stack,
and from which level?  Or are we taking repeated pagefaults on that
thread?

OK. How do I confirm the nature of the problem? Is there anything
I can do about it? How can I get this fixed?

(Yes, I am going to log a support call - in fact I already have one open
for this system, and this problem might be realted. But I am interested
in learning more and digging a little deeper.)


This part of the sfmmu code is exceptionally complex (sfmmu is complex at
its best!) and steeped in history.  I don't recommend digging too deep unless
you're into pain!  For those who are ...

Since this is a user trap we're dealing with the user hme hash.  It's arranged
the same way as the kernel one, but sized differently.  These hashes keep
track of all established mappings to physical memory.  Given an address space
(hatid), virtual address and mapped page size (8K, 64K, ...) we first compute
a hash to select a hash bucket; in this hashing 8K and 64K pages are
actually not distinguished, since 8K pages are managed in groups of 8
for a span of 64K.  The entries in the bucket are "hme blocks" which
house either 8 x 8K "hm ents" or 1 x large page "hm ent".  Within the hment
is included the TTE (translation table entry) to be loaded into the TLB
during missing handling.  The stack above is establishing a new mapping
entry (pagefault) and inserting it into the hash.

So the above works when you have a (hatid, address, mapping-size) tuple
to lookup in the hash.  But there is also the question of mapping all
the individual mappings of one physical page (ie the same physical page
mapped into multiple address spaces, eg libc pages).  The list of
mappings of a page is anchored in the page_t for that physical page,
and is a doubly-linked list of hments already embedded in the hmeblks
described above.  Some operations, such as pageunload, need to run
across the list of all mappings of a page and operate on them (eg
unload the mapping).

I picture the hash as a grid array of hash buckets across the columns heads
at the top, chains of hmeblks running vertically down the columns from those
buckets each with 1 or 8 hments, page_t's at the beginning of each row
anchoring the p_mapping list and horizontal strings of hments anchored
off of each page_t and tieing hments together.

Now the locking of this hash is a nightmare, because it so performance-
sensitive.  During insert/remove etc of hmeblks on a bucket we hold the bucket 
lock.
However during p_mapping list traversals we do not acquire the bucket lock
of each hash chain as we traverse over it.  Instead those operations
on strings of hments work out for each hment what is the associated
hmeblk, and then perform atomic decrement operations when invalidating
a TTE (on hblk_hmecnt and hme_vcnt mentioned in the code snippet above).
Thus the hash bucket lock does not protect those members of
hmeblks on that bucket, and code manipulating those chains needs
to detect cases in which it has raced with a pageunload operation.
This is the joy of the convoluted code in sfmmu_hblk_unload.

In the hash there are also "shadow hmeblks".  These do not contain
hments, but instead contain a bitmask describing which VA ranges at
the next page size step down include valid mappings.  So a 4MB
shadow block will have a bit for each 64K chunk of addresses
within that 4MB range, and if the bit is set it means that
somewhere in that range is a valid mapping.  This is used
in operations such as when unmapping a large range of
virtual addresses for a process - stepping through the
range 8K at a time and looking up each mapping takes forever,
especially if large chunks of the range aren't actually mapped.

The hash management is often lazy/delayed.  For instance when we
invalidate the last hment of an hmeblk we don't always remove the
hmeblk from the chain, we leave that for the next chain traversal
to do.  So there is a fair amount of dead wood present, being cleaned
up as others traverse the hash.  So when you lookup an hmeblk
it may not actually truly be in use, or it may previously have
been used as an 64K block and now intended for use as an 8 x 8K
block (they use the same hash value for 8K/64K), or it may have
been used as a shadow block in the past and needs rehabilitation
before regular reuse.

So in your stack sfmmu_tteload_array is adding a new mapping entry.
It has locked the hash bucket on which this mapping will
reside (for the given as, addr, size) and calls to sfmmu_tteload_find_hmeblk
to either find an existing hmeblk for this address space and VA range
on this bucket, or to allocate a new one if not already present.
In your case sfmmu_tteload_find_hmeblk found a match already present
on the hash chain.  In the common case this means we're adding
a further 8K mapping to one of the 8 entries in an 8 x 8K block.
But we may also have found some dead wood that we need to
rehabilitate - a block previously used at a different mapping
size (of 8k/64k) which we should free and then begin our search
again, or a block previously used as a shadow hmeblk and which
still has shadow blocks for smaller shadows below it which again
we must cleanup and then start again.  That's what has happened
in your stack, since we have called sfmmu_shadow_hcleanup.

sfmmu_shadow_hcleanup drops the bucket lock and calls sfmmu_free_hblks
to free all hblks for this as for the address range covered by the
initial hblk.  It returns to sfmmu_tteload_find_hmeblk after
reacquiring the bucket lock, and the find code starts
the search afresh (and it could still get a hit since
another thread could have established the mapping by now).

Finally we get to sfmmu_free_hblks, which can recurse into
sfmmu_shadow_hcleanup for sub-shadow blocks of the original.
This steps over the address range in chunks of the mapping
size and cleans up any possible hblks in that range;
for each such sub-chunk it must lock the associated hash
and traverse it looking for hmeblks with mappings for
this as at the indicated page size and for the given
VA chunk.

Now it is possible that the hmeblk has only just recently
been unloaded by the previous user (eg process just exiting/exited).
Since sfmmu_hblk_unload contends with sfmmu_pageunload as described
above, there are cases where sfmmu_hblk_unload returns having left
some detected collisions with pageunload to complete their work,
ie the hblk will be freeable real soon now - as soon as the
pageunload completes and atomically decrements the hblk_vcnt and hblk_hmecnt
to 0.  But the pageunload may have been interrupted or whatever, so we
can't be sure exactly when it will complete its operation.
So there is a small chance that back in sfmmu_shadow_hcleanup we could
still see the hblk_cnt and hblk_hmecnt non-zero, which is the comment
you have highlighted.  In that case we do not remove the block from
the hash chain, and proceed on to the next chunk of addresses.
When we unwind back to the sfmmu_tteload_find_hmeblk it could
find this same block again and cause us to descend through the
whole process again.  This won't happen indefinitely since
sooner or later the pageunload will be unpinned and will
decrement the counters allowing us to free the block.

That's how it should work.  If you are spinning in that
stack at and below sfmmu_tteload_find_hmeblk then chances
are that some mismanagement of the hmecnt/vcnt has happened.
This may be verified through some post-mortem crash
analysis.  Some time back I logged some potentially
relevant bugs here and tested fixes to them, but the
code was never changed.  These are

4854805 dead code in sfmmu_hblk_unload()
4854825 sfmmu_hblk_unload() inconsistent in sync'ing with pageunload

Here are my comments from the first (I'm xxxx):

----- start CR quote -----
[EMAIL PROTECTED] 2003-04-28

sfmmu_hblk_unload() contains a case that is impossible - it can
be removed and make things clearer.  See comments for details.

Comments:

[EMAIL PROTECTED] 2003-04-28

In sfmmu_hblk_unload:

        while (addr < endaddr) {
                pml = NULL;
again:
--> A                sfmmu_copytte(&sfhmep->hme_tte, &tte);
--> B                if (TTE_IS_VALID(&tte)) {
--> C                        pp = sfhmep->hme_page;
                        if (pp && pml == NULL) {
--> D                                pml = sfmmu_mlist_enter(pp);
                        }

                        /*
                         * Verify if hme still points to 'pp' now that
                         * we have p_mapping lock.
                         */
--> E                        if (sfhmep->hme_page != pp) {
--> F                                if (pp != NULL && sfhmep->hme_page != 
NULL) {
                                        if (pml) {
                                                sfmmu_mlist_exit(pml);
                                        }
                                        /* Re-start this iteration. */
                                        continue;
                                }
                                ASSERT((pp != NULL) &&
                                    (sfhmep->hme_page == NULL));
                                goto tte_unloaded;
                        }

Suppose the stores of an sfmmu_pageunload all become visible between
C & D.  So our local copy of the tte was valid at B and we still saw
hme_page non-NULL at C and therefore grabbed the relevant p_mapping
mutex at D (at which point all stores of the pageunload would have
to now be visible to us).

Knowing we race with pageunload we recheck hme_page at E.  Now the
only possible change for hme_page is from non-NULL to NULL ... in the
hme_sub of the pageunload we raced with.  It is impossible for hme_page
to change to a different and still non-NULL value as we hold the hashbucket
lock which would be required to do that.

Hence the condition at F will always be false ... if we first saw the
hme_page non-NULL and it has changed it *must* now be NULL and so the
second component sfhmep->hme_page != NULL will always fail.

We therefore never take the "restart this iteration" path but always
goto tte_unloaded.

----- end CR -----

That bug really just describes the code, there is no defect as such.

Here's the second one:

----- start CR quote -----
Description:

[EMAIL PROTECTED] 2003-04-28

In on10 a new case in hblk_unload() synchronizes with pageunload
if we find we're racing with it (via the p_mapping mutex for the page).
But there is another path through hblk_unload() where we have also detected
a race with pageunload where we do not syncronise with pageunload.

Comments:

[EMAIL PROTECTED] 2003-04-28

An outline of hblk_unload for a single hment:

        while (addr < endaddr) {
                pml = NULL;
                sfmmu_copytte(&sfhmep->hme_tte, &tte);
--> A                if (TTE_IS_VALID(&tte)) {
                        if (pp && pml == NULL) {
                                pml = sfmmu_mlist_enter(pp);
                        }

                        /*
                         * Verify if hme still points to 'pp' now that
                         * we have p_mapping lock.
                         */
                        if (sfhmep->hme_page != pp) {
                                if (pp != NULL && sfhmep->hme_page != NULL) {
                                        if (pml) {
                                                sfmmu_mlist_exit(pml);
                                        }
                                        /* Re-start this iteration. */
                                        continue;
                                }
                                ASSERT((pp != NULL) &&
                                    (sfhmep->hme_page == NULL));>>> 1           
            goto tte_unloaded;
                        }
                ...     
                ...
                        ttemod = tte;
                        TTE_SET_INVALID(&ttemod);
                        ret = sfmmu_modifytte_try(&tte, 
&ttemod,&sfhmep->hme_tte);

                        if (ret <= 0) {
                                if (TTE_IS_VALID(&tte)) {
                                        goto again;
                                } else {
                                        /*
                                         * We read in a valid pte, but it
                                         * is unloaded by page_unload.
                                         * hme_page has become NULL and
                                         * we hold no p_mapping lock.
                                         */
                                        ASSERT(pp == NULL && pml == NULL);>>> 2 
                               goto tte_unloaded;
                                }
                        }


--> B                } else if ((pp = sfhmep->hme_page) != NULL) {
                                /*
                                 * Tte is invalid but the hme
                                 * still exists. let pageunload
                                 * complete its job.
                                 */
                                ASSERT(pml == NULL);
                                pml = sfmmu_mlist_enter(pp);
                                if (sfhmep->hme_page != NULL) {
                                        sfmmu_mlist_exit(pml);
                                        pml = NULL;
                                        goto again;
                                }
                                ASSERT(sfhmep->hme_page == NULL);
--> C                }  else if (hmeblkp->hblk_hmecnt != 0) {
                        /*
                         * pageunload may have not finished decrementing
                         * hblk_vcnt and hblk_hmecnt. Find page_t if any and
                         * wait for pageunload to finish. Rely on pageunload
                         * to decrement hblk_hmecnt after hblk_vcnt.
                         */
                        pfn_t pfn = TTE_TO_TTEPFN(&tte);
                        ASSERT(pml == NULL);
                        if (pf_is_memory(pfn)) {
                                pp = (machpage_t *)page_numtopp_nolock(pfn);
                                if (pp != NULL) {
                                        pml = sfmmu_mlist_enter(pp);
                                        sfmmu_mlist_exit(pml);
                                        pml = NULL;
                                }
                        }
                }
tte_unloaded:
                        if (pml) {
                        sfmmu_mlist_exit(pml);
                }

                addr += TTEBYTES(ttesz);
                sfhmep++;
                DEMAP_RANGE_NEXTPG(dmrp, 0);
        }

As explained in recently logged bug 4854805, the "restart this iteration"
case in block A can never be reached.  That may explain the following flaw.

The most common route to block C is:

        - pageunload at work and racing with us, some of it's stores already
          visible at the point we copy the tte above.
        - so our copy of the tte is seen as already invalidated at A
        - at B imagine that hme_page is NULL (pageunload invalidates the tte,
          then calls hme_sub later which will clear hme_page so this is
          possible given TSO and/or the membar_stst in hme_sub)
        - at C it is still very possible/probable that the hmecnt is nonzero.
          It's possible even if we've raced with pageunload on what was the
          last valid hment that appeared on a p_mapping list.  But it's
          much more likely that we raced with pageunload when there were
          still several hments on p_mapping lists.

The code in block C seems concerned that all stores of the pageunload that we
raced with are now visible (if we can see the last one we can see them all in
TSO).  So it forces this by grabbing and immediately releasing the
p_mapping lock for the pp reconstructed from the tte (provided it is a
memory page).

If we detect the race with pageunload at point >>> 1 above (the continue
case is impossible) we have fully synchronised with pageunload because
this is a case where pp was non NULL and we have performed a p_mapping lock
enter.

But we can also detect the race with pageunload at point >>> 2 above.
This is the case where we read pp as NULL so didn't enter the mutex
(the hme_page could have been null because this hment was for an IO page).
We only notice the race we we try to invalidate the tte ourselves and
find somebody has already done it (>>> 2) and we goto tte_unloaded.

The bug is that this goto skips over the hmecnt check.

[EMAIL PROTECTED] 2003-04-28

OK, a bit more background read (deltas 1.114 and 1.191, bugs
1228622 and 1231320 are good) tells us why we sync in B above
and I believe it the same reason we sync at C above.  I think
the extra case C is protecting against a similar reuse of hblks
while a pageunload is not quite finished.  In this case the above
is defintely a bug as it's not protecting all cases.

[EMAIL PROTECTED] 2003-04-29

Continuing the monologue I believe it is the following that this new
case is protecting against:

static struct hme_blk *
sfmmu_tteload_find_hmeblk(sfmmu_t *sfmmup, struct hmehash_bucket *hmebp,
        caddr_t vaddr, uint_t size)
{
        hmeblk_tag hblktag;
        int hmeshift;
        struct hme_blk *hmeblkp, *pr_hblk;
        uint64_t hblkpa, prevpa;

        hblktag.htag_id = sfmmup;
        hmeshift = HME_HASH_SHIFT(size);
        hblktag.htag_bspage = HME_HASH_BSPAGE(vaddr, hmeshift);
        hblktag.htag_rehash = HME_HASH_REHASH(size);

ttearray_realloc:

        HME_HASH_SEARCH_PREV(hmebp, hblktag, hmeblkp, hblkpa, pr_hblk, prevpa);
        if (hmeblkp == NULL) {
                hmeblkp = sfmmu_hblk_alloc(sfmmup, vaddr, hmebp, size, hblktag);
        } else {
                /*
                 * It is possible for 8k and 64k hblks to collide since they
                 * have the same rehash value. This is because we
                 * lazily free hblks and 8K/64K blks could be lingering.
                 * If we find size mismatch we free the block and & try again.
                 */
                if (get_hblk_ttesz(hmeblkp) != size) {
                        ASSERT(!hmeblkp->hblk_vcnt);
                        ASSERT(!hmeblkp->hblk_hmecnt);
                        sfmmu_hblk_hash_rm(hmebp, hmeblkp, prevpa, pr_hblk);
                        sfmmu_hblk_free(hmebp, hmeblkp, hblkpa);
                        goto ttearray_realloc;
                }
        ...
}

Imagine this process had just hblk_unloaded a 8K page, and then maps
again at that same virtual address (or close enough to with an hblk span)
for a 64K page.  Imagine that the hblk_unload returned with one of the
hments yet tobe completed by a pageunload - the final hmecnt atomic decrement
still outstanding.  As such the caller of hblk unload would not have
sfmmu_hblk_free'd the hblk (hmecnt still nonzero) and would have
left it on the hash to be lazily freed later.

The new tteload has hashed to the same bucket and when we walk that
bucket above in HME_HASH_SEARCH_PREV we'll find the hmeblk to which
pageunload still has an outstanding decrement.  But we unceremoniously
remove that hblk from the hash and free it.  We could immediately reuse
it - possibly even at the sfmmu_hblk_alloc above when we goto ttearray_realloc
(this is likely if it was a "nucleus" hmeblk that we free).  After we
start using it, the pageunload is finally unpinned and decrement the
hmecnt - disaster looms.

That origin may make sense with this having arrived in the MPSS project.
----- end of CR -----

It all made a lot more sense to me then.

Cheers

Gavin
_______________________________________________
opensolaris-code mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code

Re: [osol-code] sfmmu spinning

Reply via email to