David Gwynne wrote:
> hrm. could you try it without your diff below and see if its still stable?

In light of what Fred has said, I double-checked this with fresh builds
of today's -CURRENT.

Firstly with only your patch and not mine, I can reliably reproduce
the crash in under 30 seconds by doing these things to the machine in
parallel:
  while ssh -vC sparc64 'true' ; do sleep 1 ; done
  ssh -vC sparc64 'ping -f 10.0.0.1'
  ssh -vC sparc64 'cd /usr/src/sys && cvs up -Pd'

That's with the current in-page header threshold (kern/subr_pool.c:261)
of:
        } else if (sizeof(struct pool_item_header) * 2 >= size) {
which I think is 224 bytes on this machine.

The behaviour of subr_pool.c prior to v1.149 was a threshold of
size < PAGE_SIZE/16, which is 512 bytes on sparc64 but 256 on many
other platforms.

I could still reproduce the bug with:
        } else if (256 >= size) {
which creates mbufpl having 32 * 256-byte items per hardware page.

But I could not reproduce the bug once I changed it to:
        } else if (256 > size) {
where mbufpl has 31 items per hardware page to make room for the in-page
header at the end.

So if it was tweaked to:
        } else if (sizeof(struct pool_item_header) * 4 >= size) {
it would probably hide the bug we're seeing, but I'd still like to
figure out what the problem is exactly.

> my theory is dc is (was?) sensitive to the layout of objects in
> memory, so by moving the pool page headers in or out of the item pages
> you're moving dc next to something that ends up causing the iommu to
> fault.

It is quite odd that *having* the in-page header avoids the bug.
(So it's not that something doesn't expect it to be there and overwrites
it, for example).

Whilst we don't know for sure that the crash is related to mbufpl, I
feel certain, given the above, that it involves one of the 256-byte
pools.  It likely only happens when Nout>31 or is close to a multiple
of 32, so here are some numbers from DDB right after reproducing the
crash:

                        crash#1 crash#2 crash#3
  syncache      Nout=   0 (so can be ruled out)
  mbufpl        Nout=   79      80      80
  bufpl         Nout=   1332    1372    1306
  vmsppl        Nout=   28      28      28
  dma256        Nout=   0 (so can be ruled out)

Then I got lucky and had a stack trace that actually implicated mbufs:

panic: psycho0: uncorrectable DMA error AFAR 6e868250 (pa=0 tte=0/60024012) AFSR 410000ff40800000
Stopped at      Debugger+0x8:   nop
   TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*13904  13904      0    0x100032          0    0  ping
psycho_ue(400008a3200, 62, 2, 400009c2050, 4000fb35700, 4000fb353e0) at psycho_ue+0x7c
intr_handler(e0017ec8, 400008a3300, 1bea4, 800, ffffffffffff9116, ffff) at intr_handler+0xc
sparc_interrupt(1400, 400090d7300, 1693968, 20, 0, 40009112fc0) at sparc_interrupt+0x298
pool_put(1896128, 400090d7300, a000001, 4000fb32000, 400090d7fac, 40009112fc0) at pool_put+0x30c
m_free(400090d7300, 9, 400090d7300, 0, 0, 40008ea26d0) at m_free+0x9c
m_freem(400090d7300, 400090d7300, 4000fb35b90, 0, 0, 0) at m_freem+0xc
sendit(0, 3, 0, 0, 4000fb35df0, 5e17c1e3aa4aec64) at sendit+0x2bc
sys_sendto(40008ea26d0, 4000fb35db0, 4000fb35df0, 35e528535b505347, 517bcf55b2212657, 14b) at sys_sendto+0x68
syscall(4000fb35ed0, 485, 33bd70a838, 33bd70a83c, 0, 0) at syscall+0x34c
softtrap(3, 33bdd49124, 40, 0, 33bdd5b164, 10) at softtrap+0x19c

Where 1896128 is mbufpl, and 400090d7300 is 4864 bytes / 19 items
into the current page:

ddb> show pool /p 1896128
POOL mbufpl: size 256 maxcolors 1
        alloc 0x180e820
        minitems 16, minpages 1, maxpages 128, npages 4
        itemsperpage 32, nitems 128, nout 80, hardlimit 4294967295

        nget 105828, nfail 0, nput 105748
        npagealloc 4, npagefree 0, hiwat 4, nidle 1

        empty page list:
                page 0x40008e7e000, color 0x40008e7e000, nmissing 0

        full page list:
                page 0x400091aa000, color 0x400091aa000, nmissing 32
                page 0x400090d8000, color 0x400090d8000, nmissing 32

        partial-page list:
                page 0x400090d6000, color 0x400090d6000, nmissing 16
        curpage 0x400090d6000

The mbuf header may not hold valid data anymore, but here it seemed to
have type MT_SONAME:

ddb> show mbuf 400090d7300
mbuf 0x400090d7300
m_type: 3       m_flags: 0
m_next: 0x0     m_nextpkt: 0x0
m_data: 0x400090d7320   m_len: 16
m_dat: 0x400090d7320    m_pktdat: 0x400090d7368

Regards,
-- 
Steven Chamberlain
[email protected]
