David Gwynne wrote:
> hrm. could you try it without your diff below and see if its still stable?
In light of what Fred has said, I double-checked this with fresh builds
of today's -CURRENT.
Firstly, with only your patch and not mine, I can reliably reproduce
the crash in under 30 seconds by running these three things against the
machine in parallel:
while ssh -vC sparc64 'true' ; do sleep 1 ; done
ssh -vC sparc64 'ping -f 10.0.0.1'
ssh -vC sparc64 'cd /usr/src/sys && cvs up -Pd'
That's with the current in-page header threshold (kern/subr_pool.c:261)
of:
} else if (sizeof(struct pool_item_header) * 2 >= size) {
which I think evaluates to 224 bytes on this machine.
Prior to v1.149, subr_pool.c used a threshold of size < PAGE_SIZE/16,
i.e. 512 bytes on sparc64 but 256 bytes on many other platforms.
I could still reproduce the bug with:
} else if (256 >= size) {
which creates mbufpl having 32 * 256-byte items per hardware page.
But I could not reproduce the bug once I changed it to:
} else if (256 > size) {
where mbufpl has 31 items per hardware page to make room for the in-page
header at the end.
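
For reference, here is the arithmetic behind those two item counts on an
8 KB sparc64 page.  This is just a sketch; the exact header size is my
assumption (half of the 224-byte threshold above), but anything up to
255 bytes comes out the same:

/*
 * Items-per-page arithmetic for mbufpl on sparc64: 8192-byte hardware
 * pages, 256-byte items.  hdrsize is an assumption.
 */
#include <stdio.h>

int
main(void)
{
    const unsigned pgsize = 8192;     /* sparc64 PAGE_SIZE */
    const unsigned itemsize = 256;    /* mbufpl item size */
    const unsigned hdrsize = 112;     /* assumed sizeof(struct pool_item_header) */

    /* page header kept off-page: the items fill the whole page */
    printf("off-page header: %u items\n", pgsize / itemsize);

    /* page header kept in-page at the end: one item slot is sacrificed */
    printf("in-page header:  %u items\n", (pgsize - hdrsize) / itemsize);

    return 0;
}

which prints 32 and 31 items respectively.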
So if it was tweaked to:
} else if (sizeof(struct pool_item_header) * 4 >= size) {
it would probably hide the bug we're seeing, but I'd still like to
figure out what the problem is exactly.
> my theory is dc is (was?) sensitive to the layout of objects in
> memory, so by moving the pool page headers in or out of the item pages
> you're moving dc next to something that ends up causing the iommu to
> fault.
It is quite odd that *having* the in-page header is what avoids the bug.
(So it is not, for example, that something doesn't expect the header to
be there and overwrites it.)
Whilst we don't know for sure that the crash is related to mbufpl, I feel
certain, given the above, that it involves one of the 256-byte pools. It
likely only happens when Nout > 31, or when Nout is close to a multiple
of 32, so here are some numbers from DDB right after reproducing the crash:
                 crash#1  crash#2  crash#3
syncache Nout=         0  (so can be ruled out)
mbufpl   Nout=        79       80       80
bufpl    Nout=      1332     1372     1306
vmsppl   Nout=        28       28       28
dma256   Nout=         0  (so can be ruled out)
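
As a rough illustration of that heuristic, purely the arithmetic and
only for mbufpl (assuming the 32 items per page shown later in the
"show pool" output):

/*
 * How the observed mbufpl Nout values relate to 32 items per page.
 * Values are taken from the table above (crash #1..#3); the point is
 * only that mbufpl is the one candidate whose Nout exceeds a single
 * page's worth of items.
 */
#include <stdio.h>

int
main(void)
{
    const unsigned itemsperpage = 32;
    const unsigned nout[] = { 79, 80, 80 };   /* mbufpl, crash #1..#3 */

    for (unsigned i = 0; i < 3; i++)
        printf("crash #%u: Nout %u = %u full pages + %u items\n",
            i + 1, nout[i], nout[i] / itemsperpage, nout[i] % itemsperpage);
    return 0;
}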
Then I got lucky and had a stack trace that actually implicated mbufs:
panic: psycho0: uncorrectable DMA error AFAR 6e868250 (pa=0 tte=0/60024012)
AFSR 410000ff40800000
Stopped at Debugger+0x8: nop
TID PID UID PRFLAGS PFLAGS CPU COMMAND
*13904 13904 0 0x100032 0 0 ping
psycho_ue(400008a3200, 62, 2, 400009c2050, 4000fb35700, 4000fb353e0) at
psycho_ue+0x7c
intr_handler(e0017ec8, 400008a3300, 1bea4, 800, ffffffffffff9116, ffff) at
intr_handler+0xc
sparc_interrupt(1400, 400090d7300, 1693968, 20, 0, 40009112fc0) at
sparc_interrupt+0x298
pool_put(1896128, 400090d7300, a000001, 4000fb32000, 400090d7fac, 40009112fc0)
at pool_put+0x30c
m_free(400090d7300, 9, 400090d7300, 0, 0, 40008ea26d0) at m_free+0x9c
m_freem(400090d7300, 400090d7300, 4000fb35b90, 0, 0, 0) at m_freem+0xc
sendit(0, 3, 0, 0, 4000fb35df0, 5e17c1e3aa4aec64) at sendit+0x2bc
sys_sendto(40008ea26d0, 4000fb35db0, 4000fb35df0, 35e528535b505347,
517bcf55b2212657, 14b) at sys_sendto+0x68
syscall(4000fb35ed0, 485, 33bd70a838, 33bd70a83c, 0, 0) at syscall+0x34c
softtrap(3, 33bdd49124, 40, 0, 33bdd5b164, 10) at softtrap+0x19c
Where 1896128 is mbufpl, and 400090d7300 is 4864 bytes (19 items) into
the current page:
ddb> show pool /p 1896128
POOL mbufpl: size 256 maxcolors 1
alloc 0x180e820
minitems 16, minpages 1, maxpages 128, npages 4
itemsperpage 32, nitems 128, nout 80, hardlimit 4294967295
nget 105828, nfail 0, nput 105748
npagealloc 4, npagefree 0, hiwat 4, nidle 1
empty page list:
page 0x40008e7e000, color 0x40008e7e000, nmissing 0
full page list:
page 0x400091aa000, color 0x400091aa000, nmissing 32
page 0x400090d8000, color 0x400090d8000, nmissing 32
partial-page list:
page 0x400090d6000, color 0x400090d6000, nmissing 16
curpage 0x400090d6000
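For completeness, the offset arithmetic behind "4864 bytes (19 items)",
using the addresses from the trace and the curpage line above:

/*
 * Where the faulting mbuf sits within mbufpl's current page.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t mbuf_addr = 0x400090d7300ULL;   /* arg to m_free/pool_put */
    const uint64_t curpage   = 0x400090d6000ULL;   /* mbufpl curpage */
    const uint64_t itemsize  = 256;                /* mbufpl item size */
    uint64_t off = mbuf_addr - curpage;

    printf("offset 0x%llx = %llu bytes = item %llu of 32\n",
        (unsigned long long)off, (unsigned long long)off,
        (unsigned long long)(off / itemsize));
    return 0;
}

which gives offset 0x1300 = 4864 bytes, item 19.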
The mbuf header may not hold valid data anymore, but here it seemed to
have type MT_SONAME:
ddb> show mbuf 400090d7300
mbuf 0x400090d7300
m_type: 3 m_flags: 0
m_next: 0x0 m_nextpkt: 0x0
m_data: 0x400090d7320 m_len: 16
m_dat: 0x400090d7320 m_pktdat: 0x400090d7368
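For reference (values quoted from memory, so treat as approximate), the
m_type of 3 corresponds to MT_SONAME in sys/mbuf.h:

/* mbuf types, as defined in sys/sys/mbuf.h */
#define MT_FREE     0   /* should be on free list */
#define MT_DATA     1   /* dynamic (data) allocation */
#define MT_HEADER   2   /* packet header */
#define MT_SONAME   3   /* socket name */

and m_len of 16 matches sizeof(struct sockaddr_in), which would fit the
sendto() destination address from the ping in the trace.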
Regards,
--
Steven Chamberlain
[email protected]