Garrett D'Amore wrote:
> Is this on SPARC or x86 hardware?
>
> It *sounds* sort of like it might be a problem with corruption of DMA.
> Make sure that when you do m_stop, you've really shut down your hardware
> including any DMA transfers *before* you yank the DMA mappings out from
> underneath it. (I can imagine in particular a DMA region getting
> reused, and if your device is still accessing that region, then problems
> could ensue.)
This is x86.
I switched to using a netperf fill file of /usr/dict/words, and that's
making it a bit easier to distinguish between something getting
zeroed, and something getting clobbered with what I'm sending.
Now I'm seeing a random splattering of a page or so of data from
that file to various locations.
The crash was a GPF:
> $C
ffffff0008181430 impl_acc_hdl_free+0x1e(ffffff01e1275680)
ffffff0008181460 ddi_dma_mem_free+0x2f(ffffff01f695ade8)
ffffff0008181480 myri10ge_dma_free+0x1d()
ffffff00081814c0 myri10ge_unprepare_tx_ring+0x4b()
ffffff00081814f0 myri10ge_teardown_slice+0x3f()
ffffff0008181530 myri10ge_stop_locked+0x6c()
ffffff0008181550 myri10ge_m_stop+0x6b()
ffffff0008181580 mac_stop+0x47(ffffff01d0082bb8)
<...>
The driver was tearing down its pre-mapped DMA transmit
buffers, and tripped over a ddi_dma_acc_handle that was
corrupted with the contents of /usr/dict/words:
> ffffff01e1275680::print ddi_acc_hdl_t
{
ah_vers = 0x6e6f4141
ah_bus_private = 0x676f6c6f6f7a0a6f
ah_platform_private = 0x5a0a6d6f6f7a0a79
<....>
That is the garbage I was sending. E.g.:
0xffffff01e1275680:
AAone\nzoo\nzoology\nzoom\nZorn\nZoroaster\nZoroastrian\nzounds\nz's\nzu
cch\nzippy\nzircon\nzirconium\nzloty\nzodiac\nzodiacal\nZoe\nZomba\nzombie\nzing\nzigzag\nzigzagging\nzi
lch\nZimmerman\nzinc\nzing\nZion\nZionism\nzip\nzenith\nzero\nzeroes\nzeroth\nzest\nzesty\nzeta\nZeus\nZ
iegler\nzig\nziggzap\nzazen\nzeal\nZealand\nzealot\nzealous\nzebra\nZeiss\nZellerbach\nZen
This continues on through the beginning of that page:
0xffffff01e1275000:
M\nacetone\nacetylene\nache\nachieve\nAchilles\naching\nachromatic\nacid
<....>
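To double-check that the handle fields really are text from the fill file, the values can be decoded one byte at a time in little-endian (x86) order. This is a minimal sketch in plain user-space C. The helper names le_to_ascii and le_matches are made up for illustration and are not part of mdb or the driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Render the low `nbytes` bytes of `val` in little-endian (x86 memory)
 * order. A newline is written as the two characters '\' 'n', and any
 * other non-printable byte as '.'. `out` must hold 2 * nbytes + 1 chars.
 */
static void
le_to_ascii(uint64_t val, size_t nbytes, char *out)
{
	size_t i, o = 0;

	assert(out != NULL);
	for (i = 0; i < nbytes && i < sizeof (val); i++) {
		unsigned char c = (unsigned char)(val >> (8 * i));

		if (c == '\n') {
			out[o++] = '\\';
			out[o++] = 'n';
		} else if (c >= 0x20 && c < 0x7f) {
			out[o++] = (char)c;
		} else {
			out[o++] = '.';
		}
	}
	out[o] = '\0';
}

/* Return nonzero if the decoded bytes equal `expect`. */
static int
le_matches(uint64_t val, size_t nbytes, const char *expect)
{
	char buf[2 * sizeof (val) + 1];

	le_to_ascii(val, nbytes, buf);
	return (strcmp(buf, expect) == 0);
}
```

Decoding ah_vers (4 bytes) gives "AAon", and ah_bus_private (8 bytes) gives "o\nzoolog". Both line up with the "AAone\nzoo\nzoology" text at 0xffffff01e1275680; the four bytes in between are presumably structure padding after the 32-bit ah_vers.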
I don't think this is DMA-related corruption, because the data is data
I'm sending. I should only be receiving acks.
As near as I can figure, there are two ways I could be screwing this up:
1) copying this data way out of bounds in my tx copy routine
2) somehow freeing the dma handle twice, and having it end up getting
allocated to a dblk and copied over.
I'm leaning towards #2.
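Both hypotheses can be guarded against cheaply while debugging. The sketch below is user-space C using malloc/free as a stand-in for the DDI calls. The struct and function names are hypothetical and do not reflect the actual myri10ge data structures. It only shows the shape of the checks: a copy that refuses to run past the buffer, and a free that clears the pointer so a second free is detected rather than silently handing the memory to someone else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one pre-mapped tx copy buffer. */
struct tx_copybuf {
	char	*va;	/* buffer address; NULL once freed */
	size_t	len;	/* usable bytes at va */
};

static int
tx_copybuf_alloc(struct tx_copybuf *b, size_t len)
{
	b->va = malloc(len);
	b->len = (b->va != NULL) ? len : 0;
	return (b->va != NULL ? 0 : -1);
}

/* Hypothesis 1: reject any copy that would run past the end of the buffer. */
static int
tx_copy_checked(struct tx_copybuf *b, size_t off, const void *src, size_t n)
{
	if (b->va == NULL || off > b->len || n > b->len - off)
		return (-1);
	memcpy(b->va + off, src, n);
	return (0);
}

/*
 * Hypothesis 2: make a double free detectable. Returns 0 on the first
 * free and -1 on any repeat; a driver would log or assert at that point.
 */
static int
tx_copybuf_free(struct tx_copybuf *b)
{
	if (b->va == NULL)
		return (-1);
	free(b->va);
	b->va = NULL;
	b->len = 0;
	return (0);
}
```

In the driver, the equivalent would be to clear the stored handle and address right after the teardown call and to check them on entry, so that a second pass through the free path trips a check instead of freeing memory that may already have been reallocated.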
I've tried setting kmem_flags to 0xf, and it doesn't seem to
change anything, except that the crash seems to take longer
to trigger.
Any ideas would be welcome. Is there any way I can map the
dma acc handles read-only, so that if somebody touches them
the system will panic?
Thanks,
Drew