On Wed, Dec 23, 2009 at 11:43:40AM -0800, Brandeburg, Jesse wrote: > On Tue, 22 Dec 2009, Brandon Philips wrote: > > On 11:20 Thu 10 Dec 2009, Tantilov, Emil S wrote: > > > >> I am trying to test the patches you submitted (thanks btw) and so > > > >> far am not able to reproduce the panic you described. When MTU is at > > > >> 1500 RCTL.LPE (bit 5) is set to 0 and the HW will not allow the > > > >> reception of large packets (>1522 bytes, which is what rx_buffer_len > > > >> is set to). This is basically what I am seeing in my tests - packets > > > >> are discarded by the HW. > > > > I have a memory dump from an SLE10 SP3 machine that seems to reproduce > > this issue. The testing environment was netperf with the MTU being > > switched every minute from 9000 -> 1500 and it took 40 hours to hit > > the bug. So, an overnight test, as you tried, may not be enough. > > Thanks for testing Brandon, I think your test (with e1000e 1.0.2.5) is > significantly different than the test that Neil started with. That said I > think it is a valuable test and we are going to start a test today that > uses pktgen on two machines to send 64 byte and 9014 byte packets to a > host that is changing its MTU every 5-10 seconds. > > > In the memory dump there are 6 skb's in the ring that have memory > > overwritten from skb->data to skb->data + 2048. The machine ended up > > oopsing in skb_release_data() from e1000_clean_all_rx_rings() from > > e1000_change_mtu(). > > I think we should put a patch like the below into the kernel and actually > *catch* any overrun DMAs even on a production machine. At that point we > could even leak that skb memory to prevent the corrupted memory from > making its way into the general environment. > > > 35:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet > > Controller (Copper) (rev 06) > > what kind of system is this that had such a high bus number? PPC64? > > > Subsystem: Intel Corporation PRO/1000 PT Quad Port LP Server Adapter > > Kernel driver in use: e1000 > > Kernel modules: e1000 > > e1000_change_mtu could possibly have a race that would allow corruption if > all receives were not completed in the time we waited (10 millseconds) for > some reason, but only if LPE was cleared already. I still think your test > is significantly different and maybe showing a different (but similar) > edge case bug than Neil. > > shouldn't IOMMU systems be catching this too when it was occuring? we > only call pci_map_* with buffer_info->length which is assigned > rx_buffer_len. > > e1000/e1000e: check rx length for overruns > > From: Jesse Brandeburg <[email protected]> > > it has been reported that some tests can cause DMA overruns resulting in > corrupted memory. If the hardware writes more data to memory than we had > allocated this is something we can check for. > > For now, WARN_ON, with the future capability of doing something like leaking > the memory rather than returning a known corrupt buffer to userspace. > > Signed-off-by: Jesse Brandeburg <[email protected]> > CC: [email protected] > CC: [email protected]
I think this seems like a reasonable idea. Additionally (or perhaps alternatively), it might be a good idea to (when allocating buffers in the default setup path), expand the allocation size by a word, that we write a cannary value into. Then we can check that cannary on the napi poll should the hardware dma past the end of the skb. Neil > ------------------------------------------------------------------------------ This SF.Net email is sponsored by the Verizon Developer Community Take advantage of Verizon's best-in-class app development support A streamlined, 14 day to market process makes app distribution fast and easy Join now and get one step closer to millions of Verizon customers http://p.sf.net/sfu/verizon-dev2dev _______________________________________________ E1000-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/e1000-devel
