Re: [E1000-devel] Memory Corruption with e1000
Sorry, I noticed that on the earlier post I only did a single reply rather than a reply-all.

On Thu, Jun 6, 2013 at 4:37 PM, Ronciak, John john.ronc...@intel.com wrote:
> We have some ideas and are working on a patch for you to try. Since we won't really be able to test it can you do that if we get it to you? Do you know how to patch a driver? Or should we send you the whole thing (a complete new driver like you would get off of our SF site)?

I can apply the patch. Will the patch be based upon the 7.3.21 that is in 3.0.80? Or the newer 8.0.35?

Thanks,
Pete

--
How ServiceNow helps IT people transform IT departments:
1. A cloud service to automate IT design, transition and operations
2. Dashboards that offer high-level views of enterprise services
3. A single system of record for all IT processes
http://p.sf.net/sfu/servicenow-d2d-j
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
Re: [E1000-devel] Memory Corruption with e1000
On 06/05/2013 08:34 PM, Peter LaDow wrote:
> On 6/5/13, Ronciak, John john.ronc...@intel.com wrote:
>> So I have a couple of questions. Does this happen with a non-preemptive kernel? I understand that you probably need to use a preemptive kernel but for testing purposes it would be good to know. We don't always test with preemptive kernels.
>
> Hmmm... If you mean no RT patches, then yes. On a vanilla 3.0.80 kernel.

What about the pre-emption behavior of the kernel? Namely Processor type and Features - Preemption Model. Are you using no preemption, or forced preemption?

-PJ

>> When doing the up/down transitions is there system under test? I mean sending and receiving packets? If it is what is the load like? Does changing the load make a difference? Does stopping the network traffic first make a difference in the outcome?
>
> Yes, the load makes a difference. On a silent network (or no link at all) this does not occur. Our network is quite busy. It isn't sending much (perhaps DHCP discovers and some IPv6 stuff).
>
> Thanks,
> Pete
Re: [E1000-devel] Memory Corruption with e1000
On 6/6/13, Peter P Waskiewicz Jr peter.p.waskiewicz...@intel.com wrote:
> What about the pre-emption behavior of the kernel? Namely Processor type and Features - Preemption Model. Are you using no preemption, or forced preemption?

It is PREEMPT_FULL. I'll turn it off and give it a spin.

Thanks,
Pete
Re: [E1000-devel] Memory Corruption with e1000
On Thu, Jun 6, 2013 at 12:30 AM, Peter P Waskiewicz Jr peter.p.waskiewicz...@intel.com wrote:
> What about the pre-emption behavior of the kernel? Namely Processor type and Features - Preemption Model. Are you using no preemption, or forced preemption?

Ok. I've done testing. Yes, we were building with PREEMPT_FULL. I've done some further testing and can re-create the problem on vanilla, non-preempt kernels. See below.

# uname -a
Linux (none) 3.0.80-rt108 #2 Thu Jun 6 16:09:35 UTC 2013 ppc GNU/Linux

And I still get the slab corruption leading up to the kernel panic:

Slab corruption: size-2048 start=ee2b2070, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [c0208514](skb_release_data+0xb4/0xc8)
020: 6b 6b ff ff ff ff ff ff 00 0d ed 47 d9 87 81 00
030: 00 f2 08 06 00 01 08 00 06 04 00 01 00 0d ed 47
040: d9 87 0a f1 0a ea 00 00 00 00 00 00 0a f1 0a ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 00 00 09 81 d2 0f 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=ee2b2888, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [c0209b8c](__netdev_alloc_skb+0x28/0x60)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
Slab corruption: size-2048 start=ed401480, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [c0208514](skb_release_data+0xb4/0xc8)
020: 6b 6b ff ff ff ff ff ff e0 db 55 e4 ce f9 08 00
030: 45 00 01 3e 3e 1a 00 00 80 11 ca c0 0a ca 0d 42
040: 0a ca 0d ff 00 8a 00 8a 01 2a a5 96 11 0e af 81
050: 0a ca 0d 42 00 8a 01 14 00 00 20 45 42 45 4f 45
060: 45 46 43 45 4c 45 50 45 44 45 49 45 4f 45 43 43
070: 41 43 41 43 41 43 41 43 41 41 41 00 20 46 44 45
Prev obj: start=ed400c68, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [c0209b8c](__netdev_alloc_skb+0x28/0x60)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
Unable to handle kernel paging request for data at address 0x20454c45
Faulting instruction address: 0xc0062498
Oops: Kernel access of bad area, sig: 11 [#1]
SEL35xx Platform
Modules linked in:
NIP: c0062498 LR: c02084d8 CTR: c000cbbc
REGS: ee85bc60 TRAP: 0300 Not tainted (3.0.80-rt108)
MSR: 9032 EE,ME,IR,DR CR: 24008248 XER:
DAR: 20454c45, DSISR: 2000
TASK = ef3e5830[4616] 'ifconfig' THREAD: ee85a000
GPR00: ee85bd10 ef3e5830 20454c45 2d746baa 05f2 0002
GPR08: c03b14e4 ed7471a8 ee85bcd0 5c26 10087a48 bfe0e41c 10064ae4
GPR16: 10064bc0 bfe0e40c bfe0e3f4 0228 8914 c019a488
GPR24: c019a9cc ed70f4b0 005c ed70f340 ef063120 0001 ee62bd30
NIP [c0062498] put_page+0x0/0x34
LR [c02084d8] skb_release_data+0x78/0xc8
Call Trace:
[ee85bd20] [c020810c] __kfree_skb+0x18/0xbc
[ee85bd30] [c0195734] e1000_clean_rx_ring+0x10c/0x1a4
[ee85bd60] [c01957f4] e1000_clean_all_rx_rings+0x28/0x54
[ee85bd70] [c0198d40] e1000_close+0x30/0xb4
[ee85bd90] [c0212408] __dev_close_many+0xa0/0xe0
[ee85bda0] [c02141a0] __dev_close+0x2c/0x4c
[ee85bdc0] [c0210a58] __dev_change_flags+0xb8/0x140
[ee85bde0] [c0212324] dev_change_flags+0x1c/0x60
[ee85be00] [c0267594] devinet_ioctl+0x2a4/0x700
[ee85be60] [c026839c] inet_ioctl+0xc8/0xfc
[ee85be70] [c02006d4] sock_ioctl+0x260/0x2a0
[ee85be90] [c009145c] vfs_ioctl+0x2c/0x58
[ee85bea0] [c0091bc8] do_vfs_ioctl+0x610/0x698
[ee85bf10] [c0091ca8] sys_ioctl+0x58/0x88
[ee85bf40] [c000e674] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff35a3c
    LR = 0xff359a0
Instruction dump:
419e0018 3c80c006 38630180 38842abc 38a0 4bfffe65 80010014 bbc10008
38210010 7c0803a6 4e800020 4b54 8003 7c691b78 700bc000 41a20008
Kernel panic - not syncing: Fatal exception
Call Trace:
[ee85bb90] [c0007b80] show_stack+0x58/0x154 (unreliable)
[ee85bbd0] [c001c3a8] panic+0xa8/0x1cc
[ee85bc20] [c000b1f0] die+0x178/0x19c
[ee85bc40] [c0011a44] bad_page_fault+0xe8/0xfc
[ee85bc50] [c000eb14] handle_page_fault+0x7c/0x80
--- Exception: 300 at put_page+0x0/0x34
    LR = skb_release_data+0x78/0xc8
[ee85bd10] [] (null) (unreliable)
[ee85bd20] [c020810c] __kfree_skb+0x18/0xbc
[ee85bd30] [c0195734] e1000_clean_rx_ring+0x10c/0x1a4
[ee85bd60] [c01957f4] e1000_clean_all_rx_rings+0x28/0x54
[ee85bd70] [c0198d40] e1000_close+0x30/0xb4
[ee85bd90] [c0212408] __dev_close_many+0xa0/0xe0
[ee85bda0] [c02141a0] __dev_close+0x2c/0x4c
[ee85bdc0] [c0210a58] __dev_change_flags+0xb8/0x140
[ee85bde0] [c0212324] dev_change_flags+0x1c/0x60
[ee85be00] [c0267594] devinet_ioctl+0x2a4/0x700
[ee85be60] [c026839c] inet_ioctl+0xc8/0xfc
[ee85be70] [c02006d4] sock_ioctl+0x260/0x2a0
[ee85be90] [c009145c] vfs_ioctl+0x2c/0x58
[ee85bea0] [c0091bc8] do_vfs_ioctl+0x610/0x698
[ee85bf10] [c0091ca8] sys_ioctl+0x58/0x88
[ee85bf40] [c000e674] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff35a3c
    LR = 0xff359a0

And with a vanilla, no-preempt kernel:

# uname -a
Linux (none) 3.0.80 #5 Thu Jun 6 16:26:15 UTC 2013 ppc GNU/Linux

slab error in
Re: [E1000-devel] Memory Corruption with e1000
On Thu, 6 Jun 2013 09:38:50 -0700 Peter LaDow pet...@gocougs.wsu.edu wrote:
> On Thu, Jun 6, 2013 at 12:30 AM, Peter P Waskiewicz Jr peter.p.waskiewicz...@intel.com wrote:
>> What about the pre-emption behavior of the kernel? Namely Processor type and Features - Preemption Model. Are you using no preemption, or forced preemption?
>
> Ok. I've done testing. Yes, we were building with PREEMPT_FULL. I've done some further testing and can re-create the problem on vanilla, non-preempt kernels. See below.
>
> # uname -a
> Linux (none) 3.0.80-rt108 #2 Thu Jun 6 16:09:35 UTC 2013 ppc GNU/Linux
>
> And I still get the slab corruption leading up to the kernel panic:
>
> Slab corruption: size-2048 start=ee2b2070, len=2048
> Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> Last user: [c0208514](skb_release_data+0xb4/0xc8)
> 020: 6b 6b ff ff ff ff ff ff 00 0d ed 47 d9 87 81 00

That is quite clearly a broadcast. It seems to me it may be a VLAN packet (0x8100), maybe for VLAN 0xf2? So this means that the receive unit of the e1000 is not being stopped completely (or is restarted by something), but the memory of the DMA buffer (the 2 kB allocation) is being freed and then still DMA'd to.

> 030: 00 f2 08 06 00 01 08 00 06 04 00 01 00 0d ed 47
> 040: d9 87 0a f1 0a ea 00 00 00 00 00 00 0a f1 0a ea
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 060: 00 00 09 81 d2 0f 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Next obj: start=ee2b2888, len=2048
> Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> Last user: [c0209b8c](__netdev_alloc_skb+0x28/0x60)
> 000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> Slab corruption: size-2048 start=ed401480, len=2048
> Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> Last user: [c0208514](skb_release_data+0xb4/0xc8)
> 020: 6b 6b ff ff ff ff ff ff e0 db 55 e4 ce f9 08 00
> 030: 45 00 01 3e 3e 1a 00 00 80 11 ca c0 0a ca 0d 42

Same thing here, but this is an IP packet. This is clearly a network adapter putting frames into memory that has been freed. I will see if someone here can reproduce this issue, but it seems quite clear what is happening; we just need to figure out why.
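The two freed objects quoted in this message can be decoded by hand, which supports the reading that live network frames landed in already-freed buffers. A minimal Python sketch, assuming each frame starts at offset 0x22 of its dump (the first byte after the 0x6b free-poison). The byte strings are copied from the dumps; `mac`, `describe` and the dictionary keys are illustrative names, not part of the driver or any tool:

```python
import struct

# Bytes copied from the two "Slab corruption" dumps, starting at offset 0x22
# (assumed to be the first byte of the received frame, right after the
# 0x6b free-poison bytes).
FRAME_1 = bytes.fromhex(
    "ffffffffffff"      # destination MAC
    "000ded47d987"      # source MAC
    "8100" "00f2"       # 802.1Q tag: TPID, TCI
    "0806"              # inner EtherType
)
FRAME_2 = bytes.fromhex(
    "ffffffffffff"      # destination MAC
    "e0db55e4cef9"      # source MAC
    "0800"              # EtherType
    "4500013e3e1a00008011cac00aca0d42"  # IPv4 header, first 16 bytes
    "0aca0dff"          # IPv4 destination address
    "008a008a"          # UDP source and destination ports
)


def mac(raw):
    """Format six bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def describe(frame):
    """Decode the Ethernet header and, for IPv4/UDP, addresses and ports."""
    info = {
        "dst": mac(frame[0:6]),
        "src": mac(frame[6:12]),
        "broadcast": frame[0:6] == b"\xff" * 6,
    }
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset = 14
    if ethertype == 0x8100:  # 802.1Q VLAN tag present
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        info["vlan"] = tci & 0x0FFF  # low 12 bits of the TCI are the VLAN ID
        offset = 18
    info["ethertype"] = ethertype
    if ethertype == 0x0800 and len(frame) >= offset + 20:  # IPv4
        ip = frame[offset:offset + 20]
        header_len = (ip[0] & 0x0F) * 4
        info["ip_src"] = ".".join(str(b) for b in ip[12:16])
        info["ip_dst"] = ".".join(str(b) for b in ip[16:20])
        l4 = frame[offset + header_len:offset + header_len + 4]
        if ip[9] == 17 and len(l4) == 4:  # protocol 17 is UDP
            info["udp_ports"] = struct.unpack("!HH", l4)
    return info


if __name__ == "__main__":
    for frame in (FRAME_1, FRAME_2):
        print(describe(frame))
```

Under these assumptions the first frame decodes as a broadcast ARP (EtherType 0x0806) tagged with VLAN 242 (0xf2), and the second as a UDP broadcast from 10.202.13.66 to 10.202.13.255 on port 138, which matches the VLAN and IP observations above.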
Re: [E1000-devel] Memory Corruption with e1000
On Thu, Jun 6, 2013 at 11:23 AM, Ronciak, John john.ronc...@intel.com wrote:
> I agree with Jesse but this driver has been in the field for a very long time with no reports like this coming to us. Can you send us the dmesg when this is happening? I want to see if there are messages from the driver like if the down is being delayed somehow. Or re-enabled.

I stripped out the up/down messages. But yes, there are sometimes up messages. At the end is the complete dmesg output. I've tweaked the script to print whenever the interface is changed. It appears that the slab errors are when the interface comes down:

Bringing eth2 up...
e1000: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
ADDRCONF(NETDEV_UP): eth2: link is not ready
ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Tearing eth2 down...
slab error in verify_redzone_free(): cache `size-2048': memory outside object was overwritten
Call Trace:
[ee275c70] [c0007b80] show_stack+0x58/0x154 (unreliable)
[ee275cb0] [c007bb0c] __slab_error+0x2c/0x3c
[ee275cc0] [c007c0d0] cache_free_debugcheck+0x184/0x274
[ee275cf0] [c007c36c] kfree+0x90/0x10c
[ee275d10] [c02079e4] skb_release_data+0xb4/0xc8
[ee275d20] [c02075dc] __kfree_skb+0x18/0xbc
[ee275d30] [c0194d50] e1000_clean_rx_ring+0x10c/0x1a4
[ee275d60] [c0194e10] e1000_clean_all_rx_rings+0x28/0x54
[ee275d70] [c019835c] e1000_close+0x30/0xb4
[ee275d90] [c02118d8] __dev_close_many+0xa0/0xe0
[ee275da0] [c0213670] __dev_close+0x2c/0x4c
[ee275dc0] [c020ff28] __dev_change_flags+0xb8/0x140
[ee275de0] [c02117f4] dev_change_flags+0x1c/0x60
[ee275e00] [c02669b4] devinet_ioctl+0x2a4/0x700
[ee275e60] [c02677bc] inet_ioctl+0xc8/0xfc
[ee275e70] [c01ffba4] sock_ioctl+0x260/0x2a0
[ee275e90] [c0090a80] vfs_ioctl+0x2c/0x58
[ee275ea0] [c00911ec] do_vfs_ioctl+0x610/0x698
[ee275f10] [c00912cc] sys_ioctl+0x58/0x88
[ee275f40] [c000e674] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff35a3c
    LR = 0xff359a0
ee2a97b8: redzone 1:0x300574f524b4752, redzone 2:0xd84156c5635688c0.
slab error in verify_redzone_free(): cache `size-2048': memory outside object was overwritten
Call Trace:
[ee275c70] [c0007b80] show_stack+0x58/0x154 (unreliable)
[ee275cb0] [c007bb0c] __slab_error+0x2c/0x3c
[ee275cc0] [c007c0d0] cache_free_debugcheck+0x184/0x274
[ee275cf0] [c007c36c] kfree+0x90/0x10c
[ee275d10] [c02079e4] skb_release_data+0xb4/0xc8
[ee275d20] [c02075dc] __kfree_skb+0x18/0xbc
[ee275d30] [c0194d50] e1000_clean_rx_ring+0x10c/0x1a4
[ee275d60] [c0194e10] e1000_clean_all_rx_rings+0x28/0x54
[ee275d70] [c019835c] e1000_close+0x30/0xb4
[ee275d90] [c02118d8] __dev_close_many+0xa0/0xe0
[ee275da0] [c0213670] __dev_close+0x2c/0x4c
[ee275dc0] [c020ff28] __dev_change_flags+0xb8/0x140
[ee275de0] [c02117f4] dev_change_flags+0x1c/0x60
[ee275e00] [c02669b4] devinet_ioctl+0x2a4/0x700
[ee275e60] [c02677bc] inet_ioctl+0xc8/0xfc
[ee275e70] [c01ffba4] sock_ioctl+0x260/0x2a0
[ee275e90] [c0090a80] vfs_ioctl+0x2c/0x58
[ee275ea0] [c00911ec] do_vfs_ioctl+0x610/0x698
[ee275f10] [c00912cc] sys_ioctl+0x58/0x88
[ee275f40] [c000e674] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff35a3c
    LR = 0xff359a0
ee2a8fa0: redzone 1:0xd84156c5635688c0, redzone 2:0x534c4f545c42524f.
Bringing eth2 up...
e1000: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
ADDRCONF(NETDEV_UP): eth2: link is not ready
ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Tearing eth2 down...
Unable to handle kernel paging request for data at address 0x
Faulting instruction address: 0xc0061c64
Oops: Kernel access of bad area, sig: 11 [#1]
SEL35xx Platform
Modules linked in:
NIP: c0061c64 LR: c02079a8 CTR: c000cbbc
REGS: ee2c3c60 TRAP: 0300 Not tainted (3.0.80)
MSR: 9032 EE,ME,IR,DR CR: 24008248 XER:
DAR: , DSISR: 2000
TASK = ed56dba0[4730] 'ifconfig' THREAD: ee2c2000
GPR00: ee2c3d10 ed56dba0 2e6a2e2a 05f2 0002
GPR08: ef3d8da0 ee6a3428 0800 f04d 10087a48 bfd6bb1c 10064ae4
GPR16: 10064bc0 bfd6bb0c bfd6baf4 0228 8914 c0199aa4
GPR24: c0199fe8 ed70f4b0 0059 ed70f340 ef063120 0001 ee75e818
NIP [c0061c64] put_page+0x0/0x34
LR [c02079a8] skb_release_data+0x78/0xc8
Call Trace:
[ee2c3d20] [c02075dc] __kfree_skb+0x18/0xbc
[ee2c3d30] [c0194d50] e1000_clean_rx_ring+0x10c/0x1a4
[ee2c3d60] [c0194e10] e1000_clean_all_rx_rings+0x28/0x54
[ee2c3d70] [c019835c] e1000_close+0x30/0xb4
[ee2c3d90] [c02118d8] __dev_close_many+0xa0/0xe0
[ee2c3da0] [c0213670] __dev_close+0x2c/0x4c
[ee2c3dc0] [c020ff28] __dev_change_flags+0xb8/0x140
[ee2c3de0] [c02117f4] dev_change_flags+0x1c/0x60
[ee2c3e00] [c02669b4] devinet_ioctl+0x2a4/0x700
[ee2c3e60] [c02677bc] inet_ioctl+0xc8/0xfc
[ee2c3e70] [c01ffba4] sock_ioctl+0x260/0x2a0
[ee2c3e90] [c0090a80] vfs_ioctl+0x2c/0x58
[ee2c3ea0] [c00911ec] do_vfs_ioctl+0x610/0x698
[ee2c3f10] [c00912cc] sys_ioctl+0x58/0x88
[ee2c3f40] [c000e674] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff35a3c
    LR = 0xff359a0
Instruction dump:
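The two "redzone 1/redzone 2" lines in the dmesg output can be checked against the slab debug magic values to see what replaced them. A small Python sketch, assuming the 64-bit values print in memory byte order because the platform is big-endian PowerPC. The constants are the kernel's `RED_INACTIVE` and `RED_ACTIVE` magics from `include/linux/poison.h`; `printable` and `classify` are hypothetical helper names:

```python
# Redzone magic values used by the Linux slab debug code
# (include/linux/poison.h).
RED_INACTIVE = 0x09F911029D74E35B  # written around a free object
RED_ACTIVE = 0xD84156C5635688C0    # written around an allocated object

# Values copied from the two "redzone 1/redzone 2" lines in the dmesg output.
REPORTED = {
    "ee2a97b8": (0x300574F524B4752, 0xD84156C5635688C0),
    "ee2a8fa0": (0xD84156C5635688C0, 0x534C4F545C42524F),
}


def printable(value):
    """Render a 64-bit value as 8 big-endian bytes, '.' for non-printables."""
    raw = value.to_bytes(8, "big")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


def classify(value):
    """Name the redzone magic, or show the bytes that overwrote it."""
    if value == RED_ACTIVE:
        return "RED_ACTIVE"
    if value == RED_INACTIVE:
        return "RED_INACTIVE"
    return "overwritten with " + printable(value)


if __name__ == "__main__":
    for obj, (rz1, rz2) in REPORTED.items():
        print(obj, "| redzone 1:", classify(rz1), "| redzone 2:", classify(rz2))
```

In each report one redzone is intact (`RED_ACTIVE`) and the other has been replaced by printable text, "..WORKGR" and "SLOT\BRO". Those look like fragments of packet payload rather than random garbage, which would be consistent with received frames being written past or into these buffers.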
Re: [E1000-devel] Memory Corruption with e1000
OK, so a couple of things kind of stand out. What interface is the e1000 on? eth0? That's not being called out, or you filtered it out from the dmesg. Early on eth2 is the e1000 interface but later it's one of the Gianfar interfaces. Can you clear this up for us?

Also, it looks like you have a bonding configuration. What interfaces are being bonded? You also have a Gianfar NIC with 2 interfaces. Is this still happening when no bonding is configured? Does the problem occur when the Gianfar interfaces are down/inactive? I'm just trying to narrow things down a bit. I'd like this to be tried with just the e1000 driver being active to see if it's happening then.

Can you send the entire dmesg? Is it too big to email?

Cheers,
John
Re: [E1000-devel] Memory Corruption with e1000
On Thu, Jun 6, 2013 at 1:10 PM, Ronciak, John john.ronc...@intel.com wrote:
> OK so a couple of thing kind of stand out. What interface is the e1000 on? eth0? That's not being called out or you filtered it out from the dmesg. Early on eth2 is the e1000 interface but later it's one of the Gianfar interfaces. Can you clear this up for us?

The interfaces do get renamed early in the boot process. We use ifrename to force the e1000 interface to eth2. The gianfar interfaces are on eth0 and eth1.

> Also, it looks like you have a bonding configuration. What interfaces are being bonded? You also have a Gianfar NIC with 2 interfaces. Is this still happening when no bonding is configured? Does the problem occur when the Gianfar interfaces are down/inactive? I'm just trying to narrow things down a bit. I'd like this to be tried with just the e1000 driver being active to see if it's happening then.

Currently, there is no bonding configured at all. While we do allow bonding, there are currently no bonded interfaces.

I tried the up/down loop with the gianfar devices, and I do not get the failure. They are connected to the same network, and there is no problem.

I shut down the gianfar adapters (eth0 and eth1) and re-ran the up/down loop. I still get the same panic.

> Can you send the entire dmesg? Is it too big to email?

That was the entire dmesg output.

Thanks,
Pete
Re: [E1000-devel] Memory Corruption with e1000
Hi Peter,

We have some ideas and are working on a patch for you to try. Since we won't really be able to test it can you do that if we get it to you? Do you know how to patch a driver? Or should we send you the whole thing (a complete new driver like you would get off of our SF site)?

Cheers,
John
[E1000-devel] Memory Corruption with e1000
We are running a PPC system with an 82540EP that is causing kernel panics when there is heavy traffic and the interface is brought up and/or down (we aren't sure which yet). We are running 3.0.57-rt82, but we can re-create this issue reliably with 3.0.80 and 3.0.80-rt109 with the base version included in the kernel (which is 7.3.21-k8-NAPI). However, I've also tried 8.0.35, and get the same failure.

We've narrowed it down to this case and can reliably re-create the issue with a tight loop, such as:

    while :
    do
        ip link set eth2 up
        sleep 10
        ip link set eth2 down
        sleep 10
    done

I'm not sure where to look and any help would be appreciated.

In this loop we can reliably generate a kernel panic such as:

Unable to handle kernel paging request for data at address 0x20454a46
Faulting instruction address: 0xc0069924
Oops: Kernel access of bad area, sig: 11 [#1]
PREEMPT PPC Platform
Modules linked in:
NIP: c0069924 LR: c021cce0 CTR: c000cecc
REGS: ed4f1c60 TRAP: 0300 Not tainted (3.0.80-rt108)
MSR: 9032 EE,ME,IR,DR CR: 24008248 XER:
DAR: 20454a46, DSISR: 2000
TASK = eda46780[3106] 'ifconfig' THREAD: ed4f
GPR00: ed4f1d10 eda46780 20454a46 2d6fcc2a 05f2 0002
GPR08: eda46780 ed6fd228 ed4f1cd0 90b1 10084718 bfcceaec 10062044
GPR16: 10062120 bfcceadc bfcceac4 0228 8914 c01ac398
GPR24: c01ac8c8 ed066520 0061 ed0663a0 ef0448f0 0001 ed575580
NIP [c0069924] put_page+0x0/0x34
LR [c021cce0] skb_release_data+0x78/0xc8
Call Trace:
[ed4f1d20] [c021c914] __kfree_skb+0x18/0xbc
[ed4f1d30] [c01a7620] e1000_clean_rx_ring+0x10c/0x1a4
[ed4f1d60] [c01a76e0] e1000_clean_all_rx_rings+0x28/0x54
[ed4f1d70] [c01aac50] e1000_close+0x30/0xb4
[ed4f1d90] [c0226e2c] __dev_close_many+0xa0/0xe0
[ed4f1da0] [c0228c64] __dev_close+0x2c/0x4c
[ed4f1dc0] [c0225224] __dev_change_flags+0xb8/0x140
[ed4f1de0] [c0226d48] dev_change_flags+0x1c/0x60
[ed4f1e00] [c027e7f8] devinet_ioctl+0x2a4/0x700
[ed4f1e60] [c027f450] inet_ioctl+0xc8/0xfc
[ed4f1e70] [c02147b0] sock_ioctl+0x260/0x2a0
[ed4f1e90] [c009b468] vfs_ioctl+0x2c/0x58
[ed4f1ea0] [c009bc44] do_vfs_ioctl+0x64c/0x6d4
[ed4f1f10] [c009bd24] sys_ioctl+0x58/0x88
[ed4f1f40] [c000e954] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff35a3c
    LR = 0xff359a0
Instruction dump:
7c0802a6 3c80c007 3884a500 90010024 38a10008 3800 90010008 4b0d
80010024 38210020 7c0803a6 4e800020 8003 7c691b78 700bc000 41a20008
Kernel panic - not syncing: Fatal exception
Call Trace:
[ed4f1b90] [c0007ccc] show_stack+0x58/0x154 (unreliable)
[ed4f1bd0] [c001d744] panic+0xb0/0x1d8
[ed4f1c20] [c000b4b8] die+0x1ac/0x1d0
[ed4f1c40] [c0011e38] bad_page_fault+0xe8/0xfc
[ed4f1c50] [c000edf4] handle_page_fault+0x7c/0x80
--- Exception: 300 at put_page+0x0/0x34
    LR = skb_release_data+0x78/0xc8
[ed4f1d10] [] (null) (unreliable)
[ed4f1d20] [c021c914] __kfree_skb+0x18/0xbc
[ed4f1d30] [c01a7620] e1000_clean_rx_ring+0x10c/0x1a4
[ed4f1d60] [c01a76e0] e1000_clean_all_rx_rings+0x28/0x54
[ed4f1d70] [c01aac50] e1000_close+0x30/0xb4
[ed4f1d90] [c0226e2c] __dev_close_many+0xa0/0xe0
[ed4f1da0] [c0228c64] __dev_close+0x2c/0x4c
[ed4f1dc0] [c0225224] __dev_change_flags+0xb8/0x140
[ed4f1de0] [c0226d48] dev_change_flags+0x1c/0x60
[ed4f1e00] [c027e7f8] devinet_ioctl+0x2a4/0x700
[ed4f1e60] [c027f450] inet_ioctl+0xc8/0xfc
[ed4f1e70] [c02147b0] sock_ioctl+0x260/0x2a0
[ed4f1e90] [c009b468] vfs_ioctl+0x2c/0x58
[ed4f1ea0] [c009bc44] do_vfs_ioctl+0x64c/0x6d4
[ed4f1f10] [c009bd24] sys_ioctl+0x58/0x88
[ed4f1f40] [c000e954] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff35a3c
    LR = 0xff359a0

When turning on SLAB checks, I see:

Slab corruption: size-16384 start=ed4ec000, len=16384
690: 6b 6b ff ff ff ff ff ff b8 ac 6f 99 bf 8b 08 00
6a0: 45 00 00 24 3f 34 00 00 80 11 ca cf 0a ca 0d 33
6b0: 0a ca 0d ff 06 cc 06 cf 00 10 bc 1d c5 0b 40 01
6c0: 00 10 00 33 00 00 00 00 00 00 00 00 00 00 3f dd
6d0: ed f8 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
ea0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ff ff ff ff ff ff
Slab corruption: size-2048 start=ed4e6570, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [ (null)](0x0)
0c0: 6b 6b ff ff ff ff ff ff 5c 26 0a 41 81 27 08 00
0d0: 45 00 00 4e 7d 44 00 00 80 11 8c 79 0a ca 0d 4f
0e0: 0a ca 0d ff 00 89 00 89 00 3a b5 a7 be 71 01 10
0f0: 00 01 00 00 00 00 00 00 20 45 4c 45 43 45 50 46
100: 49 43 41 43 41 43 41 43 41 43 41 43 41 43 41 43
110: 41 43 41 43 41 43 41 41 41 00 00 20 00 01 02 5a
Next obj: start=ed4e6d88, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [c021e294](__netdev_alloc_skb+0x28/0x60)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
Slab corruption: size-2048 start=ed54eb48, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [c021cd1c](skb_release_data+0xb4/0xc8)
020: 6b 6b ff ff ff ff ff ff 18 03 73 e4 64 18 08 00
030: 45 00 00 44 61 b8 00 00 80 11 a7 c9 0a ca 0d 95
040: 0a ca
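One hedged observation on the faulting address in the first oops: 0x20454a46 is not a plausible kernel pointer, but its bytes are printable ASCII. The Python sketch below decodes it and applies a rough heuristic for the start of a first-level-encoded NetBIOS name (a 0x20 length byte followed by letters in the range A to P). That such names are the source is an assumption, suggested only by the UDP port 137 (0x0089) traffic visible in the slab dump. `as_text` and `looks_like_netbios_name` are hypothetical helper names:

```python
# Values copied from the first oops in this report.
DAR = 0x20454A46  # faulting data address
NIP = 0xC0069924  # faulting instruction address, a normal kernel address


def as_text(addr):
    """Render a 32-bit value as 4 big-endian bytes, '.' for non-printables."""
    raw = addr.to_bytes(4, "big")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


def looks_like_netbios_name(addr):
    """Heuristic only: 0x20 length byte followed by letters 'A'..'P'."""
    raw = addr.to_bytes(4, "big")
    return raw[0] == 0x20 and all(0x41 <= b <= 0x50 for b in raw[1:])


if __name__ == "__main__":
    for name, value in (("DAR", DAR), ("NIP", NIP)):
        print(name, hex(value), repr(as_text(value)),
              looks_like_netbios_name(value))
```

The address decodes to " EJF", which passes the heuristic, while the instruction address does not. This is not proof, but it fits the picture of `put_page()` being handed a pointer taken from memory that had been overwritten with packet data.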
Re: [E1000-devel] Memory Corruption with e1000
On Wed, Jun 5, 2013 at 3:01 PM, Peter LaDow pet...@gocougs.wsu.edu wrote:
> After some more digging, I'm wondering if this is indeed a timing issue. Is there a problem with bringing up an interface too soon after taking it down? If I change my loop to use a 30 second delay between interface bringup/teardown, I don't get the panic.

Scratch that. A 30 second delay didn't eliminate the problem. It only delayed it. I finally got a similar failure. I further increased the time and got another failure, slightly different:

[ cut here ]
WARNING: at include/linux/skbuff.h:1468
Modules linked in:
NIP: c0219bf8 LR: c01abaec CTR: c01aba74
REGS: ed6dbcf0 TRAP: 0700 Not tainted (3.0.57-rt82)
MSR: 00029032 EE,ME,CE,IR,DR CR: 42048044 XER:
TASK = ed7afb60[3120] 'irq/20-eth2' THREAD: ed6da000
GPR00: 0001 ed6dbda0 ed7afb60 ed6c7800 0001 ed05e000 3b9ac9ff
GPR08: ed7afb60 c035 ed6dbce0 c03d2748 42048044 1001aa90 ed6dbe78 c0352c54
GPR16: c03f ed05e520 ed05e000 05f4 ef047000 ef047060 05f2
GPR24: ef078320 ef078320 00ba 00bc f3241740 ed05e3a0 ed6c7800 0001
NIP [c0219bf8] skb_trim+0x18/0x34
LR [c01abaec] e1000_alloc_rx_buffers+0x78/0x374
Call Trace:
[ed6dbda0] [ef078320] 0xef078320 (unreliable)
[ed6dbdf0] [c01ab714] e1000_clean_rx_irq+0x35c/0x3ac
[ed6dbe60] [c01ac2cc] e1000_clean+0x340/0x4ec
[ed6dbec0] [c022799c] net_rx_action+0xc4/0x208
[ed6dbef0] [c0023410] __do_softirq_common+0xa4/0x13c
[ed6dbf30] [c0023adc] local_bh_enable+0x88/0xe8
[ed6dbf50] [c0059acc] irq_forced_thread_fn+0x5c/0x74
[ed6dbf70] [c005a954] irq_thread+0xe4/0x1ec
[ed6dbfa0] [c0038ce4] kthread+0x78/0x7c
[ed6dbff0] [c000d608] kernel_thread+0x4c/0x68
Instruction dump:
7c1d492e 80010024 bba10014 38210020 7c0803a6 4e800020 8003004c 7f802040
4c9d0020 80030050 2f80 41be000c 0fe0 4e800020 800300a4 9083004c
---[ end trace 0002 ]---

Followed again by a bad paging request. I'm still at a loss to discover who is doing this corruption.

Pete
Re: [E1000-devel] Memory Corruption with e1000
On 6/5/13, Ronciak, John john.ronc...@intel.com wrote:
> So I have a couple of questions. Does this happen with a non-preemptive kernel? I understand that you probably need to use a preemptive kernel but for testing purposes it would be good to know. We don't always test with preemptive kernels.

Hmmm... If you mean no RT patches, then yes. On a vanilla 3.0.80 kernel.

> When doing the up/down transitions is there system under test? I mean sending and receiving packets? If it is what is the load like? Does changing the load make a difference? Does stopping the network traffic first make a difference in the outcome?

Yes, the load makes a difference. On a silent network (or no link at all) this does not occur. Our network is quite busy. It isn't sending much (perhaps DHCP discovers and some IPv6 stuff).

Thanks,
Pete
Re: [E1000-devel] Memory Corruption with e1000
Quick followup. What I meant by not sending much is the adapter, not the network. The network is very busy. However, there is hardly any outgoing traffic from the box.

On 6/5/13, Peter LaDow pet...@gocougs.wsu.edu wrote:
> On 6/5/13, Ronciak, John john.ronc...@intel.com wrote:
>> So I have a couple of questions. Does this happen with a non-preemptive kernel? I understand that you probably need to use a preemptive kernel but for testing purposes it would be good to know. We don't always test with preemptive kernels.
>
> Hmmm... If you mean no RT patches, then yes. On a vanilla 3.0.80 kernel.
>
>> When doing the up/down transitions is there system under test? I mean sending and receiving packets? If it is what is the load like? Does changing the load make a difference? Does stopping the network traffic first make a difference in the outcome?
>
> Yes, the load makes a difference. On a silent network (or no link at all) this does not occur. Our network is quite busy. It isn't sending much (perhaps DHCP discovers and some IPv6 stuff).
>
> Thanks,
> Pete