Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
reassign 671895 src:linux
thanks

On Tue, May 22, 2012 at 07:26:22PM -0300, gustavo panizzo gfa wrote:
> On Fri, May 11, 2012 at 11:04:22PM +0100, Jurij Smakov wrote:
> [snip]
> > Only two non-trivial things here: execution of ethtool_lite(if_name)
> > and invocation of arping. I would put my money on the former (defined
> > in ethtool-lite.c), because it uses low-level ioctls to query the
> > interface state. You can test whether running it would trigger a
> > failure on your machine by downloading ethtool-lite.c and building it
> > as a standalone binary; the following commands appear to do the trick:
> >
> > $ sudo apt-get build-dep netcfg
> > [...]
> > $ gcc -o ethtool-lite -DTEST ethtool-lite.c -ldebconfclient -ldebian-installer
> > $ sudo ./ethtool-lite eth0
> > ethtool-lite: eth0 is connected.
> > $
> >
> > If that triggers a null pointer exception on your machine (try it both
> > with and without network brought up and check dmesg afterwards), we
> > will be in a very good position to report it upstream for fixing.
>
> i cannot repeat the issue using ethtool-lite (or arping) while booting
> from disk
>
> i can repeat the issue booting from network (22/05/2012 image) running
> netcfg or udhcp
>
> also i can repeat the issue running
>
> ~ # ip link set dev eth0 up
>
> while the cable is plugged in, or running the command and plugging the
> cable later
>
> if i (after getting the netimage) remove the link on eth0 and plug
> eth1, installer works fine

Does this still occur with current kernels?

Cheers,
Moritz
Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
On Fri, May 11, 2012 at 11:04:22PM +0100, Jurij Smakov wrote:
[snip]
> Only two non-trivial things here: execution of ethtool_lite(if_name)
> and invocation of arping. I would put my money on the former (defined
> in ethtool-lite.c), because it uses low-level ioctls to query the
> interface state. You can test whether running it would trigger a
> failure on your machine by downloading ethtool-lite.c and building it
> as a standalone binary; the following commands appear to do the trick:
>
> $ sudo apt-get build-dep netcfg
> [...]
> $ gcc -o ethtool-lite -DTEST ethtool-lite.c -ldebconfclient -ldebian-installer
> $ sudo ./ethtool-lite eth0
> ethtool-lite: eth0 is connected.
> $
>
> If that triggers a null pointer exception on your machine (try it both
> with and without network brought up and check dmesg afterwards), we
> will be in a very good position to report it upstream for fixing.

i cannot repeat the issue using ethtool-lite (or arping) while booting
from disk

i can repeat the issue booting from network (22/05/2012 image) running
netcfg or udhcp

also i can repeat the issue running

~ # ip link set dev eth0 up

while the cable is plugged in, or running the command and plugging the
cable later

if i (after getting the netimage) remove the link on eth0 and plug
eth1, installer works fine

--
1AE0 322E B8F7 4717 BDEA BF1D 44BB 1BA7 9F6C 6333
Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
On Fri, 2012-05-11 at 12:25 -0300, gustavo panizzo wrote:
> adding debian-boot
>
> i've installed unstable on the box (using debootstrap) and it boots
> 3.2.0-2-sparc64 successfully, networking works
>
> obp diags shows no errors
>
> but when i boot from network using
> http://d-i.debian.org/daily-images/sparc/daily/netboot/boot.img 11-05-2012
> i get the following error
>
> Detecting link on eth0; please wait... 100%
>
> [ 246.994391] Unable to handle kernel NULL pointer dereference
> [ 247.074490] tsk->{mm,active_mm}->context = 019f
> [ 247.164534] tsk->{mm,active_mm}->pgd = f8001d48c000
> [ 247.240508] Kernel panic - not syncing: Aiee, killing interrupt handler!
> [ 247.328648] Call Trace:
> [ 247.360793] [0045dcd4] do_exit+0x94/0x708
> [ 247.423821] [00427550] die_if_kernel+0x2a0/0x2c8
> [ 247.494864] [00768c84] unhandled_fault+0x8c/0x98
> [ 247.565915] [0076936c] do_sparc64_fault+0x6dc/0x780
> [ 247.640377] [00407880] sparc64_realfault_common+0x10/0x20
> [ 247.721722] [10015680] gem_poll+0x9fc/0x1328 [sungem]
[...]

This means we crashed:

    static __inline__ void gem_tx(struct net_device *dev, struct gem *gp, u32 gem_status)
    {
            int entry, limit;

            entry = gp->tx_old;
            limit = ((gem_status & GREG_STAT_TXNR) >> GREG_STAT_TXNR_SHIFT);
            while (entry != limit) {
                    struct sk_buff *skb;
                    struct gem_txd *txd;
                    dma_addr_t dma_addr;
                    u32 dma_len;
                    int frag;

                    if (netif_msg_tx_done(gp))
                            printk(KERN_DEBUG "%s: tx done, slot %d\n",
                                   gp->dev->name, entry);
                    skb = gp->tx_skbs[entry];
                    if (skb_shinfo(skb)->nr_frags) {

right here, while evaluating skb_shinfo(skb). Which probably means skb
was null.

This *could* be due to broken hardware telling us that more packets were
sent than we actually queued, but probably not since 'networking works'
when not using netboot.

Is the driver successfully resetting the network controller while
net-booting? It can time out and will then log "SW reset is ghetto" but
will *not* abort initialisation.

Ben.

--
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed.
                                                    - Carolyn Scheppner
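The failure mode Ben describes can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the sungem driver: the ring size and the names fake_skb, fake_ring and reap_tx are all hypothetical. It models a TX ring whose slots are only filled when a packet is queued, and a reaping loop that walks from tx_old to a completion index supplied by "hardware". If that index runs ahead of what was actually queued, the loop reaches an empty slot. The NULL check is there to make the condition observable; it is not offered as a fix for the driver.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8  /* hypothetical; must be a power of two */

struct fake_skb { int nr_frags; };

struct fake_ring {
    struct fake_skb *slots[RING_SIZE]; /* NULL means nothing queued here */
    int tx_old;                        /* next slot to reap */
};

/*
 * Reap completed slots from ring->tx_old up to (not including) limit,
 * where limit stands in for the completion index reported by hardware.
 * Returns the number of slots reaped, or -1 if an empty (NULL) slot is
 * reached, which is where an unguarded loop would dereference NULL.
 */
static int reap_tx(struct fake_ring *ring, int limit)
{
    int reaped = 0;
    int entry = ring->tx_old;

    assert(limit >= 0 && limit < RING_SIZE);
    while (entry != limit) {
        struct fake_skb *skb = ring->slots[entry];

        if (skb == NULL) {
            /* Completion index ran ahead of what was queued. */
            ring->tx_old = entry;
            return -1;
        }
        ring->slots[entry] = NULL;              /* slot is free again */
        entry = (entry + 1) & (RING_SIZE - 1);  /* wrap around the ring */
        reaped++;
    }
    ring->tx_old = entry;
    return reaped;
}
```

With two packets queued in slots 0 and 1, reap_tx(&ring, 2) reaps both. If the index claims more completions than were queued, the function stops at the first empty slot and returns -1, which corresponds to the point where the real loop would fault.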
Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
adding debian-boot

i've installed unstable on the box (using debootstrap) and it boots
3.2.0-2-sparc64 successfully, networking works

obp diags shows no errors

but when i boot from network using
http://d-i.debian.org/daily-images/sparc/daily/netboot/boot.img 11-05-2012
i get the following error

Detecting link on eth0; please wait... 100%

[ 246.994391] Unable to handle kernel NULL pointer dereference
[ 247.074490] tsk->{mm,active_mm}->context = 019f
[ 247.164534] tsk->{mm,active_mm}->pgd = f8001d48c000
[ 247.240508] Kernel panic - not syncing: Aiee, killing interrupt handler!
[ 247.328648] Call Trace:
[ 247.360793] [0045dcd4] do_exit+0x94/0x708
[ 247.423821] [00427550] die_if_kernel+0x2a0/0x2c8
[ 247.494864] [00768c84] unhandled_fault+0x8c/0x98
[ 247.565915] [0076936c] do_sparc64_fault+0x6dc/0x780
[ 247.640377] [00407880] sparc64_realfault_common+0x10/0x20
[ 247.721722] [10015680] gem_poll+0x9fc/0x1328 [sungem]
[ 247.798478] [00697110] net_rx_action+0x9c/0x234
[ 247.868369] [004607f0] __do_softirq+0xdc/0x1c4
[ 247.937125] [0042a76c] do_softirq+0x54/0x80
[ 248.002442] [00460a6c] irq_exit+0x38/0x94
[ 248.065474] [0042df38] timer_interrupt+0x90/0xa8
[ 248.136516] [004209d4] tl0_irq14+0x14/0x20
[ 248.200692] [0049e764] touch_softlockup_watchdog+0x4/0xc
[ 248.280888] [008f07e4] start_kernel+0x390/0x3a0
[ 248.350783] [00750b88] tlb_fixup_done+0x80/0x88
[ 248.420672] [] (null)
[ 248.481416] Press Stop-A (L1-A) to return to the boot prom

--
1AE0 322E B8F7 4717 BDEA BF1D 44BB 1BA7 9F6C 6333
Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
On Fri, May 11, 2012 at 12:25:01PM -0300, gustavo panizzo gfa wrote:
> adding debian-boot
>
> i've installed unstable on the box (using debootstrap) and it boots
> 3.2.0-2-sparc64 successfully, networking works
>
> obp diags shows no errors
>
> but when i boot from network using
> http://d-i.debian.org/daily-images/sparc/daily/netboot/boot.img 11-05-2012
> i get the following error
>
> Detecting link on eth0; please wait... 100%
>
> [ 246.994391] Unable to handle kernel NULL pointer dereference
> [ 247.074490] tsk->{mm,active_mm}->context = 019f
> [ 247.164534] tsk->{mm,active_mm}->pgd = f8001d48c000
> [ 247.240508] Kernel panic - not syncing: Aiee, killing interrupt handler!
> [ 247.328648] Call Trace:
> [ 247.360793] [0045dcd4] do_exit+0x94/0x708
> [ 247.423821] [00427550] die_if_kernel+0x2a0/0x2c8
> [ 247.494864] [00768c84] unhandled_fault+0x8c/0x98
> [ 247.565915] [0076936c] do_sparc64_fault+0x6dc/0x780
> [ 247.640377] [00407880] sparc64_realfault_common+0x10/0x20
> [ 247.721722] [10015680] gem_poll+0x9fc/0x1328 [sungem]
> [ 247.798478] [00697110] net_rx_action+0x9c/0x234
> [ 247.868369] [004607f0] __do_softirq+0xdc/0x1c4
> [ 247.937125] [0042a76c] do_softirq+0x54/0x80
> [ 248.002442] [00460a6c] irq_exit+0x38/0x94
> [ 248.065474] [0042df38] timer_interrupt+0x90/0xa8
> [ 248.136516] [004209d4] tl0_irq14+0x14/0x20
> [ 248.200692] [0049e764] touch_softlockup_watchdog+0x4/0xc
> [ 248.280888] [008f07e4] start_kernel+0x390/0x3a0
> [ 248.350783] [00750b88] tlb_fixup_done+0x80/0x88
> [ 248.420672] [] (null)
> [ 248.481416] Press Stop-A (L1-A) to return to the boot prom

Interesting, so we are doing something funky during link detection to
trip this bug. The code which does it is in netcfg:

http://anonscm.debian.org/gitweb/?p=d-i/netcfg.git;a=tree

Here's the relevant code from netcfg-common.c:

    debconf_capb(client, "progresscancel");
    debconf_subst(client, "netcfg/link_detect_progress", "interface", if_name);
    debconf_progress_start(client, 0, 100, "netcfg/link_detect_progress");
    for (count = 0; count < link_waits; count++) {
        usleep(25);
        if (debconf_progress_set(client, 50 * count / link_waits) == 30) {
            /* User cancelled on us... bugger */
            rv = 0;
            break;
        }
        if (ethtool_lite(if_name) == 1) /* ethtool-lite's CONNECTED */ {
            if (gateway.s_addr && !is_wireless_iface(if_name)) {
                for (count = 0; count < gw_tries; count++) {
                    if (di_exec_shell_log(arping) == 0)
                        break;
                    if (debconf_progress_set(client, 50 + 50 * count / gw_tries) == 30)
                        break;
                }
            }
            rv = 1;
            break;
        }
        debconf_progress_set(client, 100);
    }

Only two non-trivial things here: execution of ethtool_lite(if_name)
and invocation of arping. I would put my money on the former (defined
in ethtool-lite.c), because it uses low-level ioctls to query the
interface state. You can test whether running it would trigger a
failure on your machine by downloading ethtool-lite.c and building it
as a standalone binary; the following commands appear to do the trick:

$ sudo apt-get build-dep netcfg
[...]
$ gcc -o ethtool-lite -DTEST ethtool-lite.c -ldebconfclient -ldebian-installer
$ sudo ./ethtool-lite eth0
ethtool-lite: eth0 is connected.
$

If that triggers a null pointer exception on your machine (try it both
with and without network brought up and check dmesg afterwards), we
will be in a very good position to report it upstream for fixing.

Best regards,
--
Jurij Smakov ju...@wooyd.org
Key: http://www.wooyd.org/pgpkey/ KeyID: C99E03CC
Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
Jurij Smakov ju...@wooyd.org wrote:
> If that triggers a null pointer exception on your machine (try it both
> with and without network brought up and check dmesg afterwards), we
> will be in a very good position to report it upstream for fixing.

i will be checking it next week
Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
> Interesting. How does a 3.2.y kernel behave with the ancient gentoo
> userland? (Perhaps this is what you are planning to try later.)

Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x0007 (7)
        Link detected: yes

kernel is 3.2.15, built from the apt-get linux-source-3.2 package;
config is the same gentoo config

i cannot get linux-image-3.2.0-2-sparc64_3.2.16-1_sparc to boot because
it is unable to mount the root fs

i see these errors in the kernel log

[ 52.363317] sun_esp: Unknown symbol scsi_esp_register (err 0)
[ 52.439003] sun_esp: Unknown symbol scsi_esp_intr (err 0)
[ 52.509998] sun_esp: Unknown symbol scsi_host_put (err 0)
[ 52.581304] sun_esp: Unknown symbol scsi_esp_template (err 0)
[ 52.656890] sun_esp: Unknown symbol scsi_esp_unregister (err 0)
[ 52.734804] sun_esp: Unknown symbol scsi_esp_cmd (err 0)
[ 52.804672] sun_esp: Unknown symbol scsi_host_alloc (err 0)
[ 53.004224] SCSI subsystem initialized

i will continue to experiment with this kernel (hopefully debootstrap
will finish soon)

--
1AE0 322E B8F7 4717 BDEA BF1D 44BB 1BA7 9F6C 6333
Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
Hi Gustavo,

gustavo panizzo wrote:
> i can get the nic to work using latest linus tree + ancient gentoo
> userland (udev 124), but it is running at 10Mb/s half duplex
>
> 3.4.0-rc6+
>
> Settings for eth0:
>         Supported ports: [ TP MII ]
>         Supported link modes:   10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
> [...]
>         Advertised auto-negotiation: Yes
>         Speed: 10Mb/s
>         Duplex: Half
> [...]
>         Auto-negotiation: off
> [...]
>
> while 2.6.28 runs at 100Mb/s full duplex
>
> Settings for eth0:
>         Supported ports: [ TP MII ]
>         Supported link modes:   10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
> [...]
>         Advertised auto-negotiation: No
>         Speed: 100Mb/s
>         Duplex: Full
> [...]
>         Auto-negotiation: on
> [...]
>
> i will try later with kernel from d-i or testing, but i think this
> should go upstream

Interesting. How does a 3.2.y kernel behave with the ancient gentoo
userland? (Perhaps this is what you are planning to try later.)