Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Thu, Jan 21, 2016 at 12:30:48PM +, Joao Martins wrote:
> On 01/20/2016 09:59 PM, Konrad Rzeszutek Wilk wrote:
> > On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
> > > [...]
> >
> > libvirt naturally does some libxl calls, and they may be different.
> >
> > Any chance you could give me an idea of:
> >  - What commands you use in libvirt?
> >  - Do you use a bond or bridge?
> >  - What version of libvirt you are using?
> >
> > Thanks!
> > CC-ing Joao just in case he has seen this.
>
> Hm, so far I couldn't reproduce the issue with upstream Xen/linux/libvirt,
> using both libvirt or plain xl (both on a bridge setup), and also
> irrespective of the load and direction of traffic (be it a ping flood,
> pktgen with min. sized packets, or iperf).

I've run the test again, on vanilla 4.4, and collected some info:
 - xenstore dump of frontend (xs-frontend-before.txt)
 - xenstore dump of backend (xs-backend-before.txt)
 - kernel messages (console output) (console.log)
 - kernel config (config-4.4)
 - libvirt config of that domain (netdebug.conf)

Versions:
 - kernel 4.4 (frontend), 4.2.8 (backend)
 - libvirt 1.2.20
 - xen 4.6.0

In the backend domain there is no bridge or anything like that - only
routing. The same in the frontend - nothing fancy - just an IP set on eth0
there.

Steps to reproduce were the same:
 - start the frontend domain (virsh create ...)
 - call ping -f
 - xl network-detach NAME 0

Note that the crash doesn't happen with the attached patch applied (as
noted in the mail on Oct 21), but I have no idea whether it is a proper
fix, or whether it just prevents the crash by coincidence.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

[0.00] x86/PAT: Configuration [0-7]: WB WT UC- UC WC WP UC UC
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.4.0-1.pvops.qubes.x86_64 (user@devel-3rdparty) (gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC) ) #20 SMP Fri Jan 22 00:39:29 CET 2016
[0.00] Command line: root=/dev/mapper/dmroot ro nomodeset console=hvc0 rd_NO_PLYMOUTH 3 rd.break
[0.00] x86/fpu: Legacy x87 FPU detected.
[0.00] x86/fpu: Using 'lazy' FPU context switches.
[0.00] ACPI in unprivileged domain disabled
[0.00] Released 0 page(s)
[0.00] e820: BIOS-provided physical RAM map:
[0.00] Xen: [mem 0x-0x0009] usable
[0.00] Xen: [mem 0x000a-0x000f] reserved
[0.00] Xen: [mem 0x0010-0xf9ff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] DMI not present or invalid.
[0.00] Hypervisor detected: Xen
[0.00] e820: last_pfn = 0xfa000 max_arch_pfn = 0x4
[0.00] MTRR: Disabled
[0.00] RAMDISK: [mem 0x0203-0x027c6fff]
[0.00] NUMA turned off
[0.00] Faking a node at [mem 0x-0xf9ff]
[0.00] NODE_DATA(0) allocated [mem 0x18837000-0x1
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On 01/20/2016 09:59 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
> > [...]
> >
> > But I don't see anything special in the domU config file (neither backend
> > nor frontend) - it may be some libvirt default. If that's really the
> > cause. Can I (and how) get any useful information about that?
>
> libvirt naturally does some libxl calls, and they may be different.
>
> Any chance you could give me an idea of:
>  - What commands you use in libvirt?
>  - Do you use a bond or bridge?
>  - What version of libvirt you are using?
>
> Thanks!
> CC-ing Joao just in case he has seen this.

Hm, so far I couldn't reproduce the issue with upstream Xen/linux/libvirt,
using both libvirt or plain xl (both on a bridge setup), and also
irrespective of the load and direction of traffic (be it a ping flood,
pktgen with min. sized packets, or iperf).

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> > [...]
> > I tried to reproduce it and couldn't see it. Is your VM a PV or HVM?
>
> PV, started by libvirt. This may have something to do with it; the problem
> didn't exist on older Xen (4.1) started by xl. I'm not sure about the
> kernel version there, but I think I've tried 3.18 there too, which has
> this problem.
>
> But I don't see anything special in the domU config file (neither backend
> nor frontend) - it may be some libvirt default. If that's really the
> cause. Can I (and how) get any useful information about that?

libvirt naturally does some libxl calls, and they may be different.

Any chance you could give me an idea of:
 - What commands you use in libvirt?
 - Do you use a bond or bridge?
 - What version of libvirt you are using?

Thanks!
CC-ing Joao just in case he has seen this.
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> > [...]
>
> Do you see this all the time or just on occasions?

Using the above procedure - all the time.

> I tried to reproduce it and couldn't see it. Is your VM a PV or HVM?

PV, started by libvirt. This may have something to do with it; the problem
didn't exist on older Xen (4.1) started by xl. I'm not sure about the
kernel version there, but I think I've tried 3.18 there too, which has this
problem.

But I don't see anything special in the domU config file (neither backend
nor frontend) - it may be some libvirt default. If that's really the cause.
Can I (and how) get any useful information about that?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> [...]
> > > > > Steps to reproduce:
> > > > > 1. Start the domU with some network interface
> > > > > 2. Call there 'ping -f some-IP'
> > > > > 3. Call 'xl network-detach NAME 0'

Do you see this all the time or just on occasions?

I tried to reproduce it and couldn't see it. Is your VM a PV or HVM?
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On 21/10/15 19:57, Marek Marczykowski-Górecki wrote:
>
> Any ideas?

No, sorry.  Netfront looks correct to me.  We take an additional ref for
the ref released by gnttab_release_grant_reference().  The get_page() here
is safe since we haven't freed the page yet (this is done in the subsequent
call to dev_kfree_skb_irq()).  get_page()/put_page() also look fine when
used with tail pages.

David
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> > [...]
>
> Finally I've found some more time to debug this... All tests redone on
> v4.3-rc6 frontend and 3.18.17 backend.
>
> Looking at xennet_tx_buf_gc(), I have the impression that the shared page
> (queue->grant_tx_page[id]) is/should be freed by some other means than
> (indirectly) calling free_page via gnttab_end_foreign_access. Maybe the
> bug is that the page _is_ actually freed somewhere else already? At least
> changing gnttab_end_foreign_access to gnttab_end_foreign_access_ref makes
> the crash go away.
>
> Note that both fragments have dev_kfree_skb_irq, but the former uses
> gnttab_end_foreign_access_ref, while the latter uses
> gnttab_end_foreign_access. Also note that the crash is in
> gnttab_end_foreign_access, so before dev_kfree_skb_irq. If that were a
> double free, I'd expect a crash in the latter.
>
> This change was introduced by cefe007 "xen-netfront: fix resource leak in
> netfront". I'm not sure whether changing gnttab_end_foreign_access back to
> gnttab_end_foreign_access_ref would (re)introduce some memory leak.
>
> Let me paste again the error message:
> [   73.718636] page:ea43b1c0 count:0 mapcount:0 mapping:          (null) index:0x0
> [   73.718661] flags: 0x3ffc008000(tail)
> [   73.718684] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
> [   73.718725] [ cut here ]
> [   73.718743] kernel BUG at include/linux/mm.h:338!
>
> Also it all looks quite strange - there is a get_page() call just before
> gnttab_end_foreign_access, but page->_count is still 0. Maybe it has
> something to do with how get_page() works on "tail" pages (whatever that
> means)?
>
> [...]
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > [...]
> > > Steps to reproduce:
> > > 1. Start the domU with some network interface
> > > 2. Call there 'ping -f some-IP'
> > > 3. Call 'xl network-detach NAME 0'
> >
> > There's a use-after-free in xennet_remove().  Does this patch fix it?
>
> Unfortunately not. Note that the crash is in xennet_disconnect_backend,
> which is called before xennet_destroy_queues in xennet_remove.
> I've tried to add napi_disable and even netif_napi_del just after
> napi_synchronize in xennet_disconnect_backend (which would probably
> cause a crash when trying to clean up the same thing later again), but it
> doesn't help - the crash is the same (still in gnttab_end_foreign_access
> called from xennet_disconnect_backend).

Finally I've found some more time to debug this... All tests redone on
v4.3-rc6 frontend and 3.18.17 backend.

Looking at xennet_tx_buf_gc(), I have the impression that the shared page
(queue->grant_tx_page[id]) is/should be freed by some other means than
(indirectly) calling free_page via gnttab_end_foreign_access. Maybe the bug
is that the page _is_ actually freed somewhere else already? At least
changing gnttab_end_foreign_access to gnttab_end_foreign_access_ref makes
the crash go away.

Relevant xennet_tx_buf_gc fragment:

		gnttab_end_foreign_access_ref(
			queue->grant_tx_ref[id], GNTMAP_readonly);
		gnttab_release_grant_reference(
			&queue->gref_tx_head, queue->grant_tx_ref[id]);
		queue->grant_tx_ref[id] = GRANT_INVALID_REF;
		queue->grant_tx_page[id] = NULL;
		add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, id);
		dev_kfree_skb_irq(skb);

And a similar fragment from xennet_release_tx_bufs:

		get_page(queue->grant_tx_page[i]);
		gnttab_end_foreign_access(queue->grant_tx_ref[i],
					  GNTMAP_readonly,
					  (unsigned long)page_address(queue->grant_tx_page[i]));
		queue->grant_tx_page[i] = NULL;
		queue->grant_tx_ref[i] = GRANT_INVALID_REF;
		add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
		dev_kfree_skb_irq(skb);

Note that both have dev_kfree_skb_irq, but the former uses
gnttab_end_foreign_access_ref, while the latter uses
gnttab_end_foreign_access. Also note that the crash is in
gnttab_end_foreign_access, so before dev_kfree_skb_irq. If that were a
double free, I'd expect a crash in the latter.

This change was introduced by cefe007 "xen-netfront: fix resource leak in
netfront". I'm not sure whether changing gnttab_end_foreign_access back to
gnttab_end_foreign_access_ref would (re)introduce some memory leak.

Let me paste again the error message:

[   73.718636] page:ea43b1c0 count:0 mapcount:0 mapping:          (null) index:0x0
[   73.718661] flags: 0x3ffc008000(tail)
[   73.718684] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[   73.718725] [ cut here ]
[   73.718743] kernel BUG at include/linux/mm.h:338!

Also it all looks quite strange - there is a get_page() call just before
gnttab_end_foreign_access, but page->_count is still 0. Maybe it has
something to do with how get_page() works on "tail" pages (whatever that
means)?

static inline void get_page(struct page *page)
{
	if (unlikely(PageTail(page)))
		if (likely(__get_page_tail(page)))
			return;
	/*
	 * Getting a normal page or the head of a compound page
	 * requires to already have an elevated page->_count.
	 */
	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
	atomic_inc(&page->_count);
}

which (I think) ends up in:

static inline void __get_page_tail_foll(struct page *page,
					bool get_page_head)
{
	/*
	 * If we're getting a tail page, the elevated page->_count is
	 * required only in the head page and we will elevate the head
	 * page->_count and tail page->_mapcount.
	 *
	 * We elevate page_tail->_mapcount for tail pages to force
	 * page_tail->_count to be zero at all times to avoid getting
	 * false positives from get_page_unless_zero() with
	 * speculative page access (like in
	 * page_cache_get_speculative()) on tail pages.
	 */
	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
	if (get_page_head)
		atomic_inc(&page->first_page->_c
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > [...]
> > Steps to reproduce:
> > 1. Start the domU with some network interface
> > 2. Call there 'ping -f some-IP'
> > 3. Call 'xl network-detach NAME 0'
>
> There's a use-after-free in xennet_remove().  Does this patch fix it?

Unfortunately not. Note that the crash is in xennet_disconnect_backend,
which is called before xennet_destroy_queues in xennet_remove.
I've tried to add napi_disable and even netif_napi_del just after
napi_synchronize in xennet_disconnect_backend (which would probably
cause a crash when trying to clean up the same thing later again), but it
doesn't help - the crash is the same (still in gnttab_end_foreign_access
called from xennet_disconnect_backend).

> 8<
> xen-netfront: properly destroy queues when removing device
>
> xennet_remove() freed the queues before freeing the netdevice, which
> results in a use-after-free when free_netdev() tries to delete the
> napi instances that have already been freed.
>
> Fix this by fully destroying the queues (which includes deleting the
> napi instances) before freeing the netdevice.
>
> Reported-by: Marek Marczykowski
> Signed-off-by: David Vrabel
> ---
>  drivers/net/xen-netfront.c | 15 ++-
>  1 file changed, 2 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 3f45afd..e031c94 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -1698,6 +1698,7 @@ static void xennet_destroy_queues(struct netfront_info *info)
>
>  		if (netif_running(info->netdev))
>  			napi_disable(&queue->napi);
> +		del_timer_sync(&queue->rx_refill_timer);
>  		netif_napi_del(&queue->napi);
>  	}
>
> @@ -2102,9 +2103,6 @@ static const struct attribute_group xennet_dev_group = {
>  static int xennet_remove(struct xenbus_device *dev)
>  {
>  	struct netfront_info *info = dev_get_drvdata(&dev->dev);
> -	unsigned int num_queues = info->netdev->real_num_tx_queues;
> -	struct netfront_queue *queue = NULL;
> -	unsigned int i = 0;
>
>  	dev_dbg(&dev->dev, "%s\n", dev->nodename);
>
> @@ -2112,16 +2110,7 @@ static int xennet_remove(struct xenbus_device *dev)
>
>  	unregister_netdev(info->netdev);
>
> -	for (i = 0; i < num_queues; ++i) {
> -		queue = &info->queues[i];
> -		del_timer_sync(&queue->rx_refill_timer);
> -	}
> -
> -	if (num_queues) {
> -		kfree(info->queues);
> -		info->queues = NULL;
> -	}
> -
> +	xennet_destroy_queues(info);
>  	xennet_free_netdev(info->netdev);
>
>  	return 0;

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> Hi all,
>
> I'm experiencing a xen-netfront crash when doing xl network-detach while
> some network activity is going on at the same time. It happens only when
> the domU has more than one vcpu. Not sure if this matters, but the backend
> is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> 3.9.4 and 4.1-rc1 as well.
>
> Steps to reproduce:
> 1. Start the domU with some network interface
> 2. Call there 'ping -f some-IP'
> 3. Call 'xl network-detach NAME 0'

There's a use-after-free in xennet_remove().  Does this patch fix it?

8<

xen-netfront: properly destroy queues when removing device

xennet_remove() freed the queues before freeing the netdevice, which
results in a use-after-free when free_netdev() tries to delete the
napi instances that have already been freed.

Fix this by fully destroying the queues (which includes deleting the
napi instances) before freeing the netdevice.

Reported-by: Marek Marczykowski
Signed-off-by: David Vrabel
---
 drivers/net/xen-netfront.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 3f45afd..e031c94 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1698,6 +1698,7 @@ static void xennet_destroy_queues(struct netfront_info *info)
 
 		if (netif_running(info->netdev))
 			napi_disable(&queue->napi);
+		del_timer_sync(&queue->rx_refill_timer);
 		netif_napi_del(&queue->napi);
 	}
 
@@ -2102,9 +2103,6 @@ static const struct attribute_group xennet_dev_group = {
 static int xennet_remove(struct xenbus_device *dev)
 {
 	struct netfront_info *info = dev_get_drvdata(&dev->dev);
-	unsigned int num_queues = info->netdev->real_num_tx_queues;
-	struct netfront_queue *queue = NULL;
-	unsigned int i = 0;
 
 	dev_dbg(&dev->dev, "%s\n", dev->nodename);
 
@@ -2112,16 +2110,7 @@ static int xennet_remove(struct xenbus_device *dev)
 
 	unregister_netdev(info->netdev);
 
-	for (i = 0; i < num_queues; ++i) {
-		queue = &info->queues[i];
-		del_timer_sync(&queue->rx_refill_timer);
-	}
-
-	if (num_queues) {
-		kfree(info->queues);
-		info->queues = NULL;
-	}
-
+	xennet_destroy_queues(info);
 	xennet_free_netdev(info->netdev);
 
 	return 0;
-- 
1.7.10.4
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Fri, May 22, 2015 at 05:58:41PM +0100, David Vrabel wrote:
> On 22/05/15 17:42, Marek Marczykowski-Górecki wrote:
> > [...]
> > Backend is 3.19.6 here. I don't see any changes there between rc1 and
> > rc4, so stayed with rc1. With a 4.1-rc1 backend it also crashes for me.
>
> Doesn't repro for me with 4 VCPU PV or PVHVM guests.

I've tried with exactly 2 vcpus in the frontend domU (PV), but I guess it
shouldn't matter. Backend is also PV.

> Is your guest kernel vanilla or does it have some qubes specific patches
> on top?

This one was from vanilla - both frontend and backend (just qubes config).
Maybe something about device configuration?

Here is a xenstore dump:

frontend:
0 = ""
 backend = "/local/domain/66/backend/vif/69/0"
 backend-id = "66"
 state = "4"
 handle = "0"
 mac = "00:16:3e:5e:6c:07"
 multi-queue-num-queues = "2"
 queue-0 = ""
  tx-ring-ref = "1280"
  rx-ring-ref = "1281"
  event-channel-tx = "19"
  event-channel-rx = "20"
 queue-1 = ""
  tx-ring-ref = "1282"
  rx-ring-ref = "1283"
  event-channel-tx = "21"
  event-channel-rx = "22"
 request-rx-copy = "1"
 feature-rx-notify = "1"
 feature-sg = "1"
 feature-gso-tcpv4 = "1"
 feature-gso-tcpv6 = "1"
 feature-ipv6-csum-offload = "1"

backend:
69 = ""
 0 = ""
  frontend = "/local/domain/69/device/vif/0"
  frontend-id = "69"
  online = "1"
  state = "4"
  script = "/etc/xen/scripts/vif-route-qubes"
  mac = "00:16:3e:5e:6c:07"
  ip = "10.137.3.9"
  handle = "0"
  type = "vif"
  feature-sg = "1"
  feature-gso-tcpv4 = "1"
  feature-gso-tcpv6 = "1"
  feature-ipv6-csum-offload = "1"
  feature-rx-copy = "1"
  feature-rx-flip = "0"
  feature-split-event-channels = "1"
  multi-queue-max-queues = "2"
  hotplug-status = "connected"

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On 22/05/15 17:42, Marek Marczykowski-Górecki wrote:
> On Fri, May 22, 2015 at 05:25:44PM +0100, David Vrabel wrote:
> > [...]
> > I tried this about 10 times without a crash.  How reproducible is it?
> >
> > I used a 4.1-rc4 frontend and a 4.0 backend.
>
> It happens every time for me... Do you have at least two vcpus in that
> domU? With one vcpu it doesn't crash. The IP I've used for ping is one in
> the backend domU, but it shouldn't matter.
>
> Backend is 3.19.6 here. I don't see any changes there between rc1 and
> rc4, so stayed with rc1. With a 4.1-rc1 backend it also crashes for me.

Doesn't repro for me with 4 VCPU PV or PVHVM guests.  Is your guest kernel
vanilla or does it have some qubes specific patches on top?

David
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On Fri, May 22, 2015 at 05:25:44PM +0100, David Vrabel wrote:
> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > Hi all,
> >
> > I'm experiencing xen-netfront crash when doing xl network-detach while
> > some network activity is going on at the same time. It happens only when
> > domU has more than one vcpu. Not sure if this matters, but the backend
> > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > 3.9.4 and 4.1-rc1 as well.
> >
> > Steps to reproduce:
> > 1. Start the domU with some network interface
> > 2. Call there 'ping -f some-IP'
> > 3. Call 'xl network-detach NAME 0'
>
> I tried this about 10 times without a crash. How reproducible is it?
>
> I used a 4.1-rc4 frontend and a 4.0 backend.

It happens every time for me... Do you have at least two vcpus in that
domU? With one vcpu it doesn't crash. The IP for ping I've used one in
backend domU, but it shouldn't matter.

Backend is 3.19.6 here. I don't see any changes there between rc1 and
rc4, so stayed with rc1. With 4.1-rc1 backend it also crashes for me.
Re: [Xen-devel] xen-netfront crash when detaching network while some network activity
On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> Hi all,
>
> I'm experiencing xen-netfront crash when doing xl network-detach while
> some network activity is going on at the same time. It happens only when
> domU has more than one vcpu. Not sure if this matters, but the backend
> is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> 3.9.4 and 4.1-rc1 as well.
>
> Steps to reproduce:
> 1. Start the domU with some network interface
> 2. Call there 'ping -f some-IP'
> 3. Call 'xl network-detach NAME 0'

I tried this about 10 times without a crash. How reproducible is it?

I used a 4.1-rc4 frontend and a 4.0 backend.

David
[Xen-devel] xen-netfront crash when detaching network while some network activity
Hi all,

I'm experiencing xen-netfront crash when doing xl network-detach while
some network activity is going on at the same time. It happens only when
domU has more than one vcpu. Not sure if this matters, but the backend
is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
3.9.4 and 4.1-rc1 as well.

Steps to reproduce:
1. Start the domU with some network interface
2. Call there 'ping -f some-IP'
3. Call 'xl network-detach NAME 0'

The crash message:
[   54.163670] page:ea4bddc0 count:0 mapcount:0 mapping: (null) index:0x0
[   54.163692] flags: 0x3fff808000(tail)
[   54.163704] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[   54.163726] [ cut here ]
[   54.163734] kernel BUG at include/linux/mm.h:343!
[   54.163742] invalid opcode: [#1] SMP
[   54.163752] Modules linked in:
[   54.163762] CPU: 1 PID: 24 Comm: xenwatch Not tainted 4.1.0-rc1-1.pvops.qubes.x86_64 #4
[   54.163773] task: 8800133c4c00 ti: 880012c94000 task.ti: 880012c94000
[   54.163782] RIP: e030:[] [] __free_pages+0x4c/0x50
[   54.163800] RSP: e02b:880012c97be8 EFLAGS: 00010292
[   54.163808] RAX: 0044 RBX: 77ff8000 RCX: 0044
[   54.163817] RDX: RSI: RDI: 880013d0ea00
[   54.163826] RBP: 880012c97be8 R08: 00f2 R09:
[   54.163835] R10: 00f2 R11: 8185efc0 R12:
[   54.163844] R13: 880011814200 R14: 880012f77000 R15: 0004
[   54.163860] FS: 7f735f0d8740() GS:880013d0() knlGS:
[   54.163870] CS: e033 DS: ES: CR0: 8005003b
[   54.163878] CR2: 01652c50 CR3: 12112000 CR4: 2660
[   54.163892] Stack:
[   54.163901]  880012c97c08 81184430 0011 0004
[   54.163922]  880012c97c38 814100c6 87ff 880011f20d88
[   54.163943]  880011814200 880011f2 880012c97ca8 814d34e6
[   54.163964] Call Trace:
[   54.163977]  [] free_pages+0x60/0x70
[   54.163994]  [] gnttab_end_foreign_access+0x136/0x170
[   54.164012]  [] xennet_disconnect_backend.isra.24+0x166/0x390
[   54.164030]  [] xennet_remove+0x38/0xd0
[   54.164045]  [] xenbus_dev_remove+0x59/0xc0
[   54.164059]  [] __device_release_driver+0x87/0x120
[   54.164528]  [] device_release_driver+0x23/0x30
[   54.164528]  [] bus_remove_device+0x108/0x180
[   54.164528]  [] device_del+0x141/0x270
[   54.164528]  [] ? unregister_xenbus_watch+0x1d0/0x1d0
[   54.164528]  [] device_unregister+0x22/0x80
[   54.164528]  [] xenbus_dev_changed+0xaf/0x200
[   54.164528]  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
[   54.164528]  [] ? unregister_xenbus_watch+0x1d0/0x1d0
[   54.164528]  [] frontend_changed+0x29/0x60
[   54.164528]  [] ? unregister_xenbus_watch+0x1d0/0x1d0
[   54.164528]  [] xenwatch_thread+0x8e/0x150
[   54.164528]  [] ? wait_woken+0x90/0x90
[   54.164528]  [] kthread+0xd8/0xf0
[   54.164528]  [] ? kthread_create_on_node+0x1b0/0x1b0
[   54.164528]  [] ret_from_fork+0x42/0x70
[   54.164528]  [] ? kthread_create_on_node+0x1b0/0x1b0
[   54.164528] Code: f6 74 0c e8 67 f5 ff ff 5d c3 0f 1f 44 00 00 31 f6 e8 99 fd ff ff 5d c3 0f 1f 80 00 00 00 00 48 c7 c6 78 29 a1 81 e8 d4 37 02 00 <0f> 0b 66 90 66 66 66 66 90 48 85 ff 75 06 f3 c3 0f 1f 40 00 55
[   54.164528] RIP [] __free_pages+0x4c/0x50
[   54.164528] RSP
[   54.166002] ---[ end trace 6b847bc27fec6d36 ]---

Any ideas how to fix this? I guess xennet_disconnect_backend should take
some lock.