On Tue, Jan 5, 2016 at 12:56 PM, Jesse Gross <[email protected]> wrote:
> OK, thanks for looking some more. I agree that it seems we are getting > NULL back from ovs_internal_dev_get_vport(). I think it is still most > likely that ovs_netdev_get_vport() is returning NULL. Since you are > running 3.10, the condition that this might happen is if > dev->priv_flags does not have IFF_OVS_DATAPATH. > Ack! > > Either this or the alternative (!ovs_is_internal_dev()) should never > happen unless there is a bug somewhere else. As a result, I don't > really want to just have a NULL check in internal_dev_get_stats() > since it will just paper over the problem. > > Understood! Yeah, defensive coding is not the answer for sure here. > If I'm reading your previous messages correctly, it sounds like the > problem occurs when you try to open the OVS internal device as if it > is a tap. Is that right? I would expect that to simply fail but > perhaps it is clearing the IFF_OVS_DATAPATH flag as part of the > process. > That is right. I only see this happening when I set the type of the interface to internal and then attempt to use it as a bridged interface for the VM. So let me learn more about "IFF_OVS_DATAPATH flag" to better appreciate your feedback. Many thanks Jessie! -- flavio > > On Tue, Jan 5, 2016 at 7:52 AM, Flavio Fernandes <[email protected]> > wrote: > > Hi Jesse, > > > > Based on your feedback, I went back and did some more digging to find > out if > > vport param is null > > or some other garbage. > > > > Looking at the info below, 0x60 is exactly the offset for "rx_errors". > > > > Also, looking at the 'dis' output, you can see that vport is in rdi,r12 > and > > that is indeed 0. > > > > Do you think I'm missing something else here? If not, i wonder if you > agree > > that a simple > > check for vport == null to handle internal ovs interfaces is back on the > > table. > > > > Thanks, > > > > -- flavio > > > > ---- > > > > # yum install -y yum-utils crash > > # uname -r > > 3.10.0-327.3.1.el7.x86_64 > > > > # debuginfo-install kernel-3.10.0-327.3.1.el7.x86_64 > > # cd /var/crash/127.0.0.1-2015-12-27-16:04:53 > > # crash vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux > > > > ---- > > > > crash> set radix 16 > > > > crash> whatis struct vport.err_stats > > struct vport { > > [0x58] struct vport_err_stats err_stats; > > } > > > > crash> whatis vport_err_stats.rx_errors > > struct vport_err_stats { > > [0x8] atomic_long_t rx_errors; > > } > > > > crash> p 0x58 + 0x8 > > $3 = 0x60 > > > > ---- > > > > https://gist.github.com/00f5f998997a9ebd75e2 > > > > > > crash> mod -s openvswitch > > MODULE NAME SIZE OBJECT FILE > > ffffffffa04f3540 openvswitch 84543 > > > /lib/modules/3.10.0-327.3.1.el7.x86_64/kernel/net/openvswitch/openvswitch.ko > > > > crash> set radix 10 > > output radix: 10 (decimal) > > > > crash> dis -l ovs_vport_get_stats > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 292 > > 0xffffffffa04eea90 <ovs_vport_get_stats>: data32 data32 data32 xchg > > %ax,%ax [FTRACE NOP] > > 0xffffffffa04eea95 <ovs_vport_get_stats+5>: push %rbp > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 295 > > 0xffffffffa04eea96 <ovs_vport_get_stats+6>: test $0x1,%sil > > 0xffffffffa04eea9a <ovs_vport_get_stats+10>: mov $0x40,%edx > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 292 > > 0xffffffffa04eea9f <ovs_vport_get_stats+15>: mov %rsp,%rbp > > 0xffffffffa04eeaa2 <ovs_vport_get_stats+18>: push %r13 > > 0xffffffffa04eeaa4 <ovs_vport_get_stats+20>: push %r12 > > 0xffffffffa04eeaa6 <ovs_vport_get_stats+22>: mov %rdi,%r12 > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 295 > > 0xffffffffa04eeaa9 <ovs_vport_get_stats+25>: mov %rsi,%rdi > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 292 > > 0xffffffffa04eeaac <ovs_vport_get_stats+28>: push %rbx > > 0xffffffffa04eeaad <ovs_vport_get_stats+29>: mov %rsi,%rbx > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 295 > > 0xffffffffa04eeab0 <ovs_vport_get_stats+32>: jne 0xffffffffa04eeb83 > > <ovs_vport_get_stats+243> > > 0xffffffffa04eeab6 <ovs_vport_get_stats+38>: test $0x2,%dil > > 0xffffffffa04eeaba <ovs_vport_get_stats+42>: jne 0xffffffffa04eeb96 > > <ovs_vport_get_stats+262> > > 0xffffffffa04eeac0 <ovs_vport_get_stats+48>: test $0x4,%dil > > 0xffffffffa04eeac4 <ovs_vport_get_stats+52>: jne 0xffffffffa04eebad > > <ovs_vport_get_stats+285> > > 0xffffffffa04eeaca <ovs_vport_get_stats+58>: mov %edx,%ecx > > 0xffffffffa04eeacc <ovs_vport_get_stats+60>: xor %eax,%eax > > 0xffffffffa04eeace <ovs_vport_get_stats+62>: shr $0x3,%ecx > > 0xffffffffa04eead1 <ovs_vport_get_stats+65>: test $0x4,%dl > > 0xffffffffa04eead4 <ovs_vport_get_stats+68>: rep stos %rax,%es:(%rdi) > > 0xffffffffa04eead7 <ovs_vport_get_stats+71>: je 0xffffffffa04eeae3 > > <ovs_vport_get_stats+83> > > 0xffffffffa04eead9 <ovs_vport_get_stats+73>: movl $0x0,(%rdi) > > 0xffffffffa04eeadf <ovs_vport_get_stats+79>: add $0x4,%rdi > > 0xffffffffa04eeae3 <ovs_vport_get_stats+83>: test $0x2,%dl > > 0xffffffffa04eeae6 <ovs_vport_get_stats+86>: je 0xffffffffa04eeaf2 > > <ovs_vport_get_stats+98> > > 0xffffffffa04eeae8 <ovs_vport_get_stats+88>: xor %eax,%eax > > 0xffffffffa04eeaea <ovs_vport_get_stats+90>: add $0x2,%rdi > > 0xffffffffa04eeaee <ovs_vport_get_stats+94>: mov %ax,-0x2(%rdi) > > 0xffffffffa04eeaf2 <ovs_vport_get_stats+98>: and $0x1,%edx > > 0xffffffffa04eeaf5 <ovs_vport_get_stats+101>: je 0xffffffffa04eeafa > > <ovs_vport_get_stats+106> > > 0xffffffffa04eeaf7 <ovs_vport_get_stats+103>: movb $0x0,(%rdi) > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/arch/x86/include/asm/atomic64_64.h: > > 21 > > 0xffffffffa04eeafa <ovs_vport_get_stats+106>: mov 0x60(%r12),%rax > > 0xffffffffa04eeaff <ovs_vport_get_stats+111>: mov > > -0x1ee90836(%rip),%r13 # 0xffffffff8165e2d0 <cpu_possible_mask> > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 311 > > 0xffffffffa04eeb06 <ovs_vport_get_stats+118>: mov $0xffffffff,%edx > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 306 > > 0xffffffffa04eeb0b <ovs_vport_get_stats+123>: mov %rax,0x20(%rbx) > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/arch/x86/include/asm/atomic64_64.h: > > 21 > > 0xffffffffa04eeb0f <ovs_vport_get_stats+127>: mov 0x70(%r12),%rax > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 307 > > 0xffffffffa04eeb14 <ovs_vport_get_stats+132>: mov %rax,0x28(%rbx) > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/arch/x86/include/asm/atomic64_64.h: > > 21 > > 0xffffffffa04eeb18 <ovs_vport_get_stats+136>: mov 0x68(%r12),%rax > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c: > > 308 > > 0xffffffffa04eeb1d <ovs_vport_get_stats+141>: mov %rax,0x38(%rbx) > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/arch/x86/include/asm/atomic64_64.crash> > > > > == > > > > crash> bt -f > > … > > #8 [ffff88082c1df7c0] page_fault at ffffffff8163d388 > > [exception RIP: ovs_vport_get_stats+106] > > RIP: ffffffffa04eeafa RSP: ffff88082c1df870 RFLAGS: 00010246 > > RAX: 0000000000000000 RBX: ffff88082c1df898 RCX: 0000000000000000 > > RDX: 0000000000000000 RSI: ffff88082c1df898 RDI: ffff88082c1df8d8 > > RBP: ffff88082c1df888 R8: ffff88084d9ed010 R9: ffff880853324118 > > R10: ffff88085f003400 R11: 0000000000000048 R12: 0000000000000000 > > R13: 0000000000000008 R14: ffff880853324000 R15: ffff8808533240b8 > > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > > ffff88082c1df7c8: ffff8808533240b8 ffff880853324000 > > ffff88082c1df7d8: 0000000000000008 0000000000000000 > > ffff88082c1df7e8: ffff88082c1df888 ffff88082c1df898 > > ffff88082c1df7f8: 0000000000000048 ffff88085f003400 > > ffff88082c1df808: ffff880853324118 ffff88084d9ed010 > > ffff88082c1df818: 0000000000000000 0000000000000000 > > ffff88082c1df828: 0000000000000000 ffff88082c1df898 > > ffff88082c1df838: ffff88082c1df8d8 ffffffffffffffff > > ffff88082c1df848: ffffffffa04eeafa 0000000000000010 > > ffff88082c1df858: 0000000000010246 ffff88082c1df870 > > ffff88082c1df868: 0000000000000018 ffff88082c1dfa58 > > ffff88082c1df878: ffff880035e2b000 0000000000000008 > > ffff88082c1df888: ffff88082c1df8e8 ffffffffa04ef079 > > #9 [ffff88082c1df890] internal_dev_get_stats at ffffffffa04ef079 > > [openvswitch] > > ffff88082c1df898: 0000000000000000 0000000000000000 > > ffff88082c1df8a8: 0000000000000000 0000000000000000 > > crash> > > > > == > > > > $ sed -n 291,308p > > > /usr/src/debug/kernel-3.10.0-327.3.1.el7/linux-3.10.0-327.3.1.el7.x86_64/net/openvswitch/vport.c > > void ovs_vport_get_stats(struct vport *vport, struct ovs_vport_stats > *stats) > > { > > int i; > > > > memset(stats, 0, sizeof(*stats)); > > > > /* We potentially have 2 sources of stats that need to be combined: > > * those we have collected (split into err_stats and percpu_stats) from > > * set_stats() and device error stats from netdev->get_stats() (for > > * errors that happen downstream and therefore aren't reported through > > * our vport_record_error() function). > > * Stats from first source are reported by ovs (OVS_VPORT_ATTR_STATS). > > * netdev-stats can be directly read over netlink-ioctl. > > */ > > > > stats->rx_errors = atomic_long_read(&vport->err_stats.rx_errors); > > <== line 306 > > stats->tx_errors = atomic_long_read(&vport->err_stats.tx_errors); > > stats->tx_dropped = atomic_long_read(&vport->err_stats.tx_dropped); <== > > line 308 > > [dell:devstack-nodes-master.git] (master.wip)$ > > > > ==== > > > > > > > > On Mon, Jan 4, 2016 at 11:30 PM, Flavio Fernandes <[email protected]> > > wrote: > >> > >> > >> > >> On Mon, Jan 4, 2016 at 8:47 PM, Jesse Gross <[email protected]> wrote: > >>> > >>> On Mon, Jan 4, 2016 at 1:41 PM, Flavio Fernandes <[email protected]> > >>> wrote: > >>> > So, I'm a happy camper, but can't help but worry a little about the > >>> > fragility of the > >>> > system when one attempts to use a port type internal 'directly' as > >>> > bridged. > >>> > The fix > >>> > I have in mind is relatively simple: add a check in > >>> > internal_dev_get_stats > >>> > to gracefully handle cases when ovs_internal_dev_get_vport returns > >>> > null. Too > >>> > simple? > >>> > >>> I don't think that the problem is simply that we are returning NULL > >>> from ovs_internal_dev_get_vport(). ovs_internal_dev_get_vport() should > >>> never return NULL to internal_dev_get_stats() because it is checking > >>> whether the device has a ops structure that is equal to the one that > >>> leads to internal_dev_get_stats(). And in fact, if you look at the > >>> full stack trace, the address being dereferenced is 0x0000000000000060 > >>> rather than 0x0 from a real NULL. > >> > >> > >> ack. If ovs_internal_dev_get_vport() is not returning NULL then this is > >> not as simple as what I was interpreting. My thinking was that 0x60 is > the > >> offset of > >> > >> &vport->err_stats.rx_errors > >> > >> from line 306 in > >> http://lxr.oss.org.cn/source/net/openvswitch/vport.c#L306 > >> but you may be right in that if vport was not NULL, then this is an > issue > >> in > >> what ovs_internal_dev_get_vport() is returning. > >> > >> > >>> > >>> This looks like something is overwriting the vport pointer in the > >>> device structure. If you follow where this is coming from you'll wind > >>> up at ovs_netdev_get_vport() which is a maze of twisty conditions that > >>> depend on what kernel version you are using. Particularly on the RHEL > >>> kernels (which based on your email address I'm guessing you're using), > >>> the pointer is stashed in a variety of places. My guess is that these > >>> are not entirely safe in some conditions - likely related to tap > >>> devices based on your other description. I think the best path forward > >>> is to try to see which of the conditions your kernel version falls > >>> into and try to see what might be stomping on the pointer. > >> > >> > >> I see. So it could be I'm looking at the wrong source code. I am > >> using Centos 7.2 kernel (3.10.0-327.3.1.el7.x86_64 x86_64); I will > >> find out more about how that differs from upstream kernel. > >> > >> THANKS Jesse! > >> > >> -- flaviof > >> > >> > >> > > >
_______________________________________________ discuss mailing list [email protected] http://openvswitch.org/mailman/listinfo/discuss
