I have run into the same problem that Liang Dong reported:
  http://thread.gmane.org/gmane.linux.network.openvswitch.general/11704
 
 
  Below is the message body copied from Liang Dong's post.
 
 
  Hi
 
 
  We have found a very strange bug in Open vSwitch: when it is connected to a 
Cisco switch port, the port randomly gets err-disabled.
 
 
  We have 76 Debian servers running Open vSwitch (2.4.0), each connected to a 
port on a Cisco Switch 3110. Every week or two, one of the switch ports ends up 
err-disabled. From the Cisco switch's perspective, the port was disabled 
because it detected a loopback: it received a keepalive message that had 
originated from that same switch port.
 
 
  The keepalive message looked like this:
 
 
  11:37:01.749102 e8:04:62:c8:6e:81 > e8:04:62:c8:6e:81, ethertype Loopback (0x9000), length 60: Loopback, skipCount 0, Reply, receipt number 0, data (40 octets)
  0x0000: 0000 0100 0000 0000 0000 0000 0000 0000 ................
  0x0010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
  0x0020: 0000 0000 0000 0000 0000 0000 0000  ..............
 
 
  Our first guess was that Open vSwitch accidentally sends the keepalive 
message it received back out of the same port, which leads to the err-disabled 
state. Normally Open vSwitch discards this message, but once every week or two 
across the 76 servers it gets sent back to the port on the Cisco switch, and 
the port becomes err-disabled.
 
 
  The workarounds we are using now are either to disable sending keepalive 
messages on the Cisco switch or to explicitly add a flow rule on Open vSwitch 
that discards the keepalive message.
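
For the second workaround, a drop rule of that kind might look like the sketch below. The bridge name br0 is a placeholder (the original post does not name the bridge), and the priority is simply the highest OpenFlow value so that the rule wins over any other flow.

```shell
# Drop every frame with ethertype 0x9000 (Loopback) on the bridge.
# "br0" is a placeholder; substitute the bridge that holds the physical NIC.
ovs-ofctl add-flow br0 "priority=65535,dl_type=0x9000,actions=drop"

# Confirm the rule is installed and watch its packet counter.
ovs-ofctl dump-flows br0 "dl_type=0x9000"
```

On the Cisco side, the first workaround is normally the interface-level `no keepalive` command, but that should be checked against the documentation for the switch in question.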

 It's really hard to debug problems that are intermittent and require
 specific hardware. If you can eliminate one of those parts of the
 problem, then it's easier to deal with. To attack the intermittent
 part, perhaps you could make the Cisco switch send these keepalive
 messages much more frequently. To attack the specific hardware part,
 maybe you could reproduce this by sending similar keepalive messages in
 software and demonstrate that sometimes OVS sends them back.
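
One possible way to follow that suggestion is to build a one-packet capture file containing a frame shaped like the keepalive quoted above and replay it into a lab bridge. The sketch below only writes the file. The MAC address and payload layout are copied from the tcpdump output earlier in the thread; the output file name and the helper functions are made up for illustration.

```shell
#!/bin/sh
# Sketch: write a one-packet pcap holding a keepalive-like frame for lab replay.
# The MAC address comes from the capture quoted above; OUT is a hypothetical name.
set -eu

OUT=${OUT:-keepalive.pcap}

# Emit the given hex octets (e.g. "e8 04") as raw bytes.
hex_bytes() {
    for octet in "$@"; do
        printf "\\$(printf '%03o' "0x$octet")"
    done
}

# Emit N zero bytes.
zero_bytes() {
    dd if=/dev/zero bs=1 count="$1" 2>/dev/null
}

{
    # pcap global header (little-endian): magic, version 2.4, zone, sigfigs,
    # snaplen 65535, link type 1 (Ethernet)
    hex_bytes d4 c3 b2 a1 02 00 04 00
    zero_bytes 8
    hex_bytes ff ff 00 00 01 00 00 00
    # per-packet header: timestamp 0, captured length 60, original length 60
    zero_bytes 8
    hex_bytes 3c 00 00 00 3c 00 00 00
    # Ethernet header: destination == source, ethertype 0x9000
    hex_bytes e8 04 62 c8 6e 81 e8 04 62 c8 6e 81 90 00
    # Loopback payload: skipCount 0, function 1 (Reply), zero padding to 60 bytes
    hex_bytes 00 00 01 00
    zero_bytes 42
} > "$OUT"
```

The file can be inspected with `tcpdump -e -nn -r keepalive.pcap`. Assuming tcpreplay is installed, something like `tcpreplay -i <interface> --loop=1000 keepalive.pcap` could then inject the frame repeatedly into a test port while watching whether OVS ever sends it back.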


By default, OVS generates a datapath flow that drops the loopback packet,
as shown below:
recirc_id(0),in_port(3),eth(src=b0:00:b4:67:17:9b,dst=b0:00:b4:67:17:9b),eth_type(0x9000),
 packets:5, bytes:300, used:0.476s, actions:drop


port(3) is the port backed by the physical NIC.
Is it possible for the loopback packet to be sent back out of the OVS switch, 
i.e., out of port(3) in this case?


We have also deployed about 500 KVM hosts with OVS VLAN networking, and the 
problem occurs there too.
The OVS configuration on one host is shown below:
$ ovs-vsctl show
248fd0b2-6c73-4a5a-b5c6-ec4af56fe569
  Bridge br-int
    fail_mode: secure
    Port "tapa74a924e-e5"
      tag: 1
      Interface "tapa74a924e-e5"
    Port br-int
      Interface br-int
        type: internal
    Port "tapfe684b50-b0"
      tag: 1
      Interface "tapfe684b50-b0"
    Port "int-br100"
      Interface "int-br100"
        type: patch
        options: {peer="phy-br100"}
    Port "tap8d7dcd11-86"
      tag: 1
      Interface "tap8d7dcd11-86"
    Port "tapaac394c0-ae"
      tag: 1
      Interface "tapaac394c0-ae"
  Bridge "br100"
    Port "enp2s0"
      Interface "enp2s0"
    Port "br100"
      Interface "br100"
        type: internal
    Port "phy-br100"
      Interface "phy-br100"
        type: patch
        options: {peer="int-br100"}
  ovs_version: "2.4.0"
I plan to take the following steps to debug this problem:
1) Run tcpdump on enp2s0 to capture the loopback packet:
tcpdump -i enp2s0 -e -nn "not ip and not arp" -w /home/enp2s0.pcap


2) Set up a port mirror in br100 with select_src_port set to enp2s0, connect 
veth0 to the mirror output port, and run tcpdump on veth0:
tcpdump -i veth0 -e -nn "not ip and not arp" -w /home/veth0.pcap


3) Set up a port mirror in br100 with select_dst_port set to enp2s0, connect 
veth1 to the mirror output port, and run tcpdump on veth1:
tcpdump -i veth1 -e -nn "not ip and not arp" -w /home/veth1.pcap
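
The two mirrors in steps 2) and 3) could be set up roughly as follows. This is a sketch, not a tested procedure for this host: the peer names veth0-ovs and veth1-ovs and the mirror names are invented here, and the commands need root on a machine running OVS.

```shell
# veth pairs: the *-ovs end joins br100 as mirror output, the other end is sniffed.
# The *-ovs names are hypothetical.
ip link add veth0 type veth peer name veth0-ovs
ip link add veth1 type veth peer name veth1-ovs
ip link set veth0 up
ip link set veth0-ovs up
ip link set veth1 up
ip link set veth1-ovs up

ovs-vsctl add-port br100 veth0-ovs
ovs-vsctl add-port br100 veth1-ovs

# One transaction creating both mirrors:
#   enp2s0-rx copies frames received on enp2s0 to veth0-ovs (select-src-port)
#   enp2s0-tx copies frames sent out of enp2s0 to veth1-ovs (select-dst-port)
ovs-vsctl \
  -- --id=@nic get Port enp2s0 \
  -- --id=@rxout get Port veth0-ovs \
  -- --id=@txout get Port veth1-ovs \
  -- --id=@rx create Mirror name=enp2s0-rx select-src-port=@nic output-port=@rxout \
  -- --id=@tx create Mirror name=enp2s0-tx select-dst-port=@nic output-port=@txout \
  -- set Bridge br100 mirrors=@rx,@tx
```

Afterwards, `ovs-vsctl clear Bridge br100 mirrors` removes both mirrors, and `ovs-vsctl del-port br100 veth0-ovs` (likewise for veth1-ovs) detaches the output ports.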


If the physical switch port goes err-disabled, it means the switch received the 
loopback packet back from the host. I will then check enp2s0.pcap, veth0.pcap, 
and veth1.pcap.
If the loopback packet is found in both veth1.pcap and veth0.pcap, we can 
conclude that the problem is caused by OVS.
If the loopback packet is not found in veth1.pcap but is found in enp2s0.pcap, 
we can conclude that the problem is caused by the physical NIC, the NIC driver, 
or even the kernel.
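
The check described above could be scripted along these lines. It is a sketch: the function names and result labels are invented, and it assumes tcpdump prints one line per matching frame. Note that a capture on enp2s0 records both directions by default, so the keepalives arriving from the switch will appear in enp2s0.pcap in any case; restricting that capture to outgoing frames (for example with `tcpdump -Q out`, where the installed version supports it) would make its count more meaningful.

```shell
#!/bin/sh
# Sketch: count loopback (ethertype 0x9000) frames per capture and apply the
# decision rule described above. Function names and labels are hypothetical.

# Number of 0x9000 frames in a pcap file.
count_loopback() {
    tcpdump -nn -e -r "$1" 'ether proto 0x9000' 2>/dev/null | wc -l | tr -d ' '
}

# Args: frame counts for veth0.pcap, veth1.pcap, enp2s0.pcap (in that order).
diagnose() {
    if [ "$1" -gt 0 ] && [ "$2" -gt 0 ]; then
        echo "ovs"
    elif [ "$2" -eq 0 ] && [ "$3" -gt 0 ]; then
        echo "nic, driver or kernel"
    else
        echo "inconclusive"
    fi
}

# Only run against the capture files when all three are present.
if [ -r /home/enp2s0.pcap ] && [ -r /home/veth0.pcap ] && [ -r /home/veth1.pcap ]; then
    diagnose "$(count_loopback /home/veth0.pcap)" \
             "$(count_loopback /home/veth1.pcap)" \
             "$(count_loopback /home/enp2s0.pcap)"
fi
```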


Am I right?


Thanks,
Zhang Haoyu
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
