Hi, I have tried test #50 for 200 times, the overflow as well as stats of 0 did not show up.
the command line is: for i in {1..100}; do echo $i; make check-offloads TESTSUITEFLAGS="-v 50" || break; done I am using Debian 10, and my kernel version is root@ubuntu:~/ovs# uname -r 5.4.0-121-generic I doubt the overflow might be related to kernel datapath. IIRC, the kernel datapath does not support some IPv6 ND/NA match. In the logs, there are some warning saying that "failed to get/del/put some datapath flow", which is all related the ICMPv6. It seems to prove the missing of the support of NA/ND match in datapath. since the ICMPv6 is not related to the test itself. I did some modification to the test to drop all IPv6 traffic. Below is my complete diff to the #50: diff --git a/tests/system-traffic.at b/tests/system-traffic.at index dfbc30e47..b759c4bb9 100644 --- a/tests/system-traffic.at +++ b/tests/system-traffic.at @@ -1762,7 +1762,7 @@ on_exit 'rm -f payload200.bin' AT_CHECK([ovs-ofctl del-flows br0]) AT_DATA([flows.txt], [dnl -priority=99,in_port=1,actions=output(port=2,max_len=100),output(port=3,max_len=100) +priority=99,in_port=1,ip,actions=output(port=2,max_len=100),output(port=3,max_len=100) priority=99,in_port=2,udp,actions=output(port=1,max_len=100) priority=1,in_port=4,ip,actions=drop priority=1,actions=drop @@ -1771,6 +1771,8 @@ AT_CHECK([ovs-ofctl add-flows br0 flows.txt]) AT_CHECK([ovs-ofctl del-flows br-underlay]) AT_DATA([flows-underlay.txt], [dnl +priority=99,arp,in_port=1,actions=LOCAL +priority=99,arp,in_port=LOCAL,actions=1 priority=99,dl_type=0x0800,nw_proto=47,in_port=1,actions=LOCAL priority=99,dl_type=0x0800,nw_proto=47,in_port=LOCAL,ip_dst= 172.31.1.1/24,actions=1 priority=1,actions=drop @@ -1787,7 +1789,7 @@ AT_CHECK([ovs-ofctl dump-flows br0 | grep "in_port=2" | sed -n 's/.*\(n\_bytes=[ n_bytes=242 ]) dnl After truncation = outer ETH(14) + outer IP(20) + GRE(4) + 100 = 138B -AT_CHECK([ovs-ofctl dump-flows br-underlay | grep "in_port=LOCAL" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl +AT_CHECK([ovs-ofctl dump-flows br-underlay | grep "ip,in_port=LOCAL" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl n_bytes=138 ]) @@ -1822,7 +1824,7 @@ AT_CHECK([ovs-ofctl dump-flows br0 | grep "in_port=2" | sed -n 's/.*\(n\_bytes=[ n_bytes=242 ]) dnl After truncation = outer ETH(14) + outer IP(20) + GRE(4) + 100 = 138B -AT_CHECK([ovs-ofctl dump-flows br-underlay | grep "in_port=LOCAL" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl +AT_CHECK([ovs-ofctl dump-flows br-underlay | grep "ip,in_port=LOCAL" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl n_bytes=138 ]) @@ -1834,7 +1836,11 @@ AT_CHECK([ovs-ofctl dump-flows br0 | grep "in_port=4" | ofctl_strip], [0], [dnl n_packets=1, n_bytes=100, priority=1,ip,in_port=4 actions=drop ]) -OVS_TRAFFIC_VSWITCHD_STOP +OVS_TRAFFIC_VSWITCHD_STOP(["dnl +/.*lost packet on port.*/d +/.failed to flow_get.*/d +/.failed to flow_del.*/d +/.*Failed to acquire udpif_key corresponding to unexpected flow.*/d"]) AT_CLEANUP It seems that the overflow here is not related to the race in revalidators, but it is related to the design of the testsuite and also the missing of the support of NA/ND in kernel datapath. Eelco Chaudron <echau...@redhat.com> 于2022年9月30日周五 23:47写道: > On 30 Sep 2022, at 17:41, Peng He wrote: > > > It's so easy to reproduce ? > > > > Thanks! then I should have to dig it again. > > It’s been on my to-do for a while but did not get to it. Sometimes it > reproduced easily, and sometimes it takes 100+ runs :( > > If you apply the following series: > > https://patchwork.ozlabs.org/project/openvswitch/list/?series=316861 > > And then the diff below you can run something like: > > sudo bash -c 'for i in {1..100}; do make check-offloads > TESTSUITEFLAGS="49 50" > > 49: datapath - truncate and output to gre tunnel by simulated packets ok > 50: datapath - truncate and output to gre tunnel ok > > And after a while, it will fail with the error (sometimes also 0 bytes, I > hope it’s the same issue :): > > system-traffic.at:1663: wait failed after 30 seconds > n_bytes=18446744073709393054 > > > > $ git diff > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c > index 57f94df54..52d61d0a4 100644 > --- a/ofproto/ofproto-dpif-upcall.c > +++ b/ofproto/ofproto-dpif-upcall.c > @@ -1868,6 +1868,7 @@ try_ukey_replace(struct umap *umap, struct udpif_key > *old_ukey, > ovs_mutex_lock(&new_ukey->mutex); > cmap_replace(&umap->cmap, &old_ukey->cmap_node, > &new_ukey->cmap_node, new_ukey->hash); > + new_ukey->dump_seq = old_ukey->dump_seq; > ovsrcu_postpone(ukey_delete__, old_ukey); > transition_ukey(old_ukey, UKEY_DELETED); > transition_ukey(new_ukey, UKEY_VISIBLE); > diff --git a/tests/system-offloads-testsuite.at b/tests/ > system-offloads-testsuite.at > index 318e6d1e6..4d546011d 100644 > --- a/tests/system-offloads-testsuite.at > +++ b/tests/system-offloads-testsuite.at > @@ -114,8 +114,8 @@ conntrack - DNAT load balancing with NC > > # Occasionalt we fail with extreme high byte counters, i.e. > # n_bytes=18446744073705804134 > -datapath - truncate and output to gre tunnel by simulated packets > -datapath - truncate and output to gre tunnel > +#datapath - truncate and output to gre tunnel by simulated packets > +#datapath - truncate and output to gre tunnel > " > echo "$ovs_test_skip_list" | sed "s/<SPC>/ /g"]) > > diff --git a/tests/system-traffic.at b/tests/system-traffic.at > index 528d2ca64..e86699d1d 100644 > --- a/tests/system-traffic.at > +++ b/tests/system-traffic.at > @@ -1708,7 +1708,10 @@ OVS_REVALIDATOR_PURGE() > OVS_WAIT_UNTIL([ovs-ofctl dump-flows br0 | grep "in_port=4" | ofctl_strip > | grep "n_packets=1, n_bytes=100, priority=1,ip,in_port=4 actions=drop"], > [ovs-ofctl dump-flows br0 | grep "in_port=4" | > ofctl_strip]) > > -OVS_TRAFFIC_VSWITCHD_STOP > +OVS_TRAFFIC_VSWITCHD_STOP(["dnl > +/.*lost packet on handler.*/d > +/.failed to flow_get.*/d > +/.*Failed to acquire udpif_key corresponding to unexpected flow.*/d"]) > AT_CLEANUP > > dnl Create 2 bridges and 2 namespaces to test truncate over > @@ -1834,7 +1837,10 @@ AT_CHECK([ovs-ofctl dump-flows br0 | grep > "in_port=4" | ofctl_strip], [0], [dnl > n_packets=1, n_bytes=100, priority=1,ip,in_port=4 actions=drop > ]) > > -OVS_TRAFFIC_VSWITCHD_STOP > +OVS_TRAFFIC_VSWITCHD_STOP(["dnl > +/.*lost packet on handler.*/d > +/.failed to flow_get.*/d > +/.*Failed to acquire udpif_key corresponding to unexpected flow.*/d"]) > AT_CLEANUP > > AT_SETUP([datapath - configure cache size]) > > > > > Eelco Chaudron <echau...@redhat.com> 于2022年9月30日周五 23:39写道: > > > >> > >> > >> On 30 Sep 2022, at 17:26, Peng He wrote: > >> > >>> Eelco Chaudron <echau...@redhat.com> 于2022年9月30日周五 23:01写道: > >>> > >>>> > >>>> > >>>> On 23 Sep 2022, at 18:29, Peng He wrote: > >>>> > >>>>> The userspace datapath mananges all the magaflows by a cmap. The cmap > >>>>> data structrue will grow/shrink during the datapath processing and it > >>>>> will re-position megaflows. This might result in two revalidator > >> threads > >>>>> might process a same megaflow during one dump stage. > >>>>> > >>>>> Consider a situation that, revalidator 1 processes a megaflow A, and > >>>>> decides to delete it from the datapath, at the mean time, this > megaflow > >>>>> A is also queued in the process batch of revalidator 2. Normally it's > >> ok > >>>>> for revalidators to process the same megaflow multiple times, as the > >>>>> dump_seq shows it's already dumped and the stats will not be > >> contributed > >>>>> twice. > >>>>> > >>>>> Assume that right after A is deleted, a PMD thread generates again > >>>>> a new megaflow B which has the same match and action of A. The ukey > >>>>> of megaflow B will replace the one of megaflow A. Now the ukey B is > >>>>> new to the revalidator system and its dump seq is 0. > >>>>> > >>>>> Now since the dump seq of ukey B is 0, when processing megaflow A, > >>>>> the revalidator 2 will not identify this megaflow A has already been > >>>>> dumped by revalidator 1 and will contribute the old megaflow A's > stats > >>>>> again, this results in an inconsistent stats between ukeys and > >> megaflows. > >>>>> > >>>>> To fix this, the newly generated the ukey B should take the dump_seq > >>>>> of the replaced ukey A to avoid a same megaflow being revalidated > >>>>> twice in one dump stage. > >>>>> > >>>>> We observe in the production environment, the OpenFlow rules' stats > >>>>> sometimes are amplified compared to the actual value. I believe this > >>>>> is also the reason that why somtimes there is mismatch between the > >>>>> ukey and megaflow in stats value. The Eelco's patch > >>>>> [ovs-dev] [PATCH v2 09/10] revalidator: Fix datapath statistics > update > >>>>> tried to fix it in the past. > >>>> > >>>> This sounds plausible, are your statistics extremely elevated? > >>>> Mine are in the likes of n_bytes=18446744073705804134 where it should > be > >>>> around 100. > >>>> > >>>> It looks more like an overflow. > >>> I just get the report from another team, I need to ask them, we will > >> have a > >>> 7 days off due to the national day. > >>> so it will take time to get the value. :( > >> > >> Enjoy your time off!! This fix is not solving my problem; > >> > >> Still get the error once out of X runs, n_bytes=18446744073709393054 > >> > >>> I’ll try to get my old setup up and run it continuously over the > weekend > >>>> and see if it’s replicated again. > >>>> > >>>> thanks! > >>> > >>> > >>>>> Signed-off-by: Peng He <hepeng.0...@bytedance.com> > >>>>> --- > >>>>> ofproto/ofproto-dpif-upcall.c | 1 + > >>>>> 1 file changed, 1 insertion(+) > >>>>> > >>>>> diff --git a/ofproto/ofproto-dpif-upcall.c > >>>> b/ofproto/ofproto-dpif-upcall.c > >>>>> index e8bbcfeaf..89fad1bdf 100644 > >>>>> --- a/ofproto/ofproto-dpif-upcall.c > >>>>> +++ b/ofproto/ofproto-dpif-upcall.c > >>>>> @@ -1877,6 +1877,7 @@ try_ukey_replace(struct umap *umap, struct > >>>> udpif_key *old_ukey, > >>>>> ovs_mutex_lock(&new_ukey->mutex); > >>>>> cmap_replace(&umap->cmap, &old_ukey->cmap_node, > >>>>> &new_ukey->cmap_node, new_ukey->hash); > >>>>> + new_ukey->dump_seq = old_ukey->dump_seq; > >>>>> ovsrcu_postpone(ukey_delete__, old_ukey); > >>>>> transition_ukey(old_ukey, UKEY_DELETED); > >>>>> transition_ukey(new_ukey, UKEY_VISIBLE); > >>>>> -- > >>>>> 2.25.1 > >>>>> > >>>>> _______________________________________________ > >>>>> dev mailing list > >>>>> d...@openvswitch.org > >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>>> > >>>> > >>> > >>> -- > >>> hepeng > >> > >> > > > > -- > > hepeng > > -- hepeng _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev