Hi, Eelco and Ilya It has been a long time and I see there are a lot fixes on the revdalidator's code about this statistics code. Has this stats inconsistent issue been solved?
I just realize that making dump seq equal in *try_ukey_replace* is not enough. we might need to clear the old ukey's stats. since we only reuse the old ukey in EVICTED states, which means we have executed delete-megaflow-op on this ukey and the stats info has been pushed. so if there are new megaflow upcalls and replaces the existing ukey's, the ukey->stats should be cleared to sync with the new megaflows. if not, when revaldiator tries to delete this ukey, the push_dp_ops will find inconsistent stats between ukey->stats( as it's synced with old megaflow) and new magaflow, which is just the case you describe here. Peng He <[email protected]> 于2022年5月21日周六 13:02写道: > Hi, Eelco and Ilya, > > We have observed the similar case in our production environments. > > After digging some code, I would like a discussion: > > In the situation that the ovs revalidator is under high load, i.e. the > number of megaflows are too large, > and the revalidators would try to delete some megaflows, at the same time, > there are traffic keeps > generating new megaflows through upcalls. > > the underlying cmap of megaflows changes due to a lot of deletion and > insertion (cmap will change its bucket number, capacity, > and re-place the megaflows in different buckets), resulting in a same > magaflow might be dumped twice in two revalidator threads in one dump stage. > > Consider a case: > > revalidator 1 gets the megaflow A dumped and its related ukey u(A), and > decide to kill the megaflow A. > After megaflow deletion, the pmd generates a new megaflow B with the same > ufid, i.e. the same match > and the same actions, and replace the ukey u(A) with u(B). > > Now revalidator 2 gets the old megaflow A again, and it will find the ukey > u(B). > u(B) is the new ukey, it's dump_seq is 0, and this leads megaflow A is > viewed as megaflow B, and its > statistics are contributed into ukey(B). > > The megaflow A and megaflow B is in essential the same one, the only > difference is the statistics value. > This mismatch of megaflow and ukey has two side effects: > > 1) the megaflow A's statistics have been contributed twice, leading to a > amplified value of openflow rule's > statistics. This is our observation in our environments. > > 2) revalidator 2 might also decides to kill ukey(B), but now its > statistics equals to megaflow A, and thus > results in a mismatch value in the email that Eelco observed. > > To fix the case 1), I think, during the ukey replace, the new ukey > generated by upcall should take the dump_seq > value of the replaced old key, and avoid the old megaflow being dumped and > find the new key. > > and the case 2) is also fixed. > > Any thoughts? > > > Eelco Chaudron <[email protected]> 于2022年2月22日周二 16:10写道: > >> >> >> On 21 Feb 2022, at 12:29, Eelco Chaudron wrote: >> >> > On 17 Feb 2022, at 14:10, Ilya Maximets wrote: >> > >> >> On 1/31/22 11:54, Eelco Chaudron wrote: >> >>> Make sure to only update packet and byte counters when valid, >> >>> or else this could lead to "temporarily/occasionally" >> >>> out-of-sync flow counters. >> >> >> >> There was already the same patch submitted here: >> >> >> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/ >> >> >> >> And I'm still not comfortable with the change, because it seems >> >> like it only hides the underlying datapath problem. Do you know >> >> why exactly datapath stats become lower than previously reported? >> >> >> >> If it's some kind of a statistics flush, that will mean that flow >> >> statistics will not be updated until new stats will catch up to the >> >> old value leading to the flow revalidation and incorrect flow stats >> >> anyway. >> > >> > This was very hard to reproduce, only once out of 20-30 runs if I >> remember correctly. >> > >> > Without the patch, it would sometimes show very high numbers and then >> got updated after a while with the correct numbers. >> > At least this is what I remember, as I did not take any notes, and my >> brain wanted to forget this patchset :) >> > >> > >> > Guess the fix is needed anyway as this behavior was there since day one >> in revalidate_ukey(), e79a6c833. >> > >> > I also see that you got no reply to your comments, so I’ll take another >> stab to make sure this patch is really fixing the problem, or fixing a >> problem, and hiding another TC problem. >> > >> >> We fixed incorrect stats on flow modification for dpdk offload provider >> >> previously here: >> >> >> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/ >> >> >> >> Do we need something similar for tc? >> > >> > I’ll take a close look at the DPDK patch, once I get my reproducer >> going. >> >> Unfortunately, after an afternoon and night or running tests, I can not >> replicate the problem with the weird counters :( >> >> So for now I’ll drop this patch from the set, and hopefully, it will >> resurface when I’m integrating the system-traffic tests into the hardware >> offload set. >> >> <SNIP> >> >> _______________________________________________ >> dev mailing list >> [email protected] >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> > > > -- > hepeng > -- hepeng _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
