On 11/19/21 18:09, Kevin Traynor wrote: > On 19/11/2021 12:34, Kevin Traynor wrote: >> On 19/11/2021 12:25, Ilya Maximets wrote: >>> Hi, Kevin. >>> >> >> Hi Ilya, >> >>> After the recent ALB commits, some of these tests are failing more >>> or less frequently in our CI and on my local setup too. Could you, >>> please, take a look? >>> >> >> Sure, I'll take a look. Thanks for letting me know, I hadn't seen it. >> >> Kevin. >> >>> Below is one of the recent examples. Another test that fails with >>> similar symptoms periodically is 'ALB - min num PMD/RxQ'. >>> It looks like it doesn't actually perform the dry run. At least it >>> doesn't log that. >>> > > It is almost certainly related to the using time/warp to let the PMDs > indicate that they have met the load criteria over 1 minute. My best guess is > that "Flaky" suggests the PMDs are not always finished incrementing the > "overloaded" variable to >= 6 during the timewarp. > Perhaps they are sometimes completing it after the timewarp but before the > test times out and sometimes not. This variable is then checked and gates a > config check/dry-run which leads to the anticipated logs. > > At present the timewarp is to allow the PMD increment the variable 6 times > over the course of 1 minute (ovs-appctl time/warp 60000 10000). > > I have run hundreds of times locally on 2 different systems and it is passing > for me each time. With some extra logs to print the overloaded var for each > PMD when it is checked, I can see overloaded == 6 as expected. > > 2021-11-19T16:33:34.303Z|00074|unixctl|DBG|replying with success, id=0: "" > 2021-11-19T16:33:34.310Z|00075|unixctl|DBG|received request > time/warp["60000","10000"], id=0 > 2021-11-19T16:33:34.330Z|00076|memory|INFO|35676 kB peak resident set size > after 20.2 seconds > 2021-11-19T16:33:34.330Z|00077|memory|INFO|handlers:5 idl-cells:153 ports:2 > revalidators:3 rules:4 > 2021-11-19T16:33:34.361Z|00078|unixctl|DBG|replying with success, id=0: > "warped" > 2021-11-19T16:33:34.371Z|00079|dpif_netdev|INFO|PMD ALB: pmd 1, overloaded > intrvls = 6. > 2021-11-19T16:33:34.371Z|00080|dpif_netdev|INFO|PMD ALB: pmd 1, pmd_rebalance > = true. > 2021-11-19T16:33:34.371Z|00081|dpif_netdev|INFO|PMD ALB: pmd 0, overloaded > intrvls = 6. > 2021-11-19T16:33:34.371Z|00082|dpif_netdev|INFO|PMD ALB: pmd 0, pmd_rebalance > = true. > 2021-11-19T16:33:34.371Z|00083|dpif_netdev|INFO|PMD ALB: about to check > conditions. > 2021-11-19T16:33:34.371Z|00084|dpif_netdev|DBG|PMD auto load balance > performing dry run. > > I'll need to find a way to reproduce next week, maybe we can discuss > further/compare systems etc. In the meantime if anyone has other clues, would > be good to hear.
From my understanding, it fails on slow systems with high concurreny. I can consistently reproduce the issue on my laptop by building with AddressSanitizer and running tests with a high cuncurrency, like this: ./configure CFLAGS="-g -O1 -fno-omit-frame-pointer -fsanitize=address" CC=clang make make check TESTSUITEFLAGS='1000-1040 -j16' I can not reproduce most of the time with -j8. My laptop has 4 cores and 8 threads. > > As a last resort, I can always send a patch to temporarily revert the > timewarp parts of the tests while this issue remains. They are not failing all the time, so I don't think we actually need to revert for now. > > thanks, > Kevin. > >>> Bets regards, Ilya Maximets. >>> >>> >>> # -*- compilation -*- >>> 1026. alb.at:98: testing ALB - PMD/RxQ assignment type ... >>> ../../../tests/alb.at:99: ovsdb-tool create conf.db >>> $abs_top_srcdir/vswitchd/vswitch.ovsschema >>> ../../../tests/alb.at:99: ovsdb-server --detach --no-chdir --pidfile >>> --log-file --remote=punix:$OVS_RUNDIR/db.sock >>> stderr: >>> 2021-11-19T02:22:41Z|00001|vlog|INFO|opened log file >>> /home/runner/work/ovs/ovs/openvswitch-2.16.90/_build/sub/tests/testsuite.dir/1026/ovsdb-server.log >>> ../../../tests/alb.at:99: sed < stderr ' >>> /vlog|INFO|opened log file/d >>> /ovsdb_server|INFO|ovsdb-server (Open vSwitch)/d' >>> ../../../tests/alb.at:99: ovs-vsctl --no-wait init >>> ../../../tests/alb.at:99: ovs-vswitchd --enable-dummy --disable-system >>> --disable-system-route --dummy-numa="0,0" --detach --no-chdir --pidfile >>> --log-file -vvconn -vofproto_dpif -vunixctl >>> stderr: >>> 2021-11-19T02:22:41Z|00001|vlog|INFO|opened log file >>> /home/runner/work/ovs/ovs/openvswitch-2.16.90/_build/sub/tests/testsuite.dir/1026/ovs-vswitchd.log >>> 2021-11-19T02:22:41Z|00002|ovs_numa|INFO|Discovered 2 CPU cores on NUMA >>> node 0 >>> 2021-11-19T02:22:41Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 2 CPU >>> cores >>> 2021-11-19T02:22:41Z|00004|reconnect|INFO|unix:/home/runner/work/ovs/ovs/openvswitch-2.16.90/_build/sub/tests/testsuite.dir/1026/db.sock: >>> connecting... >>> 2021-11-19T02:22:41Z|00005|netlink_socket|INFO|netlink: could not enable >>> listening to all nsid (Operation not permitted) >>> 2021-11-19T02:22:41Z|00006|reconnect|INFO|unix:/home/runner/work/ovs/ovs/openvswitch-2.16.90/_build/sub/tests/testsuite.dir/1026/db.sock: >>> connected >>> ../../../tests/alb.at:99: sed < stderr ' >>> /ovs_numa|INFO|Discovered /d >>> /vlog|INFO|opened log file/d >>> /vswitchd|INFO|ovs-vswitchd (Open vSwitch)/d >>> /reconnect|INFO|/d >>> /dpif_netlink|INFO|Generic Netlink family .ovs_datapath. does not exist/d >>> /ofproto|INFO|using datapath ID/d >>> /netdev_linux|INFO|.*device has unknown hardware address family/d >>> /ofproto|INFO|datapath ID changed to fedcba9876543210/d >>> /dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable/d >>> /netlink_socket|INFO|netlink: could not enable listening to all nsid/d >>> /probe tc:/d >>> /tc: Using policy/d' >>> ../../../tests/alb.at:99: add_of_br 0 add-port br0 p0 \ >>> -- set Interface p0 type=dummy-pmd options:n_rxq=3 \ >>> -- set Open_vSwitch . other_config:pmd-cpu-mask=3 \ >>> -- set open_vswitch . other_config:pmd-auto-lb="true" >>> \ >>> -- set open_vswitch . >>> other_config:pmd-auto-lb-load-threshold=0 >>> alb.at:105: waiting until grep "PMD auto load balance is enabled" >>> ovs-vswitchd.log... >>> 2021-11-19T02:22:41.880Z|00039|dpif_netdev|INFO|PMD auto load balance is >>> enabled, interval 1 mins, pmd load threshold 0%, improvement threshold 25%. >>> alb.at:105: wait succeeded immediately >>> ../../../tests/alb.at:109: ovs-vsctl set Open_vSwitch . >>> other_config:pmd-rxq-assign=roundrobin >>> alb.at:110: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "mode >>> changed to: 'roundrobin'"... >>> 2021-11-19T02:22:41.942Z|00051|dpif_netdev|INFO|Rxq to PMD assignment mode >>> changed to: 'roundrobin'. >>> alb.at:110: wait succeeded immediately >>> ../../../tests/alb.at:113: ovs-vsctl set open_vswitch . >>> other_config:pmd-auto-lb="false" >>> alb.at:114: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "PMD >>> auto load balance is disabled"... >>> 2021-11-19T02:22:42.026Z|00056|dpif_netdev|INFO|PMD auto load balance is >>> disabled. >>> alb.at:114: wait succeeded immediately >>> ../../../tests/alb.at:116: ovs-vsctl set open_vswitch . >>> other_config:pmd-auto-lb="true" >>> alb.at:117: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "PMD >>> auto load balance is enabled"... >>> 2021-11-19T02:22:42.060Z|00057|dpif_netdev|INFO|PMD auto load balance is >>> enabled, interval 1 mins, pmd load threshold 0%, improvement threshold 25%. >>> alb.at:117: wait succeeded immediately >>> ../../../tests/alb.at:121: ovs-vsctl set Open_vSwitch . >>> other_config:pmd-rxq-assign=cycles >>> alb.at:122: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "mode >>> changed to: 'cycles'"... >>> 2021-11-19T02:22:42.122Z|00058|dpif_netdev|INFO|Rxq to PMD assignment mode >>> changed to: 'cycles'. >>> alb.at:122: wait succeeded immediately >>> ../../../tests/alb.at:125: ovs-vsctl set open_vswitch . >>> other_config:pmd-auto-lb="false" >>> alb.at:126: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "PMD >>> auto load balance is disabled"... >>> 2021-11-19T02:22:42.175Z|00063|dpif_netdev|INFO|PMD auto load balance is >>> disabled. >>> alb.at:126: wait succeeded immediately >>> ../../../tests/alb.at:128: ovs-vsctl set open_vswitch . >>> other_config:pmd-auto-lb="true" >>> alb.at:129: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "PMD >>> auto load balance is enabled"... >>> 2021-11-19T02:22:42.230Z|00064|dpif_netdev|INFO|PMD auto load balance is >>> enabled, interval 1 mins, pmd load threshold 0%, improvement threshold 25%. >>> alb.at:129: wait succeeded immediately >>> ../../../tests/alb.at:132: ovs-vsctl set Open_vSwitch . >>> other_config:pmd-rxq-assign=group >>> alb.at:133: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "mode >>> changed to: 'group'"... >>> 2021-11-19T02:22:42.279Z|00065|dpif_netdev|INFO|Rxq to PMD assignment mode >>> changed to: 'group'. >>> alb.at:133: wait succeeded immediately >>> ../../../tests/alb.at:136: ovs-vsctl set open_vswitch . >>> other_config:pmd-auto-lb="false" >>> alb.at:137: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "PMD >>> auto load balance is disabled"... >>> 2021-11-19T02:22:42.305Z|00070|dpif_netdev|INFO|PMD auto load balance is >>> disabled. >>> alb.at:137: wait succeeded immediately >>> ../../../tests/alb.at:139: ovs-vsctl set open_vswitch . >>> other_config:pmd-auto-lb="true" >>> alb.at:140: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "PMD >>> auto load balance is enabled"... >>> 2021-11-19T02:22:42.330Z|00071|dpif_netdev|INFO|PMD auto load balance is >>> enabled, interval 1 mins, pmd load threshold 0%, improvement threshold 25%. >>> alb.at:140: wait succeeded immediately >>> ../../../tests/alb.at:143: ovs-appctl vlog/set dpif_netdev:dbg >>> ../../../tests/alb.at:146: ovs-vsctl set Open_vSwitch . >>> other_config:pmd-rxq-assign=group >>> warped >>> alb.at:149: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "PMD >>> auto load balance performing dry run."... >>> 2021-11-19T02:22:42.434Z|00078|dpif_netdev|DBG|PMD auto load balance >>> performing dry run. >>> alb.at:149: wait succeeded quickly >>> ../../../tests/alb.at:152: ovs-vsctl set Open_vSwitch . >>> other_config:pmd-rxq-assign=cycles >>> alb.at:153: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "mode >>> changed to: 'cycles'"... >>> 2021-11-19T02:22:42.570Z|00085|dpif_netdev|INFO|Rxq to PMD assignment mode >>> changed to: 'cycles'. >>> alb.at:153: wait succeeded immediately >>> warped >>> alb.at:155: waiting until tail -n +$LINENUM ovs-vswitchd.log | grep "PMD >>> auto load balance performing dry run."... >>> alb.at:155: wait failed after 30 seconds >>> ../../../tests/ovs-macros.at:242: hard failure >>> 1026. alb.at:98: 1026. ALB - PMD/RxQ assignment type (alb.at:98): FAILED >>> (ovs-macros.at:242) >>> >>> >>> >>> Last bits of the ovs-vswitchd.log: >>> >>> 2021-11-19T02:22:42.434Z|00078|dpif_netdev|DBG|PMD auto load balance >>> performing dry run. >>> 2021-11-19T02:22:42.434Z|00079|dpif_netdev|DBG|Core 0 on numa node 0 >>> assigned port 'p0' rx queue 0 (measured processing cycles 0). >>> 2021-11-19T02:22:42.434Z|00080|dpif_netdev|DBG|Core 1 on numa node 0 >>> assigned port 'p0' rx queue 1 (measured processing cycles 0). >>> 2021-11-19T02:22:42.434Z|00081|dpif_netdev|DBG|Core 0 on numa node 0 >>> assigned port 'p0' rx queue 2 (measured processing cycles 0). >>> 2021-11-19T02:22:42.434Z|00082|dpif_netdev|DBG|Current variance 0 Estimated >>> variance 0. >>> 2021-11-19T02:22:42.434Z|00083|dpif_netdev|DBG|Variance improvement 0%. >>> 2021-11-19T02:22:42.434Z|00084|dpif_netdev|DBG|PMD load variance >>> improvement threshold 25% is not met. >>> 2021-11-19T02:22:42.570Z|00085|dpif_netdev|INFO|Rxq to PMD assignment mode >>> changed to: 'cycles'. >>> 2021-11-19T02:22:42.570Z|00086|dpif_netdev|INFO|Performing pmd to rx queue >>> assignment using cycles algorithm. >>> 2021-11-19T02:22:42.570Z|00087|dpif_netdev|INFO|Core 0 on numa node 0 >>> assigned port 'p0' rx queue 0 (measured processing cycles 0). >>> 2021-11-19T02:22:42.570Z|00088|dpif_netdev|INFO|Core 1 on numa node 0 >>> assigned port 'p0' rx queue 1 (measured processing cycles 0). >>> 2021-11-19T02:22:42.570Z|00089|dpif_netdev|INFO|Core 1 on numa node 0 >>> assigned port 'p0' rx queue 2 (measured processing cycles 0). >>> 2021-11-19T02:22:42.581Z|00001|dpif_netdev(pmd-c00/id:6)|DBG|Core 0 >>> processing port 'p0' with queue-id 0 >>> 2021-11-19T02:22:42.584Z|00001|dpif_netdev(pmd-c01/id:7)|DBG|Core 1 >>> processing port 'p0' with queue-id 2 >>> 2021-11-19T02:22:42.584Z|00002|dpif_netdev(pmd-c01/id:7)|DBG|Core 1 >>> processing port 'p0' with queue-id 1 >>> 2021-11-19T02:22:42.598Z|00090|unixctl|DBG|received request >>> time/warp["60000","10000"], id=0 >>> 2021-11-19T02:22:42.651Z|00091|unixctl|DBG|replying with success, id=0: >>> "warped" >>> 2021-11-19T02:23:13.232Z|00092|unixctl|DBG|received request >>> exit["--cleanup"], id=0 >>> 2021-11-19T02:23:13.232Z|00093|unixctl|DBG|replying with success, id=0: "" >>> 2021-11-19T02:23:13.233Z|00094|bridge|INFO|bridge br0: deleted interface >>> br0 on port 65534 >>> 2021-11-19T02:23:13.236Z|00095|bridge|INFO|bridge br0: deleted interface p0 >>> on port 1 >>> 2021-11-19T02:23:13.244Z|00003|dpif_netdev(pmd-c01/id:7)|DBG|Core 1 >>> processing port 'p0' with queue-id 2 >>> 2021-11-19T02:23:13.244Z|00004|dpif_netdev(pmd-c01/id:7)|DBG|Core 1 >>> processing port 'p0' with queue-id 1 >>> 2021-11-19T02:23:13.253Z|00002|dpif_netdev(pmd-c00/id:6)|DBG|Core 0 >>> processing port 'p0' with queue-id 0 >>> 2021-11-19T02:23:13.256Z|00096|dpif_netdev|INFO|Performing pmd to rx queue >>> assignment using cycles algorithm. >>> 2021-11-19T02:23:13.256Z|00097|dpif_netdev|INFO|Core 0 on numa node 0 >>> assigned port 'p0' rx queue 0 (measured processing cycles 0). >>> 2021-11-19T02:23:13.256Z|00098|dpif_netdev|INFO|Core 1 on numa node 0 >>> assigned port 'p0' rx queue 1 (measured processing cycles 0). >>> 2021-11-19T02:23:13.256Z|00099|dpif_netdev|INFO|Core 1 on numa node 0 >>> assigned port 'p0' rx queue 2 (measured processing cycles 0). >>> 2021-11-19T02:23:13.257Z|00100|dpif_netdev|INFO|PMD thread on numa_id: 0, >>> core id: 1 destroyed. >>> 2021-11-19T02:23:13.258Z|00101|dpif_netdev|INFO|PMD thread on numa_id: 0, >>> core id: 0 destroyed. >>> >> > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
