On Wed, Feb 17, 2021 at 2:26 AM Ilya Maximets <[email protected]> wrote:
>
> On 2/16/21 8:48 PM, [email protected] wrote:
> > From: Numan Siddique <[email protected]>
> >
> > In one of the scaled deployments, ovn-controller is asserting with the
> > below stack trace
> >
> > ***
> > (gdb) bt
> > 0 raise () from /lib64/libc.so.6
> > 1 abort () from /lib64/libc.so.6
> > 2 ovs_abort_valist ("%s: assertion %s failed in %s()") at lib/util.c:419
> > 3 vlog_abort_valist ("%s: assertion %s failed in %s()") at
> > lib/vlog.c:1249
> > 4 vlog_abort ("%s: assertion %s failed in %s()") at lib/vlog.c:1263
> > 5 ovs_assert_failure (where="controller/ofctrl.c:1198",
> > function="flood_remove_flows_for_sb_uuid",
> > condition="ovs_list_is_empty(&f->list_node)") at
> > lib/util.c:86
> > 6 flood_remove_flows_for_sb_uuid (sb_uuid=...538,
> > flood_remove_nodes=...ed0) at controller/ofctrl.c:1205
> > 7 flood_remove_flows_for_sb_uuid (sb_uuid=...898,
> > flood_remove_nodes=...ed0) at controller/ofctrl.c:1230
> > 8 flood_remove_flows_for_sb_uuid (sb_uuid=...bf0,
> > flood_remove_nodes=...ed0) at controller/ofctrl.c:1230
> > 9 ofctrl_flood_remove_flows (flood_remove_nodes=...ed0) at
> > controller/ofctrl.c:1250
> > 10 lflow_handle_changed_ref (ref_type=REF_TYPE_PORTGROUP,
> > ref_name= "5564_pg_64...bac") at controller/lflow.c:612
> > 11 _flow_output_resource_ref_handler (ref_type=REF_TYPE_PORTGROUP)
> > at controller/ovn-controller.c:2181
> > 12 engine_compute () at lib/inc-proc-eng.c:306
> > 13 engine_run_node (recompute_allowed=true) at lib/inc-proc-eng.c:352
> > 14 engine_run (recompute_allowed=true) at lib/inc-proc-eng.c:377
> > 15 main () at controller/ovn-controller.c:2794
> > ***
> >
> > This assertion is seen when a port group gets updated and it is referenced
> > by many
> > logical flows (with conj actions). The function
> > ofctrl_flood_remove_flows(), calls
> > flood_remove_flows_for_sb_uuid() for each sb uuid in the hmap -
> > flood_remove_nodes
> > using HMAP_FOR_EACH (flood_remove_nodes). flood_remove_flows_for_sb_uuid()
> > also takes
> > the hmap 'flood_remove_nodes' as an argument and it inserts few items into
> > it when
> > it has to call itself recursively. When an item is inserted, its possible
> > that the
> > hmap may get expanded. And if this happens, the HMAP_FOR_EACH () skips few
> > entries
> > causing some of the desired flows not getting cleared.
> >
> > Later when ofctrl_add_or_append_flow() is called, there would be multiple
> > 'struct sb_flow_ref' references for the same desired flow. And this causes
> > the
> > above assertion later when the same port group gets updated.
> >
> > This patch fixes this issue by cloning the hmap 'flood_remove_nodes' and
> > using it to
> > iterate the flood remove nodes. Also a test case is added to cover this
> > scenario.
> >
> > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1928012
>
> Fixes: 580aea72e26f ("ovn-controller: Fix conjunction handling with
> incremental processing.")
>
> > Suggested-by: Ilya Maximetes <[email protected]>
> > Signed-off-by: Numan Siddique <[email protected]>
> > ---
> > controller/ofctrl.c | 15 ++++++++-
> > tests/ovn.at | 75 +++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 89 insertions(+), 1 deletion(-)
> >
> > diff --git a/controller/ofctrl.c b/controller/ofctrl.c
> > index 9d62e1260..9de912478 100644
> > --- a/controller/ofctrl.c
> > +++ b/controller/ofctrl.c
> > @@ -1247,10 +1247,23 @@ ofctrl_flood_remove_flows(struct
> > ovn_desired_flow_table *flow_table,
> > struct hmap *flood_remove_nodes)
> > {
> > struct ofctrl_flood_remove_node *ofrn;
> > +
> > + /* flood_remove_flows_for_sb_uuid() will modify the
> > 'flood_remove_nodes'
> > + * hash map by inserting new items, so we can't use it for iteration.
> > + * Copying the sb_uuids into an array. */
> > + struct uuid *sb_uuids;
> > + int i, n = 0;
>
> I'd add an empty line here since it's not a single-line definition or
> just moved variable definitions to the top. But that's not really important.
> Fine for me either way or as it is now.
Ack done.
>
> > + sb_uuids = xmalloc(hmap_count(flood_remove_nodes) * sizeof *sb_uuids);
> > + struct hmap flood_remove_uuids = HMAP_INITIALIZER(&flood_remove_uuids);
> > HMAP_FOR_EACH (ofrn, hmap_node, flood_remove_nodes) {
> > - flood_remove_flows_for_sb_uuid(flow_table, &ofrn->sb_uuid,
> > + sb_uuids[n++] = ofrn->sb_uuid;
> > + }
> > +
> > + for (i = 0; i < n; i++) {
> > + flood_remove_flows_for_sb_uuid(flow_table, &sb_uuids[i],
> > flood_remove_nodes);
> > }
> > + free(sb_uuids);
> >
> > /* remove any related group and meter info */
> > HMAP_FOR_EACH (ofrn, hmap_node, flood_remove_nodes) {
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index 55ea6d170..a02098bac 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -24049,3 +24049,78 @@ wait_column "true" nb:Logical_Switch_Port up
> > name=lsp1
> >
> > OVN_CLEANUP([hv1])
> > AT_CLEANUP
> > +
> > +# Test case to check that ovn-controller doesn't assert when
> > +# handling port group updates.
> > +AT_SETUP([ovn -- No ovn-controller assert for port group updates])
> > +ovn_start
> > +
> > +net_add n1
> > +sim_add hv1
> > +as hv1
> > +ovs-vsctl add-br br-phys
> > +ovn_attach n1 br-phys 192.168.0.10
> > +
> > +as hv1
> > +ovs-vsctl \
> > + -- add-port br-int vif1 \
> > + -- set Interface vif1 external_ids:iface-id=sw0-port1 \
> > + ofport-request=1
> > +
> > +check ovn-nbctl ls-add sw0
> > +check ovn-nbctl lsp-add sw0 sw0-port1
> > +check ovn-nbctl lsp-set-addresses sw0-port1 "10:14:00:00:00:01 192.168.0.2"
> > +
> > +check ovn-nbctl lsp-add sw0 sw0-port2
> > +check ovn-nbctl lsp-add sw0 sw0-port3
> > +check ovn-nbctl lsp-add sw0 sw0-port4
> > +check ovn-nbctl lsp-add sw0 sw0-port5
> > +check ovn-nbctl lsp-add sw0 sw0-port6
> > +check ovn-nbctl lsp-add sw0 sw0-port7
> > +
> > +ovn-nbctl create address_set name=as1
> > +ovn-nbctl set address_set . addresses="10.0.0.10,10.0.0.11,10.0.0.12"
> > +
> > +ovn-nbctl pg-add pg1 sw0-port1 sw0-port2 sw0-port3
> > +ovn-nbctl acl-add pg1 to-lport 1002 "outport == @pg1 && ip4.dst == \$as1
> > && icmp4" drop
> > +ovn-nbctl acl-add pg1 to-lport 1002 "outport == @pg1 && ip4.dst == \$as1
> > && tcp && tcp.dst >=10000 && tcp.dst <= 20000" drop
> > +ovn-nbctl acl-add pg1 to-lport 1002 "outport == @pg1 && ip4.dst == \$as1
> > && udp && udp.dst >=10000 && udp.dst <= 20000" drop
> > +
> > +ovn-nbctl pg-add pg2 sw0-port2 sw0-port3 sw0-port4 sw0-port5
> > +ovn-nbctl acl-add pg2 to-lport 1002 "outport == @pg2 && ip4.dst == \$as1
> > && icmp4" allow-related
> > +ovn-nbctl acl-add pg2 to-lport 1002 "outport == @pg2 && ip4.dst == \$as1
> > && tcp && tcp.dst >=30000 && tcp.dst <= 40000" drop
> > +ovn-nbctl acl-add pg2 to-lport 1002 "outport == @pg2 && ip4.dst == \$as1
> > && udp && udp.dst >=30000 && udp.dst <= 40000" drop
> > +
> > +ovn-nbctl pg-add pg3 sw0-port1 sw0-port5
> > +ovn-nbctl acl-add pg3 to-lport 1002 "outport == @pg3 && ip4.dst == \$as1
> > && icmp4" allow-related
> > +ovn-nbctl acl-add pg3 to-lport 1002 "outport == @pg3 && ip4.dst == \$as1
> > && tcp && tcp.dst >=20000 && tcp.dst <= 30000" allow-related
> > +ovn-nbctl acl-add pg3 to-lport 1002 "outport == @pg3 && ip4.dst == \$as1
> > && udp && udp.dst >=20000 && udp.dst <= 30000" allow-related
> > +
> > +AS_BOX([Delete and add the port groups multiple times])
> > +
> > +i=0
> > +while [[ $i -lt 10 ]]
> > +do
> > + ovn-nbctl --wait=sb clear port_Group pg1 ports
> > + ovn-nbctl --wait=sb clear port_Group pg2 ports
> > + ovn-nbctl --wait=sb clear port_Group pg3 ports
> > + ovn-nbctl --wait=sb pg-set-ports pg1 sw0-port1
> > + ovn-nbctl --wait=sb pg-set-ports pg1 sw0-port1 sw0-port4
> > + ovn-nbctl --wait=sb pg-set-ports pg1 sw0-port1 sw0-port4 sw0-port5
> > +
> > + ovn-nbctl --wait=sb pg-set-ports pg2 sw0-port2
> > + ovn-nbctl --wait=sb pg-set-ports pg2 sw0-port2 sw0-port6
> > + ovn-nbctl --wait=sb pg-set-ports pg2 sw0-port2 sw0-port6 sw0-port7
> > +
> > + ovn-nbctl --wait=sb pg-set-ports pg3 sw0-port1
> > + ovn-nbctl --wait=sb pg-set-ports pg3 sw0-port1 sw0-port3
> > + ovn-nbctl --wait=sb pg-set-ports pg3 sw0-port1 sw0-port3 sw0-port6
> > +
> > + i=$((i+1))
>
> This is likely not portable across different shells.
> It's better to replace the loop with simple 'for i in `seq 1 10`' and
> avoid this increment.
>
> > + # Make sure that ovn-controller has not asserted.
> > + AT_CAPTURE_FILE([hv1/ovn-controller.log])
> > + AT_CHECK([test 0 = $(grep -c "assertion
> > ovs_list_is_empty(&f->list_node) failed in
> > flood_remove_flows_for_sb_uuid()" hv1/ovn-controller.log)])
>
> This check looks too specific and could fall apart and stop checking
> anything if some part of the code will be changed, e.g. variables
> get renamed or assertion changed to another one. It seems better to
> just check if process is still running. Also '--wait=sb' above
> should probably be '--wait=hv', and they will catch the actual
> crash of ovn-controller since they will not be able to wait for hv.
Totally makes sense. In my testing somehow "--wait=hv" was blocked indefinitely.
But now it works fine i.e it times out if ovn-controller crashes when
this test case is run without
the fix. (Maybe I was too sleepy yesterday :)).
Thanks for the reviews. I applied this patch to master and backported
to branch-20.12 after
addressing your comments and incorporating your suggested changes.
I will backport the patch further down if it is required.
Thanks
Numan
>
> Suggesting the incremental change below. With this change:
>
> Acked-by: Ilya Maximets <[email protected]>
>
> ---
> diff --git a/tests/ovn.at b/tests/ovn.at
> index a02098bac..2b7a0705b 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -24098,28 +24098,25 @@ ovn-nbctl acl-add pg3 to-lport 1002 "outport ==
> @pg3 && ip4.dst == \$as1 && udp
>
> AS_BOX([Delete and add the port groups multiple times])
>
> -i=0
> -while [[ $i -lt 10 ]]
> +for i in `seq 1 10`
> do
> - ovn-nbctl --wait=sb clear port_Group pg1 ports
> - ovn-nbctl --wait=sb clear port_Group pg2 ports
> - ovn-nbctl --wait=sb clear port_Group pg3 ports
> - ovn-nbctl --wait=sb pg-set-ports pg1 sw0-port1
> - ovn-nbctl --wait=sb pg-set-ports pg1 sw0-port1 sw0-port4
> - ovn-nbctl --wait=sb pg-set-ports pg1 sw0-port1 sw0-port4 sw0-port5
> -
> - ovn-nbctl --wait=sb pg-set-ports pg2 sw0-port2
> - ovn-nbctl --wait=sb pg-set-ports pg2 sw0-port2 sw0-port6
> - ovn-nbctl --wait=sb pg-set-ports pg2 sw0-port2 sw0-port6 sw0-port7
> -
> - ovn-nbctl --wait=sb pg-set-ports pg3 sw0-port1
> - ovn-nbctl --wait=sb pg-set-ports pg3 sw0-port1 sw0-port3
> - ovn-nbctl --wait=sb pg-set-ports pg3 sw0-port1 sw0-port3 sw0-port6
> -
> - i=$((i+1))
> + check ovn-nbctl --wait=hv clear port_Group pg1 ports
> + check ovn-nbctl --wait=hv clear port_Group pg2 ports
> + check ovn-nbctl --wait=hv clear port_Group pg3 ports
> + check ovn-nbctl --wait=hv pg-set-ports pg1 sw0-port1
> + check ovn-nbctl --wait=hv pg-set-ports pg1 sw0-port1 sw0-port4
> + check ovn-nbctl --wait=hv pg-set-ports pg1 sw0-port1 sw0-port4 sw0-port5
> +
> + check ovn-nbctl --wait=hv pg-set-ports pg2 sw0-port2
> + check ovn-nbctl --wait=hv pg-set-ports pg2 sw0-port2 sw0-port6
> + check ovn-nbctl --wait=hv pg-set-ports pg2 sw0-port2 sw0-port6 sw0-port7
> +
> + check ovn-nbctl --wait=hv pg-set-ports pg3 sw0-port1
> + check ovn-nbctl --wait=hv pg-set-ports pg3 sw0-port1 sw0-port3
> + check ovn-nbctl --wait=hv pg-set-ports pg3 sw0-port1 sw0-port3 sw0-port6
> +
> # Make sure that ovn-controller has not asserted.
> - AT_CAPTURE_FILE([hv1/ovn-controller.log])
> - AT_CHECK([test 0 = $(grep -c "assertion ovs_list_is_empty(&f->list_node)
> failed in flood_remove_flows_for_sb_uuid()" hv1/ovn-controller.log)])
> + AT_CHECK([kill -0 `cat hv1/ovn-controller.pid`])
> done
>
> OVN_CLEANUP([hv1])
> ---
> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev