Re: [ovs-dev] [PATCH ovn] northd: Add back check for datapath hmap.

Mark Michelson via dev Mon, 01 Dec 2025 10:19:11 -0800

On Mon, Dec 1, 2025 at 1:02 PM Ales Musil <[email protected]> wrote:
>
>
>
> On Mon, Dec 1, 2025 at 6:53 PM Mark Michelson <[email protected]> wrote:
>>
>> On Mon, Dec 1, 2025 at 12:15 PM Ales Musil <[email protected]> wrote:
>> >
>> >
>> >
>> > On Mon, Dec 1, 2025 at 5:40 PM Ales Musil <[email protected]> wrote:
>> >>
>> >>
>> >>
>> >> On Mon, Dec 1, 2025 at 5:37 PM Mark Michelson <[email protected]> wrote:
>> >>>
>> >>> On Mon, Dec 1, 2025 at 10:50 AM Ales Musil via dev
>> >>> <[email protected]> wrote:
>> >>> >
>> >>> > The hmap could be NULL under one circumstance, node depending
>> >>> > on the northd data would run before the northd node right
>> >>> > after the start. Make sure we won't crash in that case.
>> >>>
>> >>> Can you explain this in a bit more detail? How does a node that
>> >>> depends on northd data run before the northd node?
>> >>
>> >>
>> >> It also depends on SB nodes. So it will get triggered from the SB update 
>> >> first.
>> >
>> >
>> > Looking through the issue once more you are right, that doesn't make sense.
>> > It's actually slightly different. We only pass ls_datapaths, so if the 
>> > datapath is lr
>> > our lr_datapaths are NULL so we would crash there. I can change the commit 
>> > message
>> > during merge to reflect that better:
>> >
>> > northd: Add back check for datapath hmap.
>> >
>> > In some cases northd is passing only ls_datapaths/lr_datapaths
>> > into the ovn_datapath_from_sbrec(), the other hmap being NULL.
>> > This is problematic when the function runs into the second datapath
>> > type "unexpectedly" during processing, which would lead to NULL ptr
>> > deref and crash:
>> >
>> > openvswitch/hmap.h:360:40: runtime error: member access within null 
>> > pointer of type 'const struct hmap'
>> >     0x0000004f5b74 in hmap_first_with_hash openvswitch/hmap.h:360:40
>> >     0x0000004f1d3a in ovn_datapath_find_ northd/northd.c:601:5
>> >     0x0000004f21ad in ovn_datapath_from_sbrec northd/northd.c:654:31
>> >     0x00000060aa50 in build_mcast_groups northd/en-multicast.c:280:14
>> >     0x000000608c4b in en_multicast_igmp_run northd/en-multicast.c:135:5
>> >     0x0000006f11ed in engine_recompute lib/inc-proc-eng.c:445:33
>> >     0x0000006ef315 in engine_run_node lib/inc-proc-eng.c:519:9
>> >     0x0000006ef315 in engine_run lib/inc-proc-eng.c:576:9
>> >     0x0000006635fc in inc_proc_northd_run northd/inc-proc-northd.c:608:5
>> >     0x0000005dc196 in main northd/ovn-northd.c:1109:36
>> >
>> > Make sure we check for the hmap being NULL before attempting to
>> > find the ovn_datapath struct.
>> >
>> > WDYT?
>>
>> That is much more clear than the original commit message, thanks.
>> However, this seems like a strange problem to have. en-multicast is
>> providing a SB datapath that it thinks should be a logical switch to
>> ovn_datapath_from_sbrec(). However, the datapath is actually a logical
>> router. This seems like something that should be fixed in the IGMP
>> processing code. If it's valid for IGMP groups to reference logical
>> routers, then both maps should be provided to
>> ovn_datapath_from_sbrec(). If IGMP groups cannot reference logical
>> routers, then there is some bug causing IGMP groups to have incorrect
>> datapath references.
>
>
> To answer both your and Ilya's comment. This is a little tricky
> it seems that during the test the tunnel_key was "swapped" between
> logical switch and logical router. So the code that process and created
> igmp_group would get the "wrong" datapath, in this case LR one
> while in fact it should have been LS, northd will actually remove that
> IGMP group eventually. So the SB DB will remain consistent in the end.
> We could probably check in pinctrl if the datapath is logical switch before
> attempting to create the IGMP group, that would solve this particular issue
> I guess. However, the fact remains that this check was in fact there 
> previously
> that's why we didn't see this happen. If you wish I can rework this and try 
> the
> pinctrl approach (not sure if it would break anything else). I would 
> personally
> rather got this fixed before anyone else runs into this or as matter of fact 
> any
> other potential "misuse" of the API.


OK, thanks again for the explanation. I think that this patch can
serve as a temporary fix.

It sounds like there is a bigger problem of IP multicast using SB
datapath binding tunnel keys as identifiers. SB datapath tunnel keys
can change due to updates to NB requested_tunnel_key values, or due to
enabling/disabling VXLAN mode. There may be other ways they can change
as well that I am forgetting about.

We should open an issue to change IP multicast to use the SB datapath
UUID (or some other invariant) instead of the tunnel key. When this is
done, we can also switch to using ovs_assert() in
ovn_datapath_from_sbrec() as Ilya suggested. This is not a trivial
fix, so that's why I think the NULL check can serve as a temporary fix
just to get rid of the crash.

With the assumption that further work will be done to fix the IP
multicast code and this change will be updated in the future,
Acked-by: Mark Michelson <[email protected]>

>
>>
>> >>
>> >>
>> >>>
>> >>>
>> >>> >
>> >>> > Fixes: 6919992d8781 ("Datapath_Binding: Separate type column and sync 
>> >>> > NB.UUID to SB.")
>> >>> > Reported-at: https://issues.redhat.com/browse/FDP-2757
>> >>> > Signed-off-by: Ales Musil <[email protected]>
>> >>> > ---
>> >>> >  northd/northd.c     |  4 ++++
>> >>> >  tests/ovn-northd.at | 29 +++++++++++++++++++++++++++++
>> >>> >  2 files changed, 33 insertions(+)
>> >>> >
>> >>> > diff --git a/northd/northd.c b/northd/northd.c
>> >>> > index ec219a0c7..3bd1f6849 100644
>> >>> > --- a/northd/northd.c
>> >>> > +++ b/northd/northd.c
>> >>> > @@ -647,6 +647,10 @@ ovn_datapath_from_sbrec(const struct hmap 
>> >>> > *ls_datapaths,
>> >>> >          return NULL;
>> >>> >      }
>> >>> >
>> >>> > +    if (!dps) {
>> >>> > +        return NULL;
>> >>> > +    }
>> >>> > +
>> >>> >      struct ovn_datapath *od = ovn_datapath_find_(dps, &key);
>> >>> >      if (od && (od->sdp->sb_dp == sb)) {
>> >>> >          return od;
>> >>> > diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
>> >>> > index 380c9ff80..0bc2808ac 100644
>> >>> > --- a/tests/ovn-northd.at
>> >>> > +++ b/tests/ovn-northd.at
>> >>> > @@ -18885,3 +18885,32 @@ AT_CHECK([ovn-sbctl lflow-list ls1 | grep 
>> >>> > ls_in_apply_port_sec | ovn_strip_lflow
>> >>> >
>> >>> >  AT_CLEANUP
>> >>> >  ])
>> >>> > +
>> >>> > +OVN_FOR_EACH_NORTHD_NO_HV([
>> >>> > +AT_SETUP([IGMP northd crash])
>> >>> > +ovn_start
>> >>> > +
>> >>> > +check ovn-nbctl lr-add lr
>> >>> > +check ovn-sbctl chassis-add hv1 geneve 127.0.0.1
>> >>> > +
>> >>> > +# Kill northd and start it paused this time.
>> >>> > +as northd
>> >>> > +OVS_APP_EXIT_AND_WAIT([ovn-northd])
>> >>> > +ovn_start_northd --paused
>> >>> > +
>> >>> > +# Create IGMP Group for lr datapath.
>> >>> > +check_uuid ovn-sbctl create IGMP_Group address=mrouters \
>> >>> > +    datapath=$(fetch_column Datapath_Binding _uuid 
>> >>> > external_ids:name=lr) \
>> >>> > +    chassis=$(fetch_column Chassis _uuid name=hv1) \
>> >>> > +    chassis_name=hv1
>> >>> > +wait_row_count Igmp_Group 1 address=mrouters
>> >>> > +
>> >>> > +# Unpause northd and check if we didn't crash.
>> >>> > +as northd ovn-appctl -t ovn-northd resume
>> >>> > +check ovn-nbctl --wait=sb sync
>> >>> > +
>> >>> > +wait_row_count Igmp_Group 0 address=mrouters
>> >>> > +
>> >>> > +OVN_CLEANUP_NORTHD
>> >>> > +AT_CLEANUP
>> >>> > +])
>> >>> > --
>> >>> > 2.51.1
>> >>> >
>> >>> > _______________________________________________
>> >>> > dev mailing list
>> >>> > [email protected]
>> >>> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> >>> >
>> >>>
>>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH ovn] northd: Add back check for datapath hmap.

Reply via email to