On Tue, Aug 24, 2021 at 2:20 PM Anton Ivanov
<[email protected]> wrote:
>
> On 24/08/2021 19:16, Ilya Maximets wrote:
> > On 8/24/21 7:48 PM, Numan Siddique wrote:
> >> On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov
> >> <[email protected]> wrote:
> >>>
> >>> On 24/08/2021 17:35, Ilya Maximets wrote:
> >>>> On 8/24/21 6:25 PM, Numan Siddique wrote:
> >>>>> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov
> >>>>> <[email protected]> wrote:
> >>>>>> On 24/08/2021 12:46, Ilya Maximets wrote:
> >>>>>>> On 8/24/21 1:18 PM, Anton Ivanov wrote:
> >>>>>>>> On 24/08/2021 12:05, Ilya Maximets wrote:
> >>>>>>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote:
> >>>>>>>>>> On 23/08/2021 22:36, Ilya Maximets wrote:
> >>>>>>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote:
> >>>>>>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote:
> >>>>>>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote:
> >>>>>>>>>>>>>> Should not be the case.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The map is pre-sized for the size from the previous iterations.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Line 12861 in my tree, which is probably a few commits out of
> >>>>>>>>>>>>>> date:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> And immediately after building the lflows:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) {
> >>>>>>>>>>>>>>     max_seen_lflow_size = hmap_count(&lflows);
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So the map SHOULD be sized correctly - to the most recently
> >>>>>>>>>>>>>> seen lflow count.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Please, re-read the commit message.  It's the first run of the
> >>>>>>>>>>>>> loop, and 'max_seen_lflow_size' is the default 128 at this point.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ack.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Not using auto-resizing in single-threaded mode is a bug.  Thanks
> >>>>>>>>>>>> for fixing it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> From that perspective the patch is a straight +1 from me.
> >>>>>>>>>>>>
> >>>>>>>>>>>> From the perspective of the use case stated in the commit
> >>>>>>>>>>>> message - I am not sure it addresses it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC
> >>>>>>>>>>>> database, it may never complete the first iteration before the
> >>>>>>>>>>>> expiration of the timers and everyone deciding that northd is
> >>>>>>>>>>>> AWOL.
> >>>>>>>>>>>
> >>>>>>>>>>> Well, how do you suggest to fix that?  Obviously, we can always
> >>>>>>>>>>> create a database that northd will never be able to process in a
> >>>>>>>>>>> reasonable amount of time.  And it doesn't matter if it's
> >>>>>>>>>>> single- or multi-threaded.
> >>>>>>>>>>>
> >>>>>>>>>>> In this case NbDB is only 9MB in size, which is very reasonable,
> >>>>>>>>>>> and northd on my laptop takes more than 15 minutes to process it
> >>>>>>>>>>> (I killed it at this point).  With the patch applied it took
> >>>>>>>>>>> only 11 seconds.  So, for me, this patch pretty much fixes the
> >>>>>>>>>>> issue.  11 seconds is not that bad, e.g. ovn-k8s configures
> >>>>>>>>>>> inactivity probes for clients to 180 seconds.  It would be great
> >>>>>>>>>>> to reduce that, but we're not there yet.
> >>>>>>>>>>>
> >>>>>>>>>>>> In that case, if it is multi-threaded from the start, it should
> >>>>>>>>>>>> probably grab the sizing from the lflow table hash in south db.
> >>>>>>>>>>>> That would be a good approximation for the first run.
> >>>>>>>>>>>
> >>>>>>>>>>> This will not work for a case where SbDB is empty for any reason
> >>>>>>>>>>> while NbDB is not.  And there is still a case where northd
> >>>>>>>>>>> initially connects to semi-empty databases and after a few
> >>>>>>>>>>> iterations NbDB receives a big transaction and generates a big
> >>>>>>>>>>> update for northd.
> >>>>>>>>>> A partial fix is to resize the hash to optimum size after lflow
> >>>>>>>>>> processing.
> >>>>>>>>>
> >>>>>>>>> I'm not sure I understand what you mean here, because resizing
> >>>>>>>>> after lflow processing will not help.  The lflow processing itself
> >>>>>>>>> is the very slow part that we're trying to make faster here.
> >>>>>>>>
> >>>>>>>> That can be the case only with dpgroups.  Otherwise lflows are just
> >>>>>>>> a destination to dump data, and the bucket sizing is irrelevant
> >>>>>>>> because there is never any lookup inside lflows during processing.
> >>>>>>>> The lflow map is just used to store data.  So if it is suboptimal
> >>>>>>>> at the exit of build_lflows(), the resize will fix it before the
> >>>>>>>> first lookup shortly thereafter.
> >>>>>>>>
> >>>>>>>> Are you running it with dpgroups enabled?  In that case there are
> >>>>>>>> lookups inside lflows during build which happen under a per-bucket
> >>>>>>>> lock.  So in addition to the suboptimal size when searching, the
> >>>>>>>> contention depends on the number of buckets.  If they are too few,
> >>>>>>>> the system becomes heavily contended, resulting in ridiculous
> >>>>>>>> computation times.
> >>>>>>>
> >>>>>>> Oh, I see.  Indeed, without dp-groups there is no lookup during
> >>>>>>> lflow build.  I missed that.  So, yes, I agree that for a case
> >>>>>>> without dp-groups, re-sizing after lflow processing should work.
> >>>>>>> We need that for the parallel case.
> >>>>>>>
> >>>>>>> The current patch (use hmap_insert() that resizes if needed) helps
> >>>>>>> for all non-parallel cases.
> >>>>>>
> >>>>>> Indeed.  It should go in.
> >>>>>>
> >>>>> Why can't we have hmap_insert() for both parallel and non-parallel
> >>>>> configs to start with, and switch over to hmap_insert_fast() when
> >>>>> ovn-northd has successfully connected to the SB DB and has
> >>>>> approximated the more accurate hmap size?
> >>>> We can't use hmap_insert() for parallel, because resize of a hash map
> >>>> will crash threads that are working on it at the moment, IIUC.
> >>>
> >>> We actually can for the non-dp-groups case, but the merge will become
> >>> much slower.  Example - the last version of the snapshot and monitor
> >>> parallelization for OVS.  It no longer uses pre-sized fixed hmaps
> >>> because it is nearly impossible to presize an RFC7047 structure
> >>> correctly.
> >>>
> >>> For the dp-groups case we can't.  Locking of the lflows which are used
> >>> for lookups is per hash bucket.  You cannot change the number of
> >>> buckets in the middle of the run, and other locking mechanisms will be
> >>> more coarse.  So the marginal benefit with dp-groups at present will
> >>> become none (or even negative).
> >>>
> >>>> We could disable parallel for the first several iterations, but still,
> >>>> this doesn't cover all the cases, i.e. the one where we have a big
> >>>> update from the NbDB with a fairly small DB before that.
> >>>
> >>> I am working on that.  Let me collect some stat data for the current
> >>> OVN master with all the recent changes.  They have changed the "heat
> >>> map" for the dp-groups case.
> >>>
> >>> We may need to do some adjustments to what and where we parallelize in
> >>> the dp-groups case (if at all).
> >>
> >> Ok.  You're fine with the proposed patch here, right?
> >>
> >> @Ilya - Can you please update the commit message to include that the
> >> issue is seen with dp groups enabled?
> >
> > I'm not sure about that, because the issue will still be there, just in
> > a different place.  We will not have lookups during construction of new
> > logical flows, but we will have them while trying to reconcile existing
> > SbDB flows, and this will be slow too, i.e. the loop
> > '/* Push changes to the Logical_Flow table to database. */' will be a
> > few thousand times slower.  So, the issue is less severe, but it is
> > still there even with dp groups disabled.
Right.  Thanks for pointing this out.  I applied this patch to the main
branch and backported it to branch-21.06.

Numan

>
> I am OK with the log entry as is.  In the absence of a
> hmap_expand(&lflows) right after the computation it is valid for both
> dp_groups and non dp_groups cases.
>
> >
> >> I guess you can share the commit message here.  No need to submit v2
> >> for that.
> >>
> >> Thanks
> >> Numan
> >>
> >>> A.
> >>>
> >>>>> Thanks
> >>>>> Numan
> >>>>>
> >>>>>> I will sort out the other cases to the extent possible.
> >>>>>>
> >>>>>> Brgds,
> >>>>>>
> >>>>>> A.
> >>>>>>
> >>>>>>> I'm mostly running dp-groups + non-parallel, which is a default
> >>>>>>> case for ovn-heater/ovn-k8s.
> >>>>>>>
> >>>>>>>> For the case of "dpgroups + parallel + first iteration +
> >>>>>>>> pre-existing large database" there is no cure short of
> >>>>>>>> pre-allocating the hash to maximum size.
> >>>>>>>
> >>>>>>> Yeah, dp-groups + parallel is a hard case.
> >>>>>>>
> >>>>>>>> I am scale testing that as well as resize (for non-dp-groups cases)
> >>>>>>>> at present.
> >>>>>>>>
> >>>>>>>> Brgds,
> >>>>>>>>
> >>>>>>>> A.
> >>>>>>>>
> >>>>>>>>>> If the sizing was correct, in 99.9% of cases this will be a noop.
> >>>>>>>>>>
> >>>>>>>>>> If the sizing was incorrect, it will be resized so that the DP
> >>>>>>>>>> searches and all other ops which were recently added for flow
> >>>>>>>>>> reduction will work optimally.
> >>>>>>>>>>
> >>>>>>>>>> This still does not work for lflow compute with dpgroups +
> >>>>>>>>>> parallel upon initial connect and without a SB database to use
> >>>>>>>>>> for size guidance.  It will work for all other cases.
> >>>>>>>>>>
> >>>>>>>>>> I will send two separate patches to address the cases which can
> >>>>>>>>>> be easily addressed, and see what can be done with dp+parallel
> >>>>>>>>>> upon initial connect to an empty SB database.
> >>>>>>>>>>
> >>>>>>>>>> Brgds,
> >>>>>>>>>>
> >>>>>>>>>> A
> >>>>>>>>>>
> >>>>>>>>>>>> A.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> A.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote:
> >>>>>>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast
> >>>>>>>>>>>>>>> insertion.  This leads to the case where we have a hash map
> >>>>>>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends
> >>>>>>>>>>>>>>> up iterating over n_lflows / 128 entries.  It's thousands of
> >>>>>>>>>>>>>>> logical flows or even more.  For example, it takes forever
> >>>>>>>>>>>>>>> for ovn-northd to start up with the Northbound DB from the
> >>>>>>>>>>>>>>> 120 node density-heavy test from ovn-heater, because every
> >>>>>>>>>>>>>>> lookup is slower than the previous one.  I aborted the
> >>>>>>>>>>>>>>> process after 15 minutes of waiting, because there was no
> >>>>>>>>>>>>>>> sign that it would converge.  With this change applied the
> >>>>>>>>>>>>>>> loop completes in only 11 seconds.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The hash map will be pre-allocated to the maximum seen
> >>>>>>>>>>>>>>> number of logical flows on a second iteration, but this
> >>>>>>>>>>>>>>> doesn't help for the first iteration, when northd connects
> >>>>>>>>>>>>>>> for the first time to a big Northbound database, which is a
> >>>>>>>>>>>>>>> common case during failover or cluster upgrade.  And there
> >>>>>>>>>>>>>>> is an even trickier case where a big NbDB transaction
> >>>>>>>>>>>>>>> explodes the number of logical flows received on a run other
> >>>>>>>>>>>>>>> than the first.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We can't expand the hash map in case of parallel build, so
> >>>>>>>>>>>>>>> this should be fixed separately.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> CC: Anton Ivanov <[email protected]>
> >>>>>>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build")
> >>>>>>>>>>>>>>> Signed-off-by: Ilya Maximets <[email protected]>
> >>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>   northd/ovn-northd.c | 6 +++++-
> >>>>>>>>>>>>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> >>>>>>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644
> >>>>>>>>>>>>>>> --- a/northd/ovn-northd.c
> >>>>>>>>>>>>>>> +++ b/northd/ovn-northd.c
> >>>>>>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> >>>>>>>>>>>>>>>                                nullable_xstrdup(ctrl_meter),
> >>>>>>>>>>>>>>>                                ovn_lflow_hint(stage_hint), where);
> >>>>>>>>>>>>>>>      hmapx_add(&lflow->od_group, od);
> >>>>>>>>>>>>>>> -    hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
> >>>>>>>>>>>>>>> +    if (!use_parallel_build) {
> >>>>>>>>>>>>>>> +        hmap_insert(lflow_map, &lflow->hmap_node, hash);
> >>>>>>>>>>>>>>> +    } else {
> >>>>>>>>>>>>>>> +        hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
> >>>>>>>>>>>>>>> +    }
> >>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>  /* Adds a row with the specified contents to the Logical_Flow table. */
> >>>>>> --
> >>>>>> Anton R. Ivanov
> >>>>>> Cambridgegreys Limited. Registered in England. Company Number 10273661
> >>>>>> https://www.cambridgegreys.com/
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> dev mailing list
> >>>>>> [email protected]
> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>> --
> >>> Anton R. Ivanov
> >>> Cambridgegreys Limited. Registered in England.
> >>> Company Number 10273661
> >>> https://www.cambridgegreys.com/
> >>>
> >>> _______________________________________________
> >>> dev mailing list
> >>> [email protected]
> >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>>
> >
>
> --
> Anton R. Ivanov
> Cambridgegreys Limited. Registered in England. Company Number 10273661
> https://www.cambridgegreys.com/
>
> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
