On Sat, Oct 8, 2022 at 1:12 PM Dumitru Ceara <[email protected]> wrote:
>
> On 10/7/22 23:16, Han Zhou wrote:
> > On Fri, Oct 7, 2022 at 12:23 PM Ilya Maximets <[email protected]> wrote:
> >>
> >> On 10/7/22 20:31, Han Zhou wrote:
> >>>
> >>>
> >>> On Fri, Oct 7, 2022 at 10:24 AM Ilya Maximets <[email protected]> wrote:
> >>>>
> >>>> On 10/7/22 10:26, Dumitru Ceara wrote:
> >>>>> On 10/7/22 04:03, Han Zhou wrote:
> >>>>>> On Fri, Sep 30, 2022 at 7:02 AM Dumitru Ceara <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Expand SB.Template_Var records in two stages:
> >>>>>>> 1. first expand them to local values in match/action strings
> >>>>>>> 2. then reparse the expanded strings
> >>>>>>>
> >>>>>>> For the case when a lflow references a Template_Var also track
> >>>>>>> references (similar to the ones maintained for multicast groups,
> >>>>>>> address sets, port_groups, port bindings).
> >>>>>>>
> >>>>>>> Signed-off-by: Dumitru Ceara <[email protected]>
> >>>>>>
> >>>>>> Hi Dumitru,
> >>>>>>
> >>>>>
> >>>>> Hi Han,
> >>>>>
> >>>>>> In addition to the two-stage parsing concerns we are discussing in the
> >>>>>
> >>>>> I'm doing some more targetted performance testing.  I'll update the
> >>>>> other thread when I have more data to share.
> >>>>>
> >>>>>> other thread, I have some minor comments below. The major one is
> >>>>>> whether we should allow matching hostname or not.
> >>>>>>
> >>>>>>> ---
> >>>>>>>  controller/lflow.c          |   59 +++++++-
> >>>>>>>  controller/lflow.h          |    1
> >>>>>>>  controller/lport.c          |    3
> >>>>>>>  controller/ofctrl.c         |    9 +
> >>>>>>>  controller/ofctrl.h         |    3
> >>>>>>>  controller/ovn-controller.c |  317 +++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>  include/ovn/expr.h          |    4 -
> >>>>>>>  include/ovn/lex.h           |   14 +-
> >>>>>>>  lib/actions.c               |    9 +
> >>>>>>>  lib/expr.c                  |   18 ++
> >>>>>>>  lib/lex.c                   |   55 +++++++
> >>>>>>>  lib/objdep.c                |    1
> >>>>>>>  lib/objdep.h                |    1
> >>>>>>>  lib/ovn-util.c              |    7 +
> >>>>>>>  lib/ovn-util.h              |    3
> >>>>>>>  tests/ovn.at                |    2
> >>>>>>>  tests/test-ovn.c            |   16 ++
> >>>>>>>  utilities/ovn-trace.c       |   26 +++-
> >>>>>>>  18 files changed, 512 insertions(+), 36 deletions(-)
> >>>>>>>
> >>>>
> >>>> ...
> >>>>
> >>>>>>> +struct ed_type_template_vars {
> >>>>>>> +    struct local_templates var_table;
> >>>>>>> +
> >>>>>>> +    bool change_tracked;
> >>>>>>> +    struct sset new;
> >>>>>>> +    struct sset deleted;
> >>>>>>> +    struct sset updated;
> >>>>>>> +};
> >>>>>>> +
> >>>>>>> +static void
> >>>>>>> +template_vars_init(const struct sbrec_template_var_table *tv_table,
> >>>>>>> +                   const struct sbrec_chassis *chassis,
> >>>>>>> +                   struct local_templates *var_table)
> >>>>>>> +{
> >>>>>>> +    const struct sbrec_template_var *tv;
> >>>>>>> +    SBREC_TEMPLATE_VAR_TABLE_FOR_EACH (tv, tv_table) {
> >>>>>>> +        if (chassis_name_equals(tv->chassis_name, chassis)) {
> >>>>>>
> >>>>>> I am not sure if it is a good idea to allow using hostname to match the
> >>>>>> template var name. It provides flexibility to CMS, but we will need
> >>>>>> more complexity to protect against corner cases.
> >>>>>> For example, if there are two records:
> >>>>>> r1: name="abc", value="v1"
> >>>>>> r2: hostname="abc", value="v2"
> >>>>>>
> >>>>>> Now with the current logic, whichever is handled later will take
> >>>>>> precedence (in the local_templates.vars) and the value will be used
> >>>>>> (assume r2 "v2" is used). This may be fine, because the user should
> >>>>>> be responsible for the inconsistent configurations.
> >>>>>>
> >>>>>> Later, when the user detects the problem and wants to correct the
> >>>>>> configuration, he/she deletes r2 and expects the var "abc" to be
> >>>>>> expanded as "v1". But the logic in template_vars_update() would call
> >>>>>> local_templates_remove(), which simply deletes the var ("abc" ->
> >>>>>> "v2") instead of replacing it with ("abc" -> "v1"). The uuid of
> >>>>>> ("abc" -> "v1") will still be left in the uuidset, which is useless.
> >>>>>> This is unexpected behavior.
> >>>>>>
> >>>>>> Similar behavior would happen if there are duplicate hostnames, e.g.:
> >>>>>> r1: hostname="abc", value="v1"
> >>>>>> r2: hostname="abc", value="v2"
> >>>>>>
> >>>>>
> >>>>> Very good point, nice catch!
> >>>>
> >>>> In general it might make sense to choose a bit different database
> >>>> schema, e.g.:
> >>>>
> >>>>         "Chassis_Template_Vars": {
> >>>>             "columns": {
> >>>>                 "chassis": {"type": "string"},
> >>>>                 "variables": {
> >>>>                     "type": {"key": "string", "value": "string",
> >>>>                              "min": 0, "max": "unlimited"}}},
> >>>>             "indexes": [["chassis"]],
> >>>>             "isRoot": true}
> >>>>
> >>>> Here 'variables' or 'templates' or whatever you want to call it is a
> >>>> Var->Value map.
> >>>>
> >>>> Index on the 'chassis' column will provide uniqueness of chassis
> >>>> names, and the map has unique keys, so all variable names are unique
> >>>> within a chassis as well.  This should cover all the possible
> >>>> misconfigurations at the database level.
> >>>>
> >>>> As a bonus we will also save a lot of database space by not storing
> >>>> millions of copies of chassis and variable names.  May speed up the
> >>>> Nb->Sb synchronization as well.
> >>>>
> >>>> One downside is that we can't have true I-P for that table due to
> >>>> inability to track which values in a map actually changed.  Though
> >>>> it should be possible to figure out the diff in a semi-linear time
> >>>> from the number of variables.  OpenFlow rules can still be processed
> >>>> incrementally after that.  So, I'm not sure if that is a big
> >>>> performance concern.  Testing is needed, I guess.
> >>>>
> >>>> Conditional monitoring will be very simple.  Chassis will receive
> >>>> only one row in most cases.  update2 will cover variable updates.
> >>>>
> >>>> What do you think?
>
> Interesting, thanks for the suggestion!
>
> >>>>
> >>>> Best regards, Ilya Maximets.
> >>>
> >>> Thanks, Ilya, for the idea. I think it is definitely a good candidate
> >>> to be evaluated, but I am slightly in favor of the original schema,
> >>> and here is why.
> >>> Firstly, I think the benefit of avoiding misconfigurations is not that
> >>> important, if we agree that it is good enough to stick with
> >>> "chassis_name" and not allow hostname matching.
> >>
> >> To be clear I wasn't arguing about hostnames, I don't think they are
> >> needed.  It was a suggestion for a general look of a schema regardless.
> >>
> >>>
> >>> So the more important factor to consider is actually the performance.
> >>> On this front, the upside is, as you mentioned, that the size of the
> >>> table would be smaller. However, I don't see it as a very strong
> >>> benefit. The only redundant part here is the chassis_name, which is
> >>> usually a uuid, not too short but also not too long.
> >>
> >> A UUID takes 38 bytes.  A 250-node setup with 10K vars will result
> >> in about 100 MB of overhead.  It is noticeable.  Not a huge amount,
> >> I agree, but fairly noticeable.
> >>
> >>> I agree that the waste of space for this can be big if there are tens
> >>> of thousands of vars per chassis, but compared with the logical flow
> >>> table size this is still a very small portion of the total size. In
> >>> addition, the DB sync matters mostly at chassis startup, and then each
> >>> incremental change should be very small.
> >>
> >> I was thinking about Nb->Sb sync on northd.  We don't really have
> >> any real I-P there (or did I miss it in the patch set? I didn't
> >> look very closely), so northd will need to re-check all that on
> >> every iteration.  Since this scales up very fast with the number
> >> of nodes, I'm concerned about northd performance here.
> >>
> > Ok, makes sense. Sorry, I didn't notice that you were talking
> > about nb->sb instead of sb->hv.
> >
>
> Chassis template variables are completely unrelated to any other tables
> populated by northd.  So I think it's (relatively) easy to add I-P in
> northd for chassis template var processing.
>
> >>>
> >>> On the other hand, I am more concerned with the OVSDB performance for
> >>> handling the huge map/set column data. I understand that there have
> >>> been lots of improvements made for the mutations, but I wonder if it
> >>> is still less efficient than simple CRUDs for very small rows.
> >>
> >> Almost.  I'm not concerned about performance of ovsdb-server.
> >> We're frequently running large scale tests with ovn-heater with
> >> tens of thousands load-balancers in a single group, for example,
> >> and ovsdb-server is barely loaded.
> >>
> >>> It *might* also be easier to further optimize the row-based operations
> >>> (such as creating an index for more efficient conditional monitor
> >>> updates) than for big columns.
> >>>
> >>> And of course, a strong negative side, as you mentioned, is the I-P
> >>> handling, which is definitely important. The current patch shows great
> >>> simplicity in I-P compared with the address-set I-P, which had a lot of
> >>> complexity and overhead just for preprocessing and tracking the changes.
> >>
> >> It's a simple smap comparison that should be linear in time to
> >> figure out what changed.  OpenFlows will be processed incrementally
> >> in the same way as they are in the current patch set.  So, I'm
> >> not sure if the performance impact will be noticeable.  Again,
> >> someone probably needs to test that.
> >>
> > For performance, yes, the comparison is linear, but if N is very big,
> > the cost may still be high (we have much less tolerance for CPU cost in
> > ovn-controller than in ovn-northd). I agree a test would help to
> > understand the real impact.
> >
> > However, hopefully N shouldn't be so big. I had the impression that N
> > would be the same as the number of LBs, but on second thought I believe
> > it should be much less if configured properly. For the VIP, each
> > nodeport LB for the same node should share the same var; for the LB
> > port, the worst case is that each LB is assigned a port per node, but I
> > would expect the CMS to be smart enough to just assign the same ports
> > across all nodes, thus not even requiring the LB port to be included in
> > the template var.
> >
> > If that's the case, probably the performance concerns are less important
> > for either approach. The code complexity may still be considered.
> >
>
> Looking at the code that generates node port service load balancers in
> ovn-kubernetes [0] [1] indeed for the VIP the nodeport LB can use a
> single template variable for all load balancers.  For backends, on the
> other hand, it's a bit more complicated because in some cases
> (externalTrafficLocal/internalTrafficLocal) the set of backend IPs is
> different per node.  So it's not only the LB port that differs.
>
> In such cases we need a distinct backend template variable too.
>
> So, for a setup with N nodes and M node-port services, we end up with
> (worst case) N x M template variable records in the NB/SB.
>
> [0] https://github.com/ovn-org/ovn-kubernetes/blob/61bd7e49d2286bf59b4f2175bef3ba374fb1b99d/go-controller/pkg/ovn/controller/services/load_balancer.go#L79
> [1] https://github.com/ovn-org/ovn-kubernetes/blob/61bd7e49d2286bf59b4f2175bef3ba374fb1b99d/go-controller/pkg/ovn/controller/services/load_balancer.go#L299
>

Thanks for the pointers. The worst case N x M would be bad, but it seems
not really necessary in practice, and may be avoided by the CMS. For the
cases listed in this ovn-k8s code, most of the LBs should still be able to
share backends, except:
- backends with host network IPs - only the nodes with such backends should
have a special set of backends
- external/internal traffic policy - only the nodes with backends for the
service should have the LB (VIP=node IP, backend=local backend IP)
So, N x M would happen only if all services have backends running on all
the nodes, which is impossible when M is big (because each node has only
limited resources to run a limited number of backends).

Thanks,
Han

> >>>
> >>> So, for the above reasons, I tend to stay with the current schema,
> >>> unless:
> >>> 1) we are sure that handling huge map columns is as efficient as
> >>> separate small rows in OVSDB;
> >>
> >> Should not be a problem with ovsdb-server 2.17 or higher.
> >>
> >>> 2) we improve IDL change tracking to support tracking mutations of
> >>> map/set columns gracefully.
> >>
> >> I'm not sure if smap comparison is a huge problem as I said above,
> >> so I don't know. IDL re-parsing might cause issues though.
> >
> > Yes, thanks for reminding about the IDL cost.
> >
> >>
> >>> Thoughts?
> >>
> >> Anyway, not a strong opinion on my side.  Just a suggestion or
> >> something to think about. :)
> >
> > I don't have a strong opinion either. Probably we should also consider,
> > from the CMS point of view, which way is more client-friendly - managing
> > separate records for each var or combining them per chassis. @Tim Rozet
> > <[email protected]> @Girish Moodalbail <[email protected]>
>
> I still think in the worst case (many node port services with different
> sets of backends per node) the lack of proper I-P and IDL re-parsing
> cost in ovn-controller might make a difference.
>
> Tim, Girish, what are the worst-case scenarios we could expect with
> ovn-kubernetes?  I.e., how many nodes and how many node port services.
> And out of these, how many with external/internal traffic policy
> configured?
>
> >
> > Thanks,
> > Han
> >
>
> Thanks,
> Dumitru
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
