On Fri, Sep 23, 2022 at 8:10 AM Dumitru Ceara <[email protected]> wrote: > > On 9/22/22 19:55, Han Zhou wrote: > > On Thu, Sep 22, 2022 at 10:38 AM Han Zhou <[email protected]> wrote: > >> > >> > >> > >> On Thu, Sep 22, 2022 at 1:00 AM Dumitru Ceara <[email protected]> wrote: > >>> > >>> Hi Han, > >>> > >>> On 9/21/22 23:06, Han Zhou wrote: > >>>> Thanks Dumitru for this promising optimization! > >>>> > >>> > >>> Thanks for checking it out! > >>> > >>>> On Thu, Aug 11, 2022 at 1:03 AM Dumitru Ceara <[email protected]> > > wrote: > >>>>> > >>>>> On 8/10/22 19:54, Mark Michelson wrote: > >>>>>> Hi Dumitru, > >>>>>> > >>>>> > >>>>> Hi Mark, > >>>>> > >>>>>> I read the patch series, and I think the idea of chassis-specific > >>>>>> variables is a good idea to reduce the number of DB records for > > certain > >>>>>> things. Aside from load balancers, I suspect this could have a > > positive > >>>>>> impact for other structures as well. > >>>>>> > >>>>> > >>>>> Thanks for taking a look! Yes, I think this might be applicable to > >>>>> other structures too. > >>>>> > >>>> > >>>> I think it is a good idea to make it more generic, but for my > > understanding > >>>> this template concept is for anything that's "node/chassis" specific, > > and > >>>> supposed to be instantiated at chassis level. Probably we should name > > the > >>>> DB tables as something like chassis_template_var. > >>>> > >>> > >>> If we decide to go for a simple "chassis" column (instead of a generic > >>> "match" column) then naming the table Chassis_Template_Var makes sense > >>> indeed. > >>> > >>>>>> Rather than criticizing the individual lines of code, I'll focus > > instead > >>>>>> on some higher-level questions/ideas. > >>>>>> > >>>>> > >>>>> Sure, thanks! :) > >>>>> > >>>>>> First, one question I had was what happens when a template variable > > name > >>>>>> is used in a load balancer, but there is no appropriate value to > >>>>>> substitute? For instance, what if a load balancer applies to > > chassis-3, > >>>>>> but you only have template variables for chassis-1 and chassis-2? > > This > >>>>>> might be addressed in the code but I didn't notice if it was. > >>>>>> > >>>>> > >>>>> There are actually two things to consider here: > >>>>> > >>>>> 1. there might be a logical flow that uses a template variable: in > > this > >>>>> case if template expansion/instantiation fails we currently leave the > >>>>> token untouched (e.g., '^variable' stays '^variable'). That will > > cause > >>>>> the flow action/match parsing to fail and currently logs a warning. > > The > >>>>> flow itself is skipped, as it should be. We probably need to avoid > >>>>> logging a warning though. > >>>>> > >>>>> 2. like you pointed out, there might be a load balancer using > > templates > >>>>> in its backends/vips: if some of those templates cannot be > > instantiated > >>>>> locally the backend/vip where they're added is skipped. Unless I > > missed > >>>>> something, the code should already do that. > >>>>> > >>>>>> Second, it seems like template variables are a natural extension of > >>>>>> existing concepts like address sets and port groups. In those cases, > >>>>>> they were an unconditional collection of IP addresses or ports. For > >>>>> > >>>>> You're right to some extent template variables are similar to port > >>>>> groups. The southbound database port group table splits the > > northbound > >>>>> port group per datapath though not per chassis like template > > variables. > >>>>> > >>>>>> template variables, they're a collection of untyped values with the > >>>>>> condition of only applying on certain Chassis. I wonder if this > > could > >>>>>> all be reconciled with a single table that uses untyped values with > >>>>>> user-specified conditions. Right now template variables have a > > "Chassis" > >>>>>> column, but maybe this could be replaced with a broader "condition", > >>>>>> "when", or "match" column. To get something integrated quickly, this > >>>>>> column could just accept the syntax of "chassis.name == <blah>" or > >>>>>> "chassis.uuid == <blah>" to allow for chassis-specific application > > of > >>>>>> the values. With this foundation, we could eventually allow > >>>>>> unconditional application of the value, or more complex conditions > > (e.g. > >>>>>> only apply to logical switch ports that are connected to a router > > with a > >>>>>> distributed gateway port). Doing this, we could deprecate address > > sets > >>>>>> and port groups eventually in favor of template variables. > >>>>> > >>>>> This sounds like a good idea to me. I wasn't too happy with the > >>>>> "chassis" string column of the Template_Var table anyway. A generic > >>>>> condition field makes more sense. > >>>>> > >>>> If it is chassis-specific template, a column "chassis" seems to be > >>>> straightforward. With a "match" column it is another burden of parsing > >>> > >>> I have a generic implementation (with a "predicate" column) almost ready > >>> for review. I agree it's a bit more work to parse and maintain > >>> references. I think it's probably best to discuss these details once I > >>> post v1. It's no problem for me to go back to the "chassis" column > >>> version if we decide to use that approach. > >>> > >>>> (which is costly and error prone). In addition, the LB object (or > > other > >>>> structures) is not a logical-flow, and it doesn't directly map to > >>>> logical-flows (unlike ACLs), so I didn't understand how would a match > >>>> string be applied to the template. Is there a more detailed example of > >>>> this? Maybe I am missing something, and hope we will see more details > > in > >>>> the formal patch. > >>>> > >>> > >>> Load balancer VIPs are a collection of key-value pairs (vip:backends). > >>> These are currently IPs (and optionally L4 ports). We can allow LBs to > >>> support "templated" VIPs/backends. For example (this is with the WIP v1 > >>> version of the code): > >>> > >>> # Create a template LB. > >>> ovn-nbctl --template lb-add lb-test "^vip:^vport" "^backends" tcp \ > >>> -- ls-lb-add ls1 lb-test \ > >>> -- lr-lb-add rtr lb-test > >>> > >>> # Instantiate the LB template variables, for the chassis where port > >>> # vm1 is bound: > >>> ovn-nbctl \ > >>> -- create template_var name=vip value=66.66.66.66 \ > >>> predicate="is_chassis_resident(\"vm1\")" \ > >>> -- create template_var name=vport value=666 \ > >>> predicate="is_chassis_resident(\"vm1\")" \ > >>> -- create template_var name=backends value=\"42.42.42.2:4242\" \ > >>> predicate="is_chassis_resident(\"vm1\")" > >>> > >>> This is equivalent to: > >>> ovn-nbctl lb-add lb-test 66.66.66.66:666 42.42.42.2:4242 tcp > >>> ovn-nbctl ls-lb-add ls1 lb-test > >>> ovn-nbctl lr-lb-add lr1 lb-test > >>> > >> > >> Thanks, this helps. I guess we wouldn't want to use backends as variables > > in the real use cases, right? Otherwise, creating backends value for each > > chassis sounds not helpful for the scale. So, if I understand correctly, > > what we would do is: > >> > >> # Create a template LB. > >> ovn-nbctl --template lb-add lb-test "^vip:<the real port>" "<real > > backends>" tcp \ > >> -- ls-lb-add ls1 lb-test \ > >> -- lr-lb-add rtr lb-test > >> > >> # Instantiate the LB template variables, for the chassis where DGPs are > > bound: > >> ovn-nbctl \ > >> -- create template_var name=vip value=66.66.66.66 \ > >> predicate="is_chassis_resident(\"lrp1\")" \ > >> -- create template_var name=vip value=66.66.66.67 \ > >> predicate="is_chassis_resident(\"lrp2\")" \ > >> -- create template_var name=vip value=66.66.66.68 \ > >> predicate="is_chassis_resident(\"lrp3\")" > >> > > It can be that we need to use backends as variables too. In the worst > case, we still probably end up using less resources than with explicit > load balancers. But yes, very likely in a lot of cases with ovn-k8s > deployments we'll be able to define load balancers like you did above. > > >> Now that in ovn-k8s since we use DGP to pin LS to chassis, the lrp1, 2, 3 > > resides on different nodes, so the "vip" var is instantiated to the > > chassis-specific value. So this effectively binds the LB VIP to chassis > > without explicit "chassis" column. Did I understand the magic correctly? > > That's the idea, yes. > > >> If so, the predicate needs to be in logical flow match format, and the > > string will be directly used as match condition incorporated to related > > logical flows, right? One problem though, is that for DGPs, > > is_chassis_resident needs to evaluate the CR port "cr-xxx", which is > > internally constructed by OVN, and is not exposed to CMS. Once the > > implementation changes, the predicate wouldn't work. > > True. I was thinking of using the logical router port of the GW router > in ovn-k8s though. > > >> (I am looking forward to seeing your v1, but I think this discussion > > should help me understand the concept before going to the review :) > >> > > > > Sorry, one more point. With the "chassis" column, it is easy for > > conditional monitoring, considering that only a small number of values are > > useful for a specific chassis. The predicate alone wouldn't tell this > > This is a very good point. Conditional monitoring will be hard/impossible. > > > information. If we go with the predicate approach, I think it may be > > helpful to add another column to tell the datapath information (in NB, use > > LS/LR), so that it can be converted to DP in SB and be conditionally > > monitored (reusing the existing ovn-controller "local datapath" mechanism). > > It can be "any", to indicate something that applies to multiple/all DPs, if > > useful for the future use cases. > > I wonder if this is not too complex. Maybe we should just go to the > "chassis" column approach instead. To argue against chassis not being a > NB OVN entity, we already have NB tables that refer to chassis: > HA_Chassis_Group, HA_Chassis, Gateway_Chassis. > I am not too worried about the chassis not being a NB entity either. I think it is more of a tradeoff between flexibility and simplicity. The predicate approach can support something not related to chassis, such as: "instantiate the backends to value abc for ip.src == xxx". Maybe it is not a meaningful example, and I am not really sure if it is something useful in the future at all. So, just my 2 cents - for the purpose of this series, if we don't want to make it too complex, maybe "chassis" is a better option. If in the future we do need more flexibility, we can still introduce the predicate, although we may also need to take care of the backward compatibility/upgrade, if that's not too bad. Of course maybe I am short-sighted. I'd like to hear if anyone comes up with some realistic use cases that require the flexible approach, and then we will need to figure out a way for cond-monitoring.
Thanks, Han > > > >> Thanks, > >> Han > >> > >>>>> Regarding deprecating and replacing address sets and port groups, I'm > >>>>> not sure how easy that would be but we can try it when we get to that > >>>> point. > >>>>> > >>>> > >>>> Address sets and port groups are something different in my view. > > Although > >>>> they can be treated as variables in a template, they are not really > >>>> chassis-specific, and each variable needs to be instantiated to a big > >>>> number of instances (sometimes huge). For this reason, fine-grained > > I-P > >>>> embedded in the expression parsing (for Address set) was introduced > > for the > >>>> performance of ovn-controller. Maybe we can say there are still some > >>>> similarities of templates, but I am not really sure if it is really > > helpful > >>>> to generalize them and how difficult it would be. > >>>> > >>> > >>> It doesn't look straight forward indeed. If everyone agrees, I'd > >>> suggest discussing this option (replacing AS and PG) only after/if the > >>> template support is accepted. > >>> > >>>> Thanks, > >>>> Han > >>>> > >>> > >>> Thanks, > >>> Dumitru > >>> > >>>>>> > >>>>>> Third, I was wondering if there could be some layer that exists > > between > >>>>>> the IDL and the application that expands the template variables as > > early > >>>>>> as possible. I'm thinking the application could inject some > > callback in > >>>>>> the IDL layer that might allow for the values to be substituted. > > This > >>>>>> way, the variable substitution is taken care of in a single place, > > and > >>>>>> by the time the application gets the data, it knows that all > >>>>>> substitutions have been made and there is no need to special case > >>>>>> template variable names vs. plain tokens. They should all be plain > >>>>>> tokens. I don't think construction of such a layer should be a > > barrier > >>>>>> to merging the code, but it's something worth considering as a later > >>>>>> improvement. > >>>>> > >>>>> This could work. I'll think more about it. But like you said, it's > >>>>> probably a longer term goal. It will need some significant changes > > in > >>>>> the IDL layer (e.g., to re-evaluate some records when template > >>>>> instantiations change). > >>>>> > >>>>>> > >>>>>> Anyway, those were my high-level thoughts on the topic. Let me know > > what > >>>>>> you think. > >>>>>> > >>>>> > >>>>> I can work on changing the Template_Var schema to add a broader way > > of > >>>>> specifying conditions (when/match/etc). I'm already working on > > adding > >>>>> proper nbctl support for templated load balancers and trying to > > tackle > >>>>> the rest of the todos. I can probably send a v1 sometime in the > > first > >>>>> half of next week. Do you want to share any specific code related > >>>>> comments that I should already integrate or shall we start a proper > >>>>> review when v1 gets posted? > >>>>> > >>>>> Thanks again for your input on this RFC! > >>>>> > >>>>> Regards, > >>>>> Dumitru > >>>>> > >>>>>> On 8/5/22 12:26, Dumitru Ceara wrote: > >>>>>>> Sometimes network components are compute node-specific. Sometimes > > such > >>>>>>> components are replicated, almost identically, for multiple nodes > >>>>>>> in the cluster. > >>>>>>> > >>>>>>> One such example is the case of Kubernetes NodePort services which > >>>>>>> translate (in the ovn-kubernetes case) to Load_Balancer > >>>>>>> objects being applied to each and every node's logical gateway > > router. > >>>>>>> These load balancers are almost identical, the main difference > > being > >>>>>>> the fact that they use different VIPs (the node's IP). > >>>>>>> > >>>>>>> With the current OVN load balancer design, this becomes a problem > > at > >>>>>>> scale because the number of load balancers that must be configured > > is > >>>>>>> N x M (N nodes times M services). > >>>>>>> > >>>>>>> This series proposes a new concept in OVN: virtual network > > component > >>>>>>> templates. The goal of the templates is to help reduce resource > >>>>>>> consumption in the OVN central components in specific cases like > > the > >>>> one > >>>>>>> described above. > >>>>>>> > >>>>>>> To achieve that, the CMS will instead configure a "templated" load > >>>>>>> balancer for every service and apply that single template record to > >>>>>>> the cluster-wide load balancer group. This template is then > >>>>>>> instantiated differently on different compute nodes. This > > translation > >>>>>>> is controlled through per-chassis "template variables" configured > > by > >>>>>>> the CMS in the new NB.Template_Var table. > >>>>>>> > >>>>>>> A syntetic benchmark simulating what an OpenShift router (using > > Node > >>>>>>> Port services) scale test would do shows the following preliminary > >>>>>>> results: > >>>>>>> A. 120 node, 2K NodePort services: > >>>>>>> - before: > >>>>>>> - Southbound DB size on disk (compacted): ~385MB > >>>>>>> - Southbound DB memory usage (RSS): ~3GB > >>>>>>> - Southbound DB logical flows: 720K > >>>>>>> > >>>>>>> - after: > >>>>>>> - Southbound DB size on disk (compacted): ~100MB > >>>>>>> - Southbound DB memory usage (RSS): ~250MB > >>>>>>> - Southbound DB logical flows: 6K > >>>>>>> > >>>>>>> B. 250 node, 2K NodePort services: > >>>>>>> - after (didn't run the "before" test as it was taking way too > > long): > >>>>>>> - Southbound DB size on disk (compacted): ~155MB > >>>>>>> - Southbound DB memory usage (RSS): ~760MB > >>>>>>> - Southbound DB logical flows: 6K > >>>>>>> > >>>>>>> The series is sent as RFC because there's still the need to add > >>>>>>> some template specific unit tests and the "ovn-nbctl lb-*" helper > >>>>>>> utilities need to be adapted to support templated load balancers. > >>>>>>> > >>>>>>> With these two items addressed the code is self can likely qualify > >>>>>>> for acceptance as a new feature in the upcoming release. > >>>>>>> > >>>>>>> There also exists a more extensive TODO list (also listed in the > > commit > >>>>>>> log of every patch in the series for now) but these are mainly load > >>>>>>> balancer related functionalities that are not yet implemented for > >>>>>>> templated load balancers but can definitely be implemented as > > follow > >>>> ups: > >>>>>>> - No support for LB health check if the LB is templated. > >>>>>>> - No support for VIP ARP responder if the LB is templated. > >>>>>>> - No support for routed VIPs if the LB is templated. > >>>>>>> - Figure out a way to deal with templates in ovn-trace > >>>>>>> - Determine if we need to allow Template_Var to match against > > chassis > >>>>>>> hostname or other IDs. > >>>>>>> - Make ofctrl_inject_pkt() work with template_vars. > >>>>>>> - Make test-ovn work with template_vars. > >>>>>>> > >>>>>>> A basic example of how to configure a templated load balancer > > follows: > >>>>>>> $ ovn-nbctl create load_balancer name=lb-test \ > >>>>>>> protocol=tcp options:template=true \ > >>>>>>> vips:\"^vip:4200\"="^backends" > >>>>>>> > >>>>>>> $ ovn-nbctl ls-add ls > >>>>>>> $ ovn-nbctl ls-lb-add ls lb-test > >>>>>>> > >>>>>>> # Instantiate the load balancer on chassis-1 > >>>>>>> $ ovn-nbctl create template_var name=vip value=80.80.80.1 > >>>>>>> chassis=chassis-1 > >>>>>>> $ ovn-nbctl create template_var name=backends > >>>>>>> value='"42.42.42.1:1000"' chassis=chassis-1 > >>>>>>> > >>>>>>> # Instantiate the load balancer on chassis-2 > >>>>>>> $ ovn-nbctl create template_var name=vip value=80.80.80.2 > >>>>>>> chassis=chassis-2 > >>>>>>> $ ovn-nbctl create template_var name=backends > >>>>>>> value='"42.42.42.2:1000"' chassis=chassis-2 > >>>>>>> > >>>>>>> Dumitru Ceara (5): > >>>>>>> Add NB and SB Template_Var tables. > >>>>>>> controller: Add support for templated actions and matches. > >>>>>>> controller: Make resource references more generic. > >>>>>>> lb: Support using templates. > >>>>>>> controller: Add Template_Var <- LB references. > >>>>>>> > >>>>>>> > >>>>>>> controller/lflow.c | 248 > > ++++++++++++++++++++++++-------- > >>>>>>> controller/lflow.h | 98 +++++++------ > >>>>>>> controller/ofctrl.c | 9 +- > >>>>>>> controller/ofctrl.h | 3 +- > >>>>>>> controller/ovn-controller.c | 277 > >>>> ++++++++++++++++++++++++++++++++++-- > >>>>>>> include/ovn/expr.h | 4 +- > >>>>>>> include/ovn/lex.h | 14 +- > >>>>>>> lib/actions.c | 9 +- > >>>>>>> lib/expr.c | 14 +- > >>>>>>> lib/lb.c | 201 ++++++++++++++++++++++---- > >>>>>>> lib/lb.h | 36 +++-- > >>>>>>> lib/lex.c | 55 +++++++ > >>>>>>> northd/northd.c | 64 +++++---- > >>>>>>> tests/ovn.at | 2 +- > >>>>>>> tests/test-ovn.c | 16 ++- > >>>>>>> utilities/ovn-trace.c | 26 +++- > >>>>>>> 16 files changed, 869 insertions(+), 207 deletions(-) > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> dev mailing list > >>>>>>> [email protected] > >>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>>>>>> > >>>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> dev mailing list > >>>>> [email protected] > >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>>> > >>> > > > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
