On 10/8/22 22:12, Dumitru Ceara wrote:
> On 10/7/22 23:16, Han Zhou wrote:
>> On Fri, Oct 7, 2022 at 12:23 PM Ilya Maximets <[email protected]> wrote:
>>>
>>> On 10/7/22 20:31, Han Zhou wrote:
>>>>
>>>>
>>>> On Fri, Oct 7, 2022 at 10:24 AM Ilya Maximets <[email protected]
>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> On 10/7/22 10:26, Dumitru Ceara wrote:
>>>>>> On 10/7/22 04:03, Han Zhou wrote:
>>>>>>> On Fri, Sep 30, 2022 at 7:02 AM Dumitru Ceara <[email protected]
>> <mailto:[email protected]>> wrote:
>>>>>>>>
>>>>>>>> Expand SB.Template_Var records in two stages:
>>>>>>>> 1. first expand them to local values in match/action strings
>>>>>>>> 2. then reparse the expanded strings
>>>>>>>>
>>>>>>>> For the case when a lflow references a Template_Var also track
>>>>>>>> references (similar to the ones maintained for multicast groups,
>> address
>>>>>>>> sets, port_groups, port bindings).
>>>>>>>>
>>>>>>>> Signed-off-by: Dumitru Ceara <[email protected] <mailto:
>> [email protected]>>
>>>>>>>
>>>>>>> Hi Dumitru,
>>>>>>>
>>>>>>
>>>>>> Hi Han,
>>>>>>
>>>>>>> In addition to the two-stage parsing concerns we are discussing in
>> the
>>>>>>
>>>>>> I'm doing some more targetted performance testing. I'll update the
>>>>>> other thread when I have more data to share.
>>>>>>
>>>>>>> other thread, I have some minor comments below. The major one is
>> whether we
>>>>>>> should allow matching hostname or not.
>>>>>>>
>>>>>>>> ---
>>>>>>>> controller/lflow.c | 59 +++++++-
>>>>>>>> controller/lflow.h | 1
>>>>>>>> controller/lport.c | 3
>>>>>>>> controller/ofctrl.c | 9 +
>>>>>>>> controller/ofctrl.h | 3
>>>>>>>> controller/ovn-controller.c | 317
>>>>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> include/ovn/expr.h | 4 -
>>>>>>>> include/ovn/lex.h | 14 +-
>>>>>>>> lib/actions.c | 9 +
>>>>>>>> lib/expr.c | 18 ++
>>>>>>>> lib/lex.c | 55 +++++++
>>>>>>>> lib/objdep.c | 1
>>>>>>>> lib/objdep.h | 1
>>>>>>>> lib/ovn-util.c | 7 +
>>>>>>>> lib/ovn-util.h | 3
>>>>>>>> tests/ovn.at <http://ovn.at> | 2
>>>>>>>> tests/test-ovn.c | 16 ++
>>>>>>>> utilities/ovn-trace.c | 26 +++-
>>>>>>>> 18 files changed, 512 insertions(+), 36 deletions(-)
>>>>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>>>> +struct ed_type_template_vars {
>>>>>>>> + struct local_templates var_table;
>>>>>>>> +
>>>>>>>> + bool change_tracked;
>>>>>>>> + struct sset new;
>>>>>>>> + struct sset deleted;
>>>>>>>> + struct sset updated;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +static void
>>>>>>>> +template_vars_init(const struct sbrec_template_var_table
>> *tv_table,
>>>>>>>> + const struct sbrec_chassis *chassis,
>>>>>>>> + struct local_templates *var_table)
>>>>>>>> +{
>>>>>>>> + const struct sbrec_template_var *tv;
>>>>>>>> + SBREC_TEMPLATE_VAR_TABLE_FOR_EACH (tv, tv_table) {
>>>>>>>> + if (chassis_name_equals(tv->chassis_name, chassis)) {
>>>>>>>
>>>>>>> I am not sure if it is a good idea to allow using hostname to match
>> the
>>>>>>> template var name. It provides flexibility to CMS, but we will need
>> more
>>>>>>> complexity to protect against corner cases.
>>>>>>> For example, if there are two records:
>>>>>>> r1: name="abc", value="v1"
>>>>>>> r2: hostname="abc", value="v2"
>>>>>>>
>>>>>>> Now with the current logic, whichever is handled later will take
>> precedence
>>>>>>> (in the local_templates.vars) and the value will be used (assume r2
>> "v2" is
>>>>>>> used). This may be fine, because the user should be responsible for
>> the
>>>>>>> inconsistent configurations.
>>>>>>>
>>>>>>> Later, when the user detects the problem and wants to correct the
>>>>>>> configuration. He/she deletes the r2 and expects the var "abc" to be
>>>>>>> expanded as "v1". But the logic in template_vars_update() would call
>>>>>>> local_templates_remove() which simply deletes the var ("abc" ->
>> "v2")
>>>>>>> instead of replacing it with ("abc" -> "v1"). The uuid of "abc" ->
>> "v1"
>>>>>>> will still be left in the uuidset, which is useless. This is an
>> unexpected
>>>>>>> behavior.
>>>>>>>
>>>>>>> Similar behavior would happen if there are duplicate hostnames,
>> e.g.:
>>>>>>> r1: hostname="abc", value="v1"
>>>>>>> r2: hostname="abc", value="v2"
>>>>>>>
>>>>>>
>>>>>> Very good point, nice catch!
>>>>>
>>>>> In general it might make sense to choose a bit different database
>> schema,
>>>>> e.g.:
>>>>>
>>>>> "Chassis_Template_Vars": {
>>>>> "columns": {
>>>>> "chassis": {"type": "string"},
>>>>> "variables": {
>>>>> "type": {"key": "string", "value": "string",
>>>>> "min": 0, "max": "unlimited"}},
>>>>> "indexes": [["chassis"]],
>>>>> "isRoot": true}
>>>>>
>>>>> Here 'variables' or 'templates' or whatever you want to call it is a
>>>>> Var->Value map.
>>>>>
>>>>> Index on the 'chassis' column will provide uniqueness of chassis names,
>>>>> map has unique keys, so all variable names are unique within a chassis
>>>>> as well. This should cover all the possible misconfigurations on the
>>>>> database level.
>>>>>
>>>>> As a bonus we will also save a lot of database space by not storing
>>>>> millions of copies of chassis and variable names. May speed up the
>>>>> Nb->Sb synchronization as well.
>>>>>
>>>>> One downside is that we can't have true I-P for that table due to
>>>>> inability to track which values in a map actually changed. Though
>>>>> it should be possible to figure out the diff in a semi-linear time
>>>>> from the number of variables. OpenFlow rules can still be processed
>>>>> incrementally after that. So, I'm not sure if that is a big
>> performance
>>>>> concern. Testing is needed, I guess.
>>>>>
>>>>> Conditional monitoring will be very simple. Chassis will receive only
>>>>> one row in most cases. update2 will cover variable updates.
>>>>>
>>>>> What do you think?
>
> Interesting, thanks for the suggestion!
>
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>>>> Thanks Ilya for the idea. I think it is definitely a good candidate to
>> be evaluated, but I am slightly in favor of the original schema, and here
>> is why.
>>>> Firstly, I think the benefit of avoiding misconfigurations is somehow
>> not that important, if we agree that it is good enough to stick with
>> "chassis_name" and don't allow hostname matching.
>>>
>>> To be clear I wasn't arguing about hostnames, I don't think they are
>>> needed. It was a suggestion for a general look of a schema regardless.
>>>
>>>>
>>>> So the more important factor to consider is actually the performance.
>> On this front, the upside is, as you mentioned, the size of the table would
>> be smaller. However, I don't see it as a very strong benefit. The only
>> redundant part here is the chassis_name, which is usually an uuid, not too
>> short but also not too long.
>>>
>>> UUID takes 38 bytes. 250node setup with 10K vars will result
>>> in about 100 MB overhead. It is noticeable. Not a huge amount,
>>> I agree, but fairly noticeable.
>>>
>>>> I agree that the waste of space for this can be big if there are tens of
>> thousands of vars per chassis, but compared with the logical flow table
>> size this is still a very small portion of the total size. In addition, the
>> DB sync matters mostly at a chassis startup, and then each incremental
>> change should be very small.
>>>
>>> I was thinking about Nb->Sb sync on northd. We don't really have
>>> any real I-P there (or did I missed it in the patch set? I didn't
>>> look very close), so northd will need to re-check all that on
>>> every iteration. Since this scales up very fast with the number
>>> of nodes, I'm concerned about northd performance here.
>>>
>> Ok, makes sense. Sorry that I didn't pay attention that you were talking
>> about nb->sb instead of sb->hv.
>>
>
> Chassis template variables are completely unrelated to any other tables
> populated by northd. So I think it's (relatively) easy to add I-P in
> northd for chassis template var processing.
>
>>>>
>>>> On the other hand, I am more concerned with the OVSDB performance for
>> handling the huge map/set column data. I understand that there have been
>> lots of improvements made for the mutations, but I wonder if it is still
>> less efficient than simple CRUDs for very small rows.
>>>
>>> Almost. I'm not concerned about performance of ovsdb-server.
>>> We're frequently running large scale tests with ovn-heater with
>>> tens of thousands load-balancers in a single group, for example,
>>> and ovsdb-server is barely loaded.
>>>
>>>> It *might* also be easier to further optimize the row based operations
>> (such as creating index for more efficient conditional monitor updates)
>> than for big columns.
>>>>
>>>> And of course, a strong negative side as you mentioned, is the I-P
>> handling, which is definitely important. The current patch shows great
>> simplicity in I-P compared with the address-set I-P that had a lot of
>> complexity and overhead just for preprocessing and tracking the changes.
>>>
>>> It's a simple smap comparison that should be linear in time to
>>> figure out what changed. OpenFlows will be processed incrementally
>>> in the same way as they are in the current patch set. So, I'm
>>> not sure if the performance impact will be noticeable. Again,
>>> someone probably need to test that.
>>>
>> For performance, yes, the comparison is linear, but if N is very big, the
>> cost may still be high (we have much less tolerance of CPU cost in
>> ovn-controller than in ovn-northd). I agree a test would help to understand
>> the real impact.
>>
>> However, hopefully N shouldn't be so big. I had the impression that N would
>> be the same as the number of LBs, but with a second thought I believe it
>> should be much less if configured properly. For VIP, each nodeport LB for
>> the same node should share the same var; for LB port, the worst case is
>> that each LB is assigned a port per node, but I wonder CMS should be smart
>> enough to just assign the same ports across all nodes thus not even require
>> the LB port to be included in the template var.
>>
>> If that's the case, probably the performance concerns are less important
>> for either approach. The code complexity may still be considered.
>>
>
> Looking at the code that generates node port service load balancers in
> ovn-kubernetes [0] [1] indeed for the VIP the nodeport LB can use a
> single template variable for all load balancers. For backends, on the
> other hand, it's a bit more complicated because in some cases
> (externalTrafficLocal/internalTrafficLocal) the set of backend IPs is
> different per node. So it's not only the LB port that differs.
>
> In such cases we need a distinct backend template variable too.
>
> So, for a setup with N nodes and M node-port services, we end up with
> (worst case) N x M template variable records in the NB/SB.
>
> [0]
> https://github.com/ovn-org/ovn-kubernetes/blob/61bd7e49d2286bf59b4f2175bef3ba374fb1b99d/go-controller/pkg/ovn/controller/services/load_balancer.go#L79
> [1]
> https://github.com/ovn-org/ovn-kubernetes/blob/61bd7e49d2286bf59b4f2175bef3ba374fb1b99d/go-controller/pkg/ovn/controller/services/load_balancer.go#L299
>
>>>>
>>>> So, for the above reasons, I tend to stay with the current schema,
>> unless
>>>> 1) we are sure that handling huge map columns is as efficient as
>> separate small rows in OVSDB;
>>>
>>> Should not be a problem with ovsdb-server 2.17 or higher.
>>>
>>>> 2) we improve IDL change tracking to support tracking mutations of
>> map/set columns gracefully.
>>>
>>> I'm not sure if smap comparison is a huge problem as I said above,
>>> so I don't know. IDL re-parsing might cause issues though.
>>
>> Yes, thanks for reminding about the IDL cost.
>>
>>>
>>>> Thoughts?
>>>
>>> Anyway, not a strong opinion on my side. Just a suggestion or
>>> something to think about. :)
>>
>> I don't have a strong opinion either. Probably we should also consider from
>> CMS point of view, which way is more client friendly - managing separate
>> records for each var or combining them per chassis. @Tim Rozet
>> <[email protected]> @Girish Moodalbail <[email protected]>
>
> I still think in the worst case (many node port services with different
> sets of backends per node) the lack of proper I-P and IDL re-parsing
> cost in ovn-controller might make a difference.
>
> Tim, Girish what are worst case scenarios we could expect with
> ovn-kubernetes? I.e., how many nodes and how many node port services.
> And out of these how many with external/internal traffic policy configured?
>
I updated my benchmark scripts (I'll try to share them in full in the
next revision of the series) and I got the following results (simulating
a scaled ovn-kubernetes deployment with node-port services and distinct
sets of backends for each node-port service on every chassis):
----------------------------------------------------------------------------------------------------------------------------------------------------
| | | | NB | SB |
| northd | ovn-controller |
----------------------------------------------------------------------------------------------------------------------------------------------------
| Template Type | Nodes | NodePort Services | Size(MB) | RSS(MB) | Size(MB) |
RSS(MB) | loop(sec) | RSS (MB) | recompute(sec) | template I-P(sec) |
----------------------------------------------------------------------------------------------------------------------------------------------------
| None | 120 | 2000 | 67 | 865 | 471 |
9000 | 15.6 | 1016 | 0.4 | 0 |
| Expanded | 120 | 2000 | 42 | 435 | 47 |
972 | 0.52 | 83 | 0.4 | 0.001 |
| Map | 120 | 2000 | 23 | 96 | 28 |
225 | 0.22 | 83 | 0.4 | 0.004 |
----------------------------------------------------------------------------------------------------------------------------------------------------
| Expanded | 120 | 10000 | 212 | 2153 | 230 |
4675 | 3.32 | 308 | 2.042 | 0.001 |
| Map | 120 | 10000 | 118 | 440 | 136 |
668 | 0.72 | 311 | 1.77 | 0.021 |
----------------------------------------------------------------------------------------------------------------------------------------------------
| Expanded | 250 | 10000 | 440 | 4442 | 460 |
9327 | 6.5 | 321 | 1.87 | 0.001 |
| Map | 250 | 10000 | 244 | 870 | 263 |
1502 | 1.26 | 318 | 1.87 | 0.023 |
----------------------------------------------------------------------------------------------------------------------------------------------------
"Expanded" is the schema I had originally suggested in v1. "Map" is the
schema Ilya suggested (single map per chassis). "Map" is the clear
winner here so I'll go for that in v2. Also, just a note, I only
implemented the brute-force template variable sync in ovn-northd
but we can easily add support for I-P there to reduce the duration
of the northd loop when updating template variables.
Regards,
Dumitru
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev