On 7/5/23 18:00, Felix Huettner wrote:
> Hi Han,
> On Fri, Jun 30, 2023 at 05:08:36PM -0700, Han Zhou wrote:
>> On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
>> ovs-discuss@openvswitch.org> wrote:
>>> Hi Ilya,
>>> thank you for the detailed reply
>>> On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
>>>> On 5/23/23 15:59, Felix Hüttner via discuss wrote:
>>>>> Hi everyone,
>>>> Hi, Felix.
>>>>> we are currently running an OVN deployment with 450 nodes. We run a
>>>>> 3-node cluster for the northbound database and a 3-node cluster for
>>>>> the southbound database.
>>>>> Between the southbound cluster and the ovn-controllers we have a
>>>>> layer of 24 ovsdb relays.
>>>>> The setup is using TLS for all connections, however the TLS server is
>>>>> handled by a traefik reverse proxy to offload this from the ovsdb
>>>> The very important part of the system description is what versions
>>>> of OVS and OVN are you using in this setup?  If it's not latest
>>>> 3.1 and 23.03, then it's hard to talk about what/if performance
>>>> improvements are actually needed.
>>> We are currently running ovs 3.1 and ovn 22.12 (in the process of
>>> upgrading to 23.03). `monitor-all` is currently disabled, but we want to
>>> try that as well.
>> Hi Felix, did you try upgrading and enabling "monitor-all"? How does it
>> look now?
> We have not upgraded yet, but we tried monitor-all and that provided a
> big benefit in terms of stability.
>>>>> Northd and Neutron are connecting directly to the north- and
>>>>> southbound databases without the relays.
>>>> One of the big things that is annoying is that Neutron connects to
>>>> Southbound database at all.  There are some reasons to do that,
>>>> but ideally that should be avoided.  I know that in the past limiting
>>>> the number of metadata agents was one of the mitigation strategies
>>>> for scaling issues.  Also, why can't it connect to relays?  There
>>>> shouldn't be too many transactions flowing towards Southbound DB
>>>> from the Neutron.
>>> Thanks for that suggestion, that definitely makes sense.
>> Does this make a big difference? How many Neutron - SB connections are
>> there?
>> What rings a bell is that Neutron is using the python OVSDB library which
>> hasn't implemented the fast-resync feature (if I remember correctly).
>> At the same time, there is the leader-transfer-for-snapshot feature, which
>> automatically transfers leadership whenever a snapshot is to be written;
>> this would happen frequently if your environment is very active.
>> If Neutron sets the "leader-only" option (connect only to the leader) for
>> the SB DB (could someone confirm?), then whenever a leader transfer
>> happens, all Neutron workers would reconnect to the new leader. With
>> fast-resync, as implemented in the C IDL and Go, a client that has cached
>> the data requests only the delta when reconnecting. But since the python
>> lib doesn't have this, the Neutron server would re-download the full data
>> when reconnecting ...
>> This is speculation based on the information I have, and the assumptions
>> need to be confirmed.
> We are currently working with upstream neutron to get the leader-only
> flag removed wherever we can. I guess the total number of connections
> depends on the process count, which would be ~150 connections in our
> case.
>>>>> We needed to increase various timeouts on the ovsdb-server and client
>>>>> side to get this to a mostly stable state:
>>>>> * inactivity probes of 60 seconds (for all connections between
>>>>>   ovsdb-server, relay and clients)
>>>>> * cluster election time of 50 seconds
>>>>> As long as none of the relays restarts the environment is quite
>>>>> stable.
>>>>> However we see quite regularly the "Unreasonably long xxx ms poll
>>>>> interval" messages ranging from 1000ms up to 40000ms.
>>>> With latest versions of OVS/OVN the CPU usage on Southbound DB
>>>> servers without relays in our weekly 500-node ovn-heater runs
>>>> stays below 10% during the test phase.  No large poll intervals
>>>> are getting registered.
>>>> Do you have more details on under which circumstances these
>>>> large poll intervals occur?
>>> It seems to mostly happen on the initial connection of some client to
>>> the ovsdb. From the few times we ran perf there it looks like the time
>>> is spent in creating a monitor and, during that, sending out the updates
>>> to the client side.
>> It is one of the worst-case scenarios for OVSDB when many clients
>> initialize connections to it at the same time and the size of the data
>> downloaded by each client is big.
>> OVSDB relay, from what I understand, should greatly help with this. You
>> have 24 relay nodes, which are supposed to share the burden. Are the SB DB
>> and the relay instances running with sufficient CPU resources?
>> Is it clear which clients' initial connections (ovn-controller or
>> Neutron) are causing this? If it is Neutron, the above speculation about
>> the lack of fast-resync from Neutron workers may be worth checking.
> Yep, the relays really help. What I meant above was actually the relays
> failing because of overload. After that they restart and reconnect, which
> causes a lot of load on the main ovsdbs.

FWIW, the initial connection after restart should cause way less trouble
with the latest v3.1.2 release.
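
A rough way to compare behaviour before and after an upgrade is to count
the "Unreasonably long ... poll interval" warnings in the ovsdb-server log.
The Python sketch below is only an illustration: the regular expression is
an assumption based on the message wording quoted earlier in this thread,
and the function names and the 5000 ms bucket size are hypothetical.

```python
import re
from collections import Counter

# Assumed message shape, taken from the wording quoted in this thread;
# optional whitespace before "ms" is tolerated.
POLL_RE = re.compile(r"Unreasonably long (\d+)\s*ms poll interval")


def long_poll_intervals(lines):
    """Return the poll interval durations (in ms) found in the given lines."""
    durations = []
    for line in lines:
        match = POLL_RE.search(line)
        if match:
            durations.append(int(match.group(1)))
    return durations


def summarize(durations_ms, bucket_ms=5000):
    """Count intervals per bucket; each key is the bucket's lower bound in ms."""
    counts = Counter((d // bucket_ms) * bucket_ms for d in durations_ms)
    return dict(sorted(counts.items()))


def scan_file(path):
    """Summarize the long poll intervals logged in one log file."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return summarize(long_poll_intervals(log))
```

Calling scan_file() on a log from before and after the upgrade (the path
depends on the installation) gives two histograms that can be compared
directly.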

Best regards, Ilya Maximets.