On Fri, Jul 7, 2023 at 1:21 PM Han Zhou <hz...@ovn.org> wrote:
>
>
>
> On Thu, Jul 6, 2023 at 1:28 AM Terry Wilson <twil...@redhat.com> wrote:
> >
> > On Wed, Jul 5, 2023 at 9:59 AM Terry Wilson <twil...@redhat.com> wrote:
> > >
> > > On Fri, Jun 30, 2023 at 7:09 PM Han Zhou via discuss
> > > <ovs-discuss@openvswitch.org> wrote:
> > > >
> > > >
> > > >
> > > > On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <
ovs-discuss@openvswitch.org> wrote:
> > > > >
> > > > > Hi Ilya,
> > > > >
> > > > > thank you for the detailed reply
> > > > >
> > > > > On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > > > > > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > > > > > Hi everyone,
> > > > > >
> > > > > > Hi, Felix.
> > > > > >
> > > > > > >
> > > > > > > we are currently running an OVN deployment with 450 nodes. We run
> > > > > > > a 3-node cluster for the northbound database and a 3-node cluster
> > > > > > > for the southbound database.
> > > > > > > Between the southbound cluster and the ovn-controllers we have a
> > > > > > > layer of 24 ovsdb relays.
> > > > > > > The setup uses TLS for all connections; however, the TLS server
> > > > > > > side is handled by a traefik reverse proxy to offload this from the
> > > > > > > ovsdb
> > > > > >
> > > > > > A very important part of the system description is which versions
> > > > > > of OVS and OVN you are using in this setup.  If it's not the latest
> > > > > > 3.1 and 23.03, then it's hard to talk about what performance
> > > > > > improvements, if any, are actually needed.
> > > > > >
> > > > >
> > > > > We are currently running OVS 3.1 and OVN 22.12 (in the process of
> > > > > upgrading to 23.03). `monitor-all` is currently disabled, but we want
> > > > > to try that as well.
> > > > >
> > > > Hi Felix, did you try upgrading and enabling "monitor-all"? How
does it look now?
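
For readers following along: "monitor-all" is an ovn-controller knob set per
chassis in the local Open_vSwitch table. A minimal sketch of how it is
typically enabled, assuming the standard ovs-vsctl tool is available on each
chassis (the Python wrapper is only there to keep all examples in one
language):

    import subprocess

    def enable_monitor_all():
        """Ask the local ovn-controller to monitor the whole SB DB instead of
        using conditional monitoring (trades chassis memory for less
        per-client condition work on the server/relay side)."""
        subprocess.run(
            ["ovs-vsctl", "set", "open_vswitch", ".",
             "external_ids:ovn-monitor-all=true"],
            check=True)

    if __name__ == "__main__":
        enable_monitor_all()
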
> > > >
> > > > > > > Northd and Neutron connect directly to the north- and southbound
> > > > > > > databases without the relays.
> > > > > >
> > > > > > One of the big things that is annoying is that Neutron connects to
> > > > > > the Southbound database at all.  There are some reasons to do that,
> > > > > > but ideally that should be avoided.  I know that in the past,
> > > > > > limiting the number of metadata agents was one of the mitigation
> > > > > > strategies for scaling issues.  Also, why can't it connect to the
> > > > > > relays?  There shouldn't be too many transactions flowing towards the
> > > > > > Southbound DB from Neutron.
> > > > > >
> > > > >
> > > > > Thanks for that suggestion, that definitely makes sense.
> > > > >
> > > > Does this make a big difference? How many Neutron - SB connections
are there?
> > > > What rings a bell is that Neutron is using the python OVSDB library
which hasn't implemented the fast-resync feature (if I remember correctly).
> > >
> > > python-ovs has supported monitor_cond_since since v2.17.0 (though
> > > there may have been a bug that was fixed in 2.17.1). If fast resync
> > > isn't happening, then it should be considered a bug. With that said, I
> > > remember that when I looked at it a year or two ago, ovsdb-server didn't
> > > really use fast resync/monitor_cond_since unless it was running in
> > > raft cluster mode (it would reply, but with the last-txn-id as 0
> > > IIRC?). Does the ovsdb-relay code actually return the last-txn-id? I
> > > can set up an environment and run some tests, but maybe someone else
> > > already knows.
> >
> > Looks like ovsdb-relay does support last-txn-id now:
> > https://github.com/openvswitch/ovs/commit/a3e97b1af1bdcaa802c6caa9e73087df7077d2b1,
> > but only in v3.0+.
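
To make the python-ovs side of this concrete: fast resync is negotiated inside
the IDL itself, so a client does not opt in explicitly. Below is a minimal,
illustrative python-ovs client pointed at a relay (the schema path and remote
are assumptions); whether a reconnect is actually incremental depends on the
server answering monitor_cond_since with a real last-txn-id, i.e. clustered
servers and, per the commit above, relays from v3.0 on:

    import ovs.db.idl
    import ovs.poller

    SCHEMA = "/usr/share/ovn/ovn-sb.ovsschema"   # assumed install path
    REMOTE = "tcp:sb-relay.example.com:6642"     # assumed relay endpoint

    # The IDL issues monitor_cond_since on (re)connect when the server
    # supports it, so after a reconnect only the delta since the last known
    # transaction id needs to be transferred.
    helper = ovs.db.idl.SchemaHelper(SCHEMA)
    helper.register_all()
    idl = ovs.db.idl.Idl(REMOTE, helper)

    poller = ovs.poller.Poller()
    while True:
        idl.run()        # applies updates; reconnects are handled internally
        idl.wait(poller)
        poller.block()
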
> >
>
> Hi Terry, thanks for correcting me, and sorry for my bad memory! And you
are right that fast resync is supported only in cluster mode.
>
> Han
>
> > > > At the same time, there is the feature leader-transfer-for-snapshot,
> > > > which automatically transfers the leader whenever a snapshot is to be
> > > > written, which would happen frequently if your environment is very
> > > > active.
> > >
> > > I believe snapshots should only be happening "no less frequently than
> > > 24 hours, with snapshots if there are more than 100 log entries and
> > > the log size has doubled, but no more frequently than every 10 mins"
> > > or something pretty close to that. So it seems like once the system
> > > got up to its expected size, you would just see snapshots every 24 hours
> > > since you obviously can't double in size forever. But it's possible
> > > I'm reading that wrong.
> > >
Sorry, I forgot to comment on this. It actually doesn't work that way. Suppose
a server's database is size N right after a snapshot/compaction, but there are
tens of transactions per second (deletes, adds, updates, etc.). The log then
grows quickly and may double in size within just a few minutes (while the
actual DB data does not), which triggers another snapshot that reduces the log
size back to N. So it is not uncommon to see snapshots every 10 minutes.
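
To line up the rule Terry quoted with the behaviour described above, here is a
rough sketch of the decision (the real logic is C code inside ovsdb-server and
has additional conditions; the function and parameter names are purely
illustrative):

    from datetime import timedelta

    def should_snapshot(elapsed, log_entries, log_bytes,
                        db_bytes_at_last_snapshot):
        """Simplified model of when ovsdb-server snapshots/compacts,
        following the description quoted above."""
        if elapsed < timedelta(minutes=10):
            return False        # never more often than every 10 minutes
        if elapsed >= timedelta(hours=24):
            return True         # at least once every 24 hours regardless
        # Otherwise: enough log entries *and* the log doubled since the
        # last snapshot (the log, not the actual DB contents).
        return log_entries > 100 and log_bytes >= 2 * db_bytes_at_last_snapshot

With tens of transactions per second, the "log doubled" branch is hit within
minutes, so a busy cluster ends up snapshotting (and, with
leader-transfer-for-snapshot, moving the leader) roughly every 10 minutes, as
described above.
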

Regards,
Han

> > > > If Neutron sets the "leader-only" option (connect only to the leader)
> > > > for the SB DB (could someone confirm?), then every leader transfer makes
> > > > all Neutron workers reconnect to the new leader. With fast resync, as
> > > > implemented in the C and Go IDLs, a client that has cached the data only
> > > > requests the delta when reconnecting. But since the Python lib doesn't
> > > > have this, the Neutron server would re-download the full data when
> > > > reconnecting ...
> > > > This is speculation based on the information I have, and the assumptions
> > > > need to be confirmed.
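
If this speculation holds, one mitigation on the client side is not to insist
on the leader at all: pointing the monitor-heavy workers at relays (as Ilya
suggests) or disabling the leader-only behaviour means a snapshot-triggered
leader transfer does not force every worker to reconnect at once. A hedged
sketch using python-ovs; the leader_only keyword reflects recent python-ovs
releases and should be checked against the version Neutron actually ships
with:

    import ovs.db.idl

    SCHEMA = "/usr/share/ovn/ovn-sb.ovsschema"   # assumed path
    REMOTE = "tcp:sb-relay.example.com:6642"     # a relay, not the raft cluster

    helper = ovs.db.idl.SchemaHelper(SCHEMA)
    helper.register_all()

    # leader_only=False: do not drop the session just because the raft
    # leader moved (it is meaningless when talking to a relay anyway).
    idl = ovs.db.idl.Idl(REMOTE, helper, leader_only=False)

(Whether Neutron actually sets leader-only today, and which of its connections
could tolerate dropping it, is exactly the kind of assumption that needs
confirming.)
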
> > > >
> > > > > > >
> > > > > > > We needed to increase various timeouts on the ovsdb-server and
> > > > > > > client side to get this to a mostly stable state:
> > > > > > > * inactivity probes of 60 seconds (for all connections between
> > > > > > >   ovsdb-server, relay and clients)
> > > > > > > * cluster election time of 50 seconds
> > > > > > >
> > > > > > > As long as none of the relays restarts, the environment is quite
> > > > > > > stable.
> > > > > > > However, we quite regularly see the "Unreasonably long xxx ms poll
> > > > > > > interval" messages, ranging from 1000 ms up to 40000 ms.
> > > > > >
> > > > > > With the latest versions of OVS/OVN, the CPU usage on Southbound DB
> > > > > > servers without relays in our weekly 500-node ovn-heater runs stays
> > > > > > below 10% during the test phase.  No large poll intervals are getting
> > > > > > registered.
> > > > > >
> > > > > > Do you have more details on the circumstances under which these
> > > > > > large poll intervals occur?
> > > > > >
> > > > >
> > > > > It seems to mostly happen on the initial connection of some client
> > > > > to the ovsdb. From the few times we ran perf there, it looks like the
> > > > > time is spent creating a monitor and, as part of that, sending out the
> > > > > updates to the client side.
> > > > >
> > > > One of the worst-case scenarios for OVSDB is many clients initializing
> > > > connections to it at the same time while the amount of data each client
> > > > downloads is large.
> > > > OVSDB relay, from what I understand, should help greatly with this. You
> > > > have 24 relay nodes, which are supposed to share the burden. Are the SB
> > > > DB and the relay instances running with sufficient CPU resources?
> > > > Is it clear which clients' initial connections (ovn-controller or
> > > > Neutron) are causing this? If it is Neutron, the above speculation about
> > > > the lack of fast resync in the Neutron workers may be worth checking.
> > > >
> > > > > If it is of interest, I can try to get a perf report the next time
> > > > > this occurs.
> > > > >
> > > > > > >
> > > > > > > If a large number of relays restart simultaneously, they can also
> > > > > > > cause the ovsdb cluster to fail, as the poll interval exceeds the
> > > > > > > cluster election time.
> > > > > > > This happens even with the relays already syncing the data from
> > > > > > > all 3 ovsdb servers.
> > > > > >
> > > > > > There was a performance issue with upgrades and simultaneous
> > > > > > reconnections, but it should be mostly fixed on the current
master
> > > > > > branch, i.e. in the upcoming 3.2 release:
> > > > > >
https://patchwork.ozlabs.org/project/openvswitch/list/?series=348259&state=*
> > > > > >
> > > > >
> > > > > That sounds like it might be similar to when our issue occurs. I'll
> > > > > see if we can try this out.
> > > > >
> > > > > > >
> > > > > > > We would like to improve this significantly to ensure, on the one
> > > > > > > hand, that our ovsdb clusters survive unplanned load without issues
> > > > > > > and, on the other hand, that the poll intervals stay short.
> > > > > > > A short poll interval would allow us to act on failovers of
> > > > > > > distributed gateway ports and virtual ports in a timely manner
> > > > > > > (ideally below 1 second).
> > > > > >
> > > > > > These are good goals.  But are you sure they are not already
> > > > > > addressed with the most recent versions of OVS/OVN ?
> > > > > >
> > > > >
> > > > > I was not sure, but all your feedback helped clarify that.
> > > > >
> > > > > > >
> > > > > > > To do this we found the following solutions that were
discussed in the past:
> > > > > > > 1. Implementing multithreading for ovsdb
https://patchwork.ozlabs.org/project/openvswitch/list/?series=&submitter=&state=*&q=multithreading&archive=&delegate=
> > > > > >
> > > > > > We moved the compaction process to a separate thread in 3.0.
> > > > > > This partially addressed the multi-threading topic.  General
> > > > > > handling of client requests/updates in separate threads will
> > > > > > require significant changes in the internal architecture,
AFAICT.
> > > > > > So, I'd like to avoid doing that unless necessary.  So far we
> > > > > > were able to overcome almost all the performance challenges
> > > > > > with simple algorithmic changes instead.
> > > > > >
> > > > >
> > > > > I definitely get that, since that would be quite a complex change to
> > > > > do.
> > > > > The only benefit I would see in having clients in separate threads is
> > > > > that it reduces the impact of performance challenges.
> > > > > E.g. it would still allow the cluster to work together healthily and
> > > > > make progress, while only individual reconnects would be slow.
> > > > >
> > > > > That benefit would be quite significant from my perspective, as it
> > > > > makes the solution more resilient. But I'm not sure if it's worth the
> > > > > additional complexity.
> > > > >
> > > > Multithreading for general OVSDB tasks (transactions, monitoring) seems
> > > > more complex to implement, and the outcome should be very similar to
> > > > OVSDB relay (which is multi-process instead of multi-threaded), except
> > > > that multi-threading may have a smaller memory footprint.
> > > > Multithreading for the RAFT cluster RPC may help keep the cluster
> > > > healthy when server load is high, but the same can be achieved by
> > > > setting a longer election timer. I agree there is a subtle difference
> > > > when you want fast failure detection for things like a node crash but
> > > > can tolerate overloaded servers that can barely respond to clients.
> > > >
> > > > Looking forward to hearing back from you regarding the situation.
> > > >
> > > > Thanks,
> > > > Han
> > > >
> > > > > > > 2. Changing the storage backend of OVN to an alternative
(e.g. etcd)
https://mail.openvswitch.org/pipermail/ovs-discuss/2016-July/041733.html
> > > > > >
> > > > > > There was an ovsdb-etcd project, but it didn't manage to provide
> > > > > > better performance in comparison with ovsdb-server.  So it was
> > > > > > ultimately abandoned: https://github.com/IBM/ovsdb-etcd
> > > > > >
> > > > > > >
> > > > > > > Both of these discussions are from 2016; I am not sure if more
> > > > > > > up-to-date ones exist.
> > > > > > >
> > > > > > > I would like to ask if there are already existing discussions
on scaling ovsdb further/faster?
> > > > > >
> > > > > > This again comes down to the question of what versions you're
> > > > > > using.  I'm currently not aware of any major performance issues for
> > > > > > ovsdb-server on the most recent code, besides the conditional
> > > > > > monitoring, which is not entirely OVSDB server's issue.  And it is
> > > > > > also likely to become a bit better in 3.2:
> > > > > > https://patchwork.ozlabs.org/project/openvswitch/patch/20230518121425.550048-1-i.maxim...@ovn.org/
> > > > > >
> > > > >
> > > > > That also sounds like quite an interesting change that might help us
> > > > > here.
> > > > >
> > > > > > >
> > > > > > > From my perspective, whatever such a solution might be, it would
> > > > > > > no longer require relays and would allow the ovsdb servers to
> > > > > > > handle load gracefully.
> > > > > > > I personally think multithreading for ovsdb sounds quite promising,
> > > > > > > as it would allow us to separate the raft/cluster communication
> > > > > > > from the client connections.
> > > > > > > This should allow us to keep the cluster healthy even under
> > > > > > > significant client pressure.
> > > > > >
> > > > > > Again, good goals.  I'm just not sure if we actually need to do
> > > > > > something or if they are already achievable with the most
recent code.
> > > > > >
> > > > > > I understand that testing on prod is not an option, so it's
unlikely
> > > > > > we'll have an accurate test.  But maybe you can participate in
the
> > > > > > initiative [1] for creation of ovn-heater OpenStack scenarios
that
> > > > > > might be close to workloads you have?  This way upstream will
be able
> > > > > > to test your use-cases or at least something similar.
> > > > > >
> > > > > > Most of our current efforts are focused on the ovn-kubernetes use
> > > > > > case, because we don't have many details on what high-scale
> > > > > > OpenStack deployments look like.
> > > > > >
> > > > > > [1]
https://mail.openvswitch.org/pipermail/ovs-dev/2023-May/404488.html
> > > > > >
> > > > >
> > > > > That looks very interesting and would also help us run scale tests.
> > > > > I'll get in contact with whoever is working on that to help out as
> > > > > well.
> > > > >
> > > > > > Best regards, Ilya Maximets.
> > > > > >
> > > > > > >
> > > > > > > Thank you
> > > > > > >
> > > > > > > --
> > > > > > > Felix Huettner
> > > > > >
> > > > >
> > > > > Thanks for all of the detailed insights.
> > > > > Felix