On Tue, Oct 30, 2018 at 11:51:05PM -0700, Han Zhou wrote:
> On Tue, Oct 30, 2018 at 11:15 AM Ben Pfaff <[email protected]> wrote:
> >
> > On Wed, Oct 24, 2018 at 05:42:15PM -0700, Han Zhou wrote:
> > > On Tue, Sep 25, 2018 at 10:18 AM Han Zhou <[email protected]> wrote:
> > > >
> > > > On Thu, Sep 20, 2018 at 4:43 PM Ben Pfaff <[email protected]> wrote:
> > > > >
> > > > > On Thu, Sep 13, 2018 at 12:28:27PM -0700, Han Zhou wrote:
> > > > > > In scalability tests with ovn-scale-test, the SB ovsdb-server load is not a problem, at least with 1k HVs. However, if we restart ovsdb-server, then depending on the number of HVs and the scale of the logical objects (e.g. the number of logical ports), the SB ovsdb-server becomes an obvious bottleneck.
> > > > > >
> > > > > > In our test with 1k HVs and 20k logical ports (200 lports * 100 lswitches connected by a single logical router), restarting the SB ovsdb-server resulted in 100% CPU on ovsdb-server for more than 1 hour. All HVs (and northd) reconnect and resync the large amount of data at the same time. Considering the amount of data and the JSON-RPC cost, this is not surprising.
> > > > > >
> > > > > > At this scale, the SB ovsdb-server process has RES of 303848 KB before the restart. It is likely that a big proportion of this size is SB DB data that is going to be transferred to all 1,001 clients, which is about 300 GB in total. With a 10 Gbps NIC, even the pure network transmission would take ~5 minutes. Considering that the actual size of the JSON-RPC messages would be much bigger than the raw data, plus the processing cost of the single-threaded ovsdb-server, 1 hour is not unreasonable.
> > > > > >
> > > > > > In addition to the CPU cost of ovsdb-server, the memory consumption could also be a problem. Since all clients are syncing data from it, probably due to buffering, RES increases quickly and spiked to 10 GB at some point. After all the syncing finished, RES went back to a similar size as before the restart. The client side (ovn-controller, northd) also saw a memory spike - the new snapshot of the whole DB is downloaded as one huge JSON-RPC message, so it is simply buffered until the whole message is received - RES peaked at double its original size and then went back to the original size after the first round of processing of the new snapshot. This means that, for deploying OVN, this memory spike should be considered for the SB DB restart scenario, especially on the central node.
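
A quick back-of-envelope check of the figures quoted above (the numbers come from the mail itself; the snippet is only an illustration):

    # Rough arithmetic behind the "~300 GB / ~5 minutes" estimate above.
    db_size_kb = 303848              # RES of the SB ovsdb-server before restart
    clients = 1001                   # 1,000 HVs plus ovn-northd
    total_bytes = db_size_kb * 1024 * clients
    print("total raw data: %.0f GB" % (total_bytes / 1e9))      # ~311 GB

    nic_bps = 10e9                   # 10 Gbps NIC
    seconds = total_bytes * 8 / nic_bps
    print("pure wire time: %.1f minutes" % (seconds / 60))      # ~4-5 minutes
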
> > > > > > Here is some of my brainstorming on how we could improve on this (very rough ideas at this stage). There are two directions: 1) reducing the size of the data to be transferred, and 2) scaling out ovsdb-server.
> > > > > >
> > > > > > 1) Reducing the size of the data to be transferred.
> > > > > >
> > > > > > 1.1) Use BSON instead of JSON. It could reduce the size of the data, but I am not sure yet how much it would help, since most of the data is strings. It might even be worse, because the bottleneck is not yet the network bandwidth but the processing power of ovsdb-server.
> > > > > >
> > > > > > 1.2) Move northd processing to the HVs - only the relevant NB data would need to be transferred, which is much smaller than the SB DB because there are no logical flows. However, this would put more processing load on ovn-controller on the HVs. It is also a huge architecture change.
> > > > > >
> > > > > > 1.3) Incremental data transfer. The way the IDL works is like a cache. Today, when the connection is reset, the cache has to be rebuilt. But if we know the version of the current snapshot, then even when the connection is reset, the client can still tell the newly started server which data it already has, so that only the delta is transferred, as if the server had never restarted at all.
> > > > > >
> > > > > > 2) Scaling out ovsdb-server.
> > > > > >
> > > > > > 2.1) Currently ovsdb-server is single-threaded, so that single thread has to take care of transmission to all clients at 100% CPU. If it were multi-threaded, more cores could be utilized to make this much faster.
> > > > > >
> > > > > > 2.2) Use an ovsdb cluster. This feature is supported already, but I haven't tested it in this scenario yet. If everything works as expected, 3 - 5 servers can share the load, so the transfer should complete 3 - 5 times faster than it does right now. However, there is a limit to how many nodes a cluster can have, so the problem can be alleviated but may return if the data size grows.
> > > > > >
> > > > > > 2.3) Use read-only copies via ovsdb replication. If ovn-controller connects to read-only copies, we can deploy a large number of SB ovsdb-servers that replicate from a common source - the read/write one populated by ovn-northd. It can be a multi-layered tree structure (2 - 3 layers is big enough), so that each server only serves a small number of clients. However, today there are scenarios that require ovn-controller to write data to the SB, such as dynamic MAC binding (populating the neighbor table), the nb_cfg sync feature, etc.
> > > > > >
> > > > > > These ideas are not mutually exclusive, and their order just follows my thought process. I think most of them are worth trying, but I am not sure about priority (except that 1.2 is almost out of the question, since I don't think it is a good idea to make any architecture-level change at this phase). Among the ideas, I think 1.3, 2.1, and 2.3 are the ones that should have the best results (if they can be implemented with reasonable effort).
> > > > >
> > > > > It sounds like reducing the size is essential, because you say that the sheer quantity of data is 5 minutes' worth of raw bandwidth. Let's go through the options there.
> > > > >
> > > > > 1.1, using BSON instead of JSON, won't help sizewise. See http://bsonspec.org/faq.html.
> > > > >
> > > > > 1.2 would change the OVN architecture, so I don't think it's a good idea.
> > > > >
> > > > > 1.3, incremental data transfer, is an idea that Andy Zhou explored a little bit before he left.
> > > > > There is some description of the approach I suggested in ovn/TODO.rst:
> > > > >
> > > > >   * Reducing startup time.
> > > > >
> > > > >     As-is, if ovsdb-server restarts, every client will fetch a fresh copy of the part of the database that it cares about. With hundreds of clients, this could cause heavy CPU load on ovsdb-server and use excessive network bandwidth. It would be better to allow incremental updates even across connection loss. One way might be to use "Difference Digests" as described in Epstein et al., "What's the Difference? Efficient Set Reconciliation Without Prior Context". (I'm not yet aware of previous non-academic use of this technique.)
> > > > >
> > > > > When Andy left VMware, the project got dropped, but it could be picked up again.
> > > > >
> > > > > There are other ways to implement incremental data transfer, too.
> > > > >
> > > > > Scaling out ovsdb-server is a good idea too, but I think it's probably less important for this particular case than reducing bandwidth requirements.
> > > > >
> > > > > 2.1, multithreading, is also something that Andy explored; again, the project would have to be resumed or restarted.
> > > >
> > > > Thanks Ben! It seems 1.3 (incremental data transfer) is the most effective approach for solving this problem.
> > > >
> > > > I did a brief study of the "Difference Digests" paper. My understanding is that it is particularly useful when there is no prior context. However, in the OVSDB use case, especially in this OVN DB restart scenario, we do have the context of the last data received from the server. I think it would be more efficient (no full data scanning and encoding) and maybe simpler to implement something based on the append-only nature of OVSDB (true in most cases, for the part of the data that hasn't been compacted yet). Here is what I have in mind:
> > >
> > > With some more study, here are some more details, added inline:
> > >
> > > > - We need a versioning mechanism. Each transaction record in the OVSDB file needs a unique version. The hash string may be used for this purpose, so that the file format doesn't need to be changed. If we allow the file format to be changed, it may be better to have a version number that increases sequentially.
> > >
> > > For the standalone format, the hash value can be used as the unique version id. (What's the chance of the 10-byte hash value having a conflict?) For the clustered DB format, the eid is perfect for this purpose. (Sequentially increasing ids don't seem really necessary, since we only need to keep a small amount of transaction history for the DB restart scenario.)
> > >
> > > > - We can include the latest version number in every OVSDB notification from server to client, and the client IDL records the version.
> > > >
> > > > - When a client reconnects to the server, it can request only the changes after its "last version".
> > > >
> > > > - When a server starts up, it reads the DB file, keeps track of the "version" of the last N (e.g. 100) transactions, and maintains the changes of those N transactions in memory.
> > >
> > > The current implementation keeps track, in the monitors, of the transactions that haven't yet been flushed to all clients. We can extend this by keeping track of an extra N previous transactions. For the DB restart scenario, those previous N transactions are read from the DB file.
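
A rough sketch of the transaction-history bookkeeping described above (hypothetical names and structure, not the actual ovsdb-server code):

    from collections import OrderedDict

    class TxnHistory:
        """Sketch of the per-server transaction history described above:
        the deltas of the last N transactions, keyed by their version id."""

        def __init__(self, max_txns=100):
            self.max_txns = max_txns
            self.txns = OrderedDict()          # version id -> delta of that transaction

        def record(self, version_id, delta):
            """Remember one committed (or replayed-from-file) transaction."""
            self.txns[version_id] = delta
            while len(self.txns) > self.max_txns:
                self.txns.popitem(last=False)  # drop the oldest entry

        def changes_since(self, last_version_id):
            """Return (found, deltas).  If last_version_id is still in the
            history, deltas holds everything committed after it; otherwise
            found is False and the caller falls back to a full snapshot."""
            if last_version_id not in self.txns:
                return False, None
            versions = list(self.txns)
            start = versions.index(last_version_id) + 1
            return True, [self.txns[v] for v in versions[start:]]

On reconnect, the server would look up the version id supplied by the client with changes_since() and fall back to sending the whole snapshot when that id is no longer in the history.
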
> > > Currently, while the DB file is being read, no monitors are connected yet, so replaying transactions will not trigger monitor data population. We can create a fake monitor for all tables beforehand, so that we can reuse the code that populates monitor data while replaying the DB file transactions. When a real monitor is created, it copies and converts the data from the fake monitor into its own table according to its monitor criteria. Flushing the data of the fake monitor can be done so that at most M transactions are kept in it (because there is no real client to consume the fake monitor data).
> > >
> > > > - When a new client asks for data after a given "version", if that version is among the N tracked transactions, the server sends the data after it to the client. If the given version is not found (e.g. by the time the client reconnects, a lot of changes have already happened on the server and the old changes were flushed out, or a DB compaction has been performed so the transaction data is gone), the server can:
> > > >   - Option 1: return an error telling the client that the version is not available, and the client can re-request the whole snapshot.
> > > >   - Option 2: directly send the whole snapshot, with a flag indicating that this is the whole snapshot rather than a delta.
> > >
> > > Option 2 seems better.
> > >
> > > As to the OVSDB protocol change, since we need to add a version id to every update notification, it would be better to have a new method "monitor_cond_since":
> > >
> > >     "method": "monitor_cond_since"
> > >     "params": [<db-name>, <json-value>, <monitor-cond-requests>, <latest_version_id>]
> > >     "id": <nonnull-json-value>
> > >
> > > <latest_version_id> is the version id that identifies the latest data the client already has. Everything else is the same as the "monitor_cond" method.
> > >
> > > The response object has the following members:
> > >
> > >     "result": [<found>, <latest_version_id>, <table-updates2>]
> > >     "error": null
> > >     "id": same "id" as request
> > >
> > > <found> is a boolean that tells whether the <latest_version_id> requested by the client was found in the history. If true, only the data after that version up to the current state is sent; otherwise, all data is sent. <latest_version_id> is the version id that identifies the latest change included in this response, so that the client can keep track of it. Subsequent changes are notified to the client using the "update3" method:
> > >
> > >     "method": "update3"
> > >     "params": [<json-value>, <latest_version_id>, <table-updates2>]
> > >     "id": null
> > >
> > > Similar to the response to "monitor_cond_since", <latest_version_id> is added in the "update3" method.
> > >
> > > > - The client will not destroy its old copy of the data, unless the requested version is not available and it has to reinitialize with the whole DB.
> > > >
> > > > This is less general than the Difference Digests approach, but I think it is sufficient and more relevant for the OVSDB use case. I am sure there are details that need more consideration, e.g. the OVSDB protocol update, caching in the server, etc., but do you think this is the right direction?
> > >
> > > Ben, please let me know if this seems reasonable so that I can go ahead with a POC, or if you see obvious problems/pitfalls.
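
For concreteness, a filled-in example of the exchange proposed above (the method names and field layout follow the proposal; the database name, monitor id, condition, and version ids are made-up illustrative values):

    import json

    # Filled-in example of the proposed "monitor_cond_since" exchange.
    request = {
        "method": "monitor_cond_since",
        "params": [
            "OVN_Southbound",                                  # <db-name>
            "sb-monitor-1",                                    # <json-value> identifying the monitor
            {"Port_Binding": [{"columns": ["logical_port", "chassis"]}]},
            "a1b2c3d4e5",                                      # <latest_version_id> the client already has
        ],
        "id": 1,
    }

    # If the server still has "a1b2c3d4e5" in its history, <found> is true
    # and <table-updates2> only carries the changes made after it; otherwise
    # <found> is false and <table-updates2> carries the full contents.
    response = {
        "result": [True, "f6a7b8c9d0", {"Port_Binding": {}}],  # <found>, <latest_version_id>, <table-updates2>
        "error": None,
        "id": 1,
    }

    # Later changes arrive through the proposed "update3" notification,
    # which also carries the new <latest_version_id> for the client to record.
    update3 = {
        "method": "update3",
        "params": ["sb-monitor-1", "0d1e2f3a4b", {"Port_Binding": {}}],
        "id": None,
    }

    print(json.dumps(request, indent=2))
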
> > I think that a lot of this analysis is on point.
> >
> > I'm pretty worried about the actual implementation. The monitor code is already hard to understand. I am not confident that there is a straightforward way to keep track of the last 100 transactions in a sensible way. (Bear in mind that different clients may monitor different portions of the database.) It might make sense to introduce a multiversioned data structure to keep track of the database content--maybe it would actually simplify some things.
>
> I understand the concern about the complexity. Thinking more about it, the problem is only about DB restart, so we don't really need to keep track of the last N transactions all the time. We only need to have them available for a short period after the initial DB read. Once all the old clients have reconnected and requested their differences, these data are useless, and going forward, new clients won't need any previous transactions either. So instead of always keeping track of the last N transactions and maintaining a sliding window, we just need the initial part, which can be generated with the "fake monitor" approach I mentioned before; then each new real monitor (created when clients reconnect) selectively copies the data with a *timestamp* added to each entry, so that we can free them after a certain amount of time (e.g. 5 minutes). Entries that are NOT copied from the fake monitor, i.e. new transactions, can be freed any time after their predecessors are freed. The "fake monitor" itself will have a timestamp too, so that it can be deleted after the initial period. Would this simplify the problem? (Or maybe my description makes it sound more complex. :))
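
A minimal sketch of the timestamp-based freeing described above (hypothetical names; the real monitor code in ovsdb-server would look different):

    import time

    EXPIRY_SECONDS = 300   # e.g. 5 minutes, as suggested above

    class ReplayedChange:
        """One change entry copied from the "fake monitor" into a real
        monitor, stamped with the time it was copied."""
        def __init__(self, version_id, delta):
            self.version_id = version_id
            self.delta = delta
            self.copied_at = time.monotonic()

    def expire_replayed_changes(changes):
        """Free entries that came from the fake monitor once they are older
        than EXPIRY_SECONDS; changes committed after the restart are handled
        by the normal monitor flushing logic instead."""
        now = time.monotonic()
        return [c for c in changes if now - c.copied_at < EXPIRY_SECONDS]
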
> Could you explain a little more about the multiversioned data structure idea? I am not sure I understand it correctly.

Well, how do you plan to maintain the multiple copies of the database that will be necessary? Presumably each monitor needs a copy of a slightly different database. Or maybe I just don't understand your plan yet.

> > If we do this, we need solid and thorough tests to ensure that it's reliable. It might make sense to start by thinking through the tests, rather than the implementation.
>
> Good point, I may start with tests first.
