On Tue, Oct 30, 2018 at 11:51:05PM -0700, Han Zhou wrote:
> On Tue, Oct 30, 2018 at 11:15 AM Ben Pfaff <[email protected]> wrote:
> >
> > On Wed, Oct 24, 2018 at 05:42:15PM -0700, Han Zhou wrote:
> > > On Tue, Sep 25, 2018 at 10:18 AM Han Zhou <[email protected]> wrote:
> > > >
> > > > On Thu, Sep 20, 2018 at 4:43 PM Ben Pfaff <[email protected]> wrote:
> > > > >
> > > > > On Thu, Sep 13, 2018 at 12:28:27PM -0700, Han Zhou wrote:
> > > > > > In scalability tests with ovn-scale-test, the SB ovsdb-server load is not a problem, at least with 1k HVs. However, if we restart ovsdb-server, then depending on the number of HVs and the scale of the logical objects (e.g. the number of logical ports), the SB ovsdb-server becomes an obvious bottleneck.
> > > > > >
> > > > > > In our test with 1k HVs and 20k logical ports (200 lports * 100 lswitches connected by a single logical router), restarting the SB ovsdb-server resulted in 100% CPU on ovsdb-server for more than 1 hour. All HVs (and northd) reconnect and resync the large amount of data at the same time. Considering the amount of data and the JSON-RPC cost, this is not surprising.
> > > > > >
> > > > > > At this scale, the SB ovsdb-server process has RES of 303848 KB before the restart. It is likely that a big proportion of this size is SB DB data that is going to be transferred to all 1,001 clients, which is about 300 GB in total. With a 10 Gbps NIC, even the pure network transmission would take ~5 minutes. Considering that the actual size of the JSON-RPC messages would be much bigger than the raw data, plus the processing cost of the single-threaded ovsdb-server, 1 hour is not unreasonable.
> > > > > >
> > > > > > In addition to the CPU cost of ovsdb-server, the memory consumption could also be a problem. Since all clients are syncing data from it, probably due to buffering, RES increases quickly and spiked to 10 GB at some point. After all the syncing finished, RES went back to a similar size as before the restart. The client side (ovn-controller, northd) also saw a memory spike - the new snapshot of the whole DB is downloaded as one huge JSON-RPC message, so it is simply buffered until the whole message is received - RES peaked at double its original size and then went back to the original size after the first round of processing of the new snapshot. This means that, for deploying OVN, this memory spike should be considered for the SB DB restart scenario, especially on the central node.
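
A quick back-of-envelope check of the figures quoted above (the numbers come from the mail itself; the snippet is only an illustration):

    # Rough arithmetic behind the "~300 GB / ~5 minutes" estimate above.
    db_size_kb = 303848              # RES of the SB ovsdb-server before restart
    clients = 1001                   # 1,000 HVs plus ovn-northd
    total_bytes = db_size_kb * 1024 * clients
    print("total raw data: %.0f GB" % (total_bytes / 1e9))      # ~311 GB

    nic_bps = 10e9                   # 10 Gbps NIC
    seconds = total_bytes * 8 / nic_bps
    print("pure wire time: %.1f minutes" % (seconds / 60))      # ~4-5 minutes
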
> > > > > > Here is some of my brainstorming on how we could improve on this (very rough ideas at this stage). There are two directions: 1) reducing the size of the data to be transferred, and 2) scaling out ovsdb-server.
> > > > > >
> > > > > > 1) Reducing the size of the data to be transferred.
> > > > > >
> > > > > > 1.1) Use BSON instead of JSON. It could reduce the size of the data, but I am not sure yet how much it would help, since most of the data is strings. It might even be worse, because the bottleneck is not yet the network bandwidth but the processing power of ovsdb-server.
> > > > > >
> > > > > > 1.2) Move northd processing to the HVs - only the relevant NB data would need to be transferred, which is much smaller than the SB DB because there are no logical flows. However, this would put more processing load on ovn-controller on the HVs. It is also a huge architecture change.
> > > > > >
> > > > > > 1.3) Incremental data transfer. The way the IDL works is like a cache. Today, when the connection is reset, the cache has to be rebuilt. But if we know the version of the current snapshot, then even when the connection is reset, the client can still tell the newly started server which data it already has, so that only the delta is transferred, as if the server had never restarted at all.
> > > > > >
> > > > > > 2) Scaling out ovsdb-server.
> > > > > >
> > > > > > 2.1) Currently ovsdb-server is single-threaded, so that single thread has to take care of transmission to all clients at 100% CPU. If it were multi-threaded, more cores could be utilized to make this much faster.
> > > > > >
> > > > > > 2.2) Use an ovsdb cluster. This feature is supported already, but I haven't tested it in this scenario yet. If everything works as expected, 3 - 5 servers can share the load, so the transfer should complete 3 - 5 times faster than it does right now. However, there is a limit to how many nodes a cluster can have, so the problem can be alleviated but may return if the data size grows.
> > > > > >
> > > > > > 2.3) Use read-only copies via ovsdb replication. If ovn-controller connects to read-only copies, we can deploy a large number of SB ovsdb-servers that replicate from a common source - the read/write one populated by ovn-northd. It can be a multi-layered tree structure (2 - 3 layers is big enough), so that each server only serves a small number of clients. However, today there are scenarios that require ovn-controller to write data to the SB, such as dynamic MAC binding (populating the neighbor table), the nb_cfg sync feature, etc.
> > > > > >
> > > > > > These ideas are not mutually exclusive, and their order just follows my thought process. I think most of them are worth trying, but I am not sure about priority (except that 1.2 is almost out of the question, since I don't think it is a good idea to make any architecture-level change at this phase). Among the ideas, I think 1.3, 2.1, and 2.3 are the ones that should have the best results (if they can be implemented with reasonable effort).
> > > > >
> > > > > It sounds like reducing the size is essential, because you say that the sheer quantity of data is 5 minutes' worth of raw bandwidth. Let's go through the options there.
> > > > >
> > > > > 1.1, using BSON instead of JSON, won't help sizewise. See http://bsonspec.org/faq.html.
> > > > >
> > > > > 1.2 would change the OVN architecture, so I don't think it's a good idea.
> > > > >
> > > > > 1.3, incremental data transfer, is an idea that Andy Zhou explored a little bit before he left.
> > > > > There is some description of the approach I suggested in ovn/TODO.rst:
> > > > >
> > > > >   * Reducing startup time.
> > > > >
> > > > >     As-is, if ovsdb-server restarts, every client will fetch a fresh copy of the part of the database that it cares about. With hundreds of clients, this could cause heavy CPU load on ovsdb-server and use excessive network bandwidth. It would be better to allow incremental updates even across connection loss. One way might be to use "Difference Digests" as described in Epstein et al., "What's the Difference? Efficient Set Reconciliation Without Prior Context". (I'm not yet aware of previous non-academic use of this technique.)
> > > > >
> > > > > When Andy left VMware, the project got dropped, but it could be picked up again.
> > > > >
> > > > > There are other ways to implement incremental data transfer, too.
> > > > >
> > > > > Scaling out ovsdb-server is a good idea too, but I think it's probably less important for this particular case than reducing bandwidth requirements.
> > > > >
> > > > > 2.1, multithreading, is also something that Andy explored; again, the project would have to be resumed or restarted.
> > > >
> > > > Thanks Ben! It seems 1.3 (incremental data transfer) is the most effective approach for solving this problem.
> > > >
> > > > I did a brief study of the "Difference Digests" paper. My understanding is that it is particularly useful when there is no prior context. However, in the OVSDB use case, especially in this OVN DB restart scenario, we do have the context of the last data received from the server. I think it would be more efficient (no full data scanning and encoding) and maybe simpler to implement something based on the append-only nature of OVSDB (true in most cases, for the part of the data that hasn't been compacted yet). Here is what I have in mind:
> > >
> > > With some more study, here are some more details, added inline:
> > >
> > > > - We need a versioning mechanism. Each transaction record in the OVSDB file needs a unique version. The hash string may be used for this purpose, so that the file format doesn't need to be changed. If we allow the file format to be changed, it may be better to have a version number that increases sequentially.
> > >
> > > For the standalone format, the hash value can be used as the unique version id. (What's the chance of the 10-byte hash value having a conflict?) For the clustered DB format, the eid is perfect for this purpose. (Sequentially increasing ids don't seem really necessary, since we only need to keep a small amount of transaction history for the DB restart scenario.)
> > >
> > > > - We can include the latest version number in every OVSDB notification from server to client, and the client IDL records the version.
> > > >
> > > > - When a client reconnects to the server, it can request only the changes after its "last version".
> > > >
> > > > - When a server starts up, it reads the DB file, keeps track of the "version" of the last N (e.g. 100) transactions, and maintains the changes of those N transactions in memory.
> > >
> > > The current implementation keeps track, in the monitors, of the transactions that haven't yet been flushed to all clients. We can extend this by keeping track of an extra N previous transactions. For the DB restart scenario, those previous N transactions are read from the DB file.
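
A rough sketch of the transaction-history bookkeeping described above (hypothetical names and structure, not the actual ovsdb-server code):

    from collections import OrderedDict

    class TxnHistory:
        """Sketch of the per-server transaction history described above:
        the deltas of the last N transactions, keyed by their version id."""

        def __init__(self, max_txns=100):
            self.max_txns = max_txns
            self.txns = OrderedDict()          # version id -> delta of that transaction

        def record(self, version_id, delta):
            """Remember one committed (or replayed-from-file) transaction."""
            self.txns[version_id] = delta
            while len(self.txns) > self.max_txns:
                self.txns.popitem(last=False)  # drop the oldest entry

        def changes_since(self, last_version_id):
            """Return (found, deltas).  If last_version_id is still in the
            history, deltas holds everything committed after it; otherwise
            found is False and the caller falls back to a full snapshot."""
            if last_version_id not in self.txns:
                return False, None
            versions = list(self.txns)
            start = versions.index(last_version_id) + 1
            return True, [self.txns[v] for v in versions[start:]]

On reconnect, the server would look up the version id supplied by the client with changes_since() and fall back to sending the whole snapshot when that id is no longer in the history.
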
> > > Currently, while the DB file is being read, no monitors are connected yet, so replaying transactions will not trigger monitor data population. We can create a fake monitor for all tables beforehand, so that we can reuse the code that populates monitor data while replaying the DB file transactions. When a real monitor is created, it copies and converts the data from the fake monitor into its own table according to its monitor criteria. Flushing the data of the fake monitor can be done so that at most M transactions are kept in it (because there is no real client to consume the fake monitor data).
> > >
> > > > - When a new client asks for data after a given "version", if that version is among the N tracked transactions, the server sends the data after it to the client. If the given version is not found (e.g. by the time the client reconnects, a lot of changes have already happened on the server and the old changes were flushed out, or a DB compaction has been performed so the transaction data is gone), the server can:
> > > >   - Option 1: return an error telling the client that the version is not available, and the client can re-request the whole snapshot.
> > > >   - Option 2: directly send the whole snapshot, with a flag indicating that this is the whole snapshot rather than a delta.
> > >
> > > Option 2 seems better.
> > >
> > > As to the OVSDB protocol change, since we need to add a version id to every update notification, it would be better to have a new method "monitor_cond_since":
> > >
> > >     "method": "monitor_cond_since"
> > >     "params": [<db-name>, <json-value>, <monitor-cond-requests>, <latest_version_id>]
> > >     "id": <nonnull-json-value>
> > >
> > > <latest_version_id> is the version id that identifies the latest data the client already has. Everything else is the same as the "monitor_cond" method.
> > >
> > > The response object has the following members:
> > >
> > >     "result": [<found>, <latest_version_id>, <table-updates2>]
> > >     "error": null
> > >     "id": same "id" as request
> > >
> > > <found> is a boolean that tells whether the <latest_version_id> requested by the client was found in the history. If true, only the data after that version up to the current state is sent; otherwise, all data is sent. <latest_version_id> is the version id that identifies the latest change included in this response, so that the client can keep track of it. Subsequent changes are notified to the client using the "update3" method:
> > >
> > >     "method": "update3"
> > >     "params": [<json-value>, <latest_version_id>, <table-updates2>]
> > >     "id": null
> > >
> > > Similar to the response to "monitor_cond_since", <latest_version_id> is added in the "update3" method.
> > >
> > > > - The client will not destroy its old copy of the data, unless the requested version is not available and it has to reinitialize with the whole DB.
> > > >
> > > > This is less general than the Difference Digests approach, but I think it is sufficient and more relevant for the OVSDB use case. I am sure there are details that need more consideration, e.g. the OVSDB protocol update, caching in the server, etc., but do you think this is the right direction?
> > >
> > > Ben, please let me know if this seems reasonable so that I can go ahead with a POC, or if you see obvious problems/pitfalls.
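
For concreteness, a filled-in example of the exchange proposed above (the method names and field layout follow the proposal; the database name, monitor id, condition, and version ids are made-up illustrative values):

    import json

    # Filled-in example of the proposed "monitor_cond_since" exchange.
    request = {
        "method": "monitor_cond_since",
        "params": [
            "OVN_Southbound",                                  # <db-name>
            "sb-monitor-1",                                    # <json-value> identifying the monitor
            {"Port_Binding": [{"columns": ["logical_port", "chassis"]}]},
            "a1b2c3d4e5",                                      # <latest_version_id> the client already has
        ],
        "id": 1,
    }

    # If the server still has "a1b2c3d4e5" in its history, <found> is true
    # and <table-updates2> only carries the changes made after it; otherwise
    # <found> is false and <table-updates2> carries the full contents.
    response = {
        "result": [True, "f6a7b8c9d0", {"Port_Binding": {}}],  # <found>, <latest_version_id>, <table-updates2>
        "error": None,
        "id": 1,
    }

    # Later changes arrive through the proposed "update3" notification,
    # which also carries the new <latest_version_id> for the client to record.
    update3 = {
        "method": "update3",
        "params": ["sb-monitor-1", "0d1e2f3a4b", {"Port_Binding": {}}],
        "id": None,
    }

    print(json.dumps(request, indent=2))
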
> > I think that a lot of this analysis is on point.
> >
> > I'm pretty worried about the actual implementation. The monitor code is already hard to understand. I am not confident that there is a straightforward way to keep track of the last 100 transactions in a sensible way. (Bear in mind that different clients may monitor different portions of the database.) It might make sense to introduce a multiversioned data structure to keep track of the database content--maybe it would actually simplify some things.
>
> I understand the concern about the complexity. Thinking more about it, the problem is only about DB restart, so we don't really need to keep track of the last N transactions all the time. We only need to have them available for a short period after the initial DB read. Once all the old clients have reconnected and requested their differences, these data are useless, and going forward, new clients won't need any previous transactions either. So instead of always keeping track of the last N transactions and maintaining a sliding window, we just need the initial part, which can be generated with the "fake monitor" approach I mentioned before; then each new real monitor (created when clients reconnect) selectively copies the data with a *timestamp* added to each entry, so that we can free them after a certain amount of time (e.g. 5 minutes). Entries that are NOT copied from the fake monitor, i.e. new transactions, can be freed any time after their predecessors are freed. The "fake monitor" itself will have a timestamp too, so that it can be deleted after the initial period. Would this simplify the problem? (Or maybe my description makes it sound more complex. :))
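
A minimal sketch of the timestamp-based freeing described above (hypothetical names; the real monitor code in ovsdb-server would look different):

    import time

    EXPIRY_SECONDS = 300   # e.g. 5 minutes, as suggested above

    class ReplayedChange:
        """One change entry copied from the "fake monitor" into a real
        monitor, stamped with the time it was copied."""
        def __init__(self, version_id, delta):
            self.version_id = version_id
            self.delta = delta
            self.copied_at = time.monotonic()

    def expire_replayed_changes(changes):
        """Free entries that came from the fake monitor once they are older
        than EXPIRY_SECONDS; changes committed after the restart are handled
        by the normal monitor flushing logic instead."""
        now = time.monotonic()
        return [c for c in changes if now - c.copied_at < EXPIRY_SECONDS]
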
> Could you explain a little more about the multiversioned data structure idea? I am not sure I understand it correctly.

Well, how do you plan to maintain the multiple copies of the database that will be necessary? Presumably each monitor needs a copy of a slightly different database. Or maybe I just don't understand your plan yet.

> > If we do this, we need solid and thorough tests to ensure that it's reliable. It might make sense to start by thinking through the tests, rather than the implementation.
>
> Good point, I may start with tests first.
