On 14/07/2021 14:50, Ilya Maximets wrote:
> Replication can be used to scale out read-only access to the database.
> But there are clients that are not read-only, but read-mostly.
> One of the main examples is ovn-controller, which mostly monitors
> updates from the Southbound DB, but needs to claim ports by sending
> transactions that change some database tables.
>
> The Southbound database serves lots of connections: all connections
> from ovn-controllers and some service connections from cloud
> infrastructure, e.g. some OpenStack agents are monitoring updates.
> At a high scale and with a big database, ovsdb-server spends
> too much time processing monitor updates, and it's required
> to move this load somewhere else.  This patch-set aims to introduce
> the functionality required to scale out read-mostly connections by
> introducing a new OVSDB 'relay' service model.
>
> In this new service model, ovsdb-server connects to an existing OVSDB
> server and maintains an in-memory copy of the database.  It serves
> read-only transactions and monitor requests on its own, but forwards
> write transactions to the relay source.
>
> Key differences from the active-backup replication:
> - support for "write" transactions.
> - no on-disk storage (hence, probably, faster operation).
> - support for multiple remotes (connect to the clustered db).
> - doesn't try to keep the connection alive as long as possible, but
>   instead reconnects quickly to other remotes to avoid missing updates.
> - no need to know the complete database schema beforehand,
>   only the schema name.
> - can be used along with other standalone and clustered databases
>   by the same ovsdb-server process (doesn't turn the whole
>   jsonrpc server into read-only mode).
> - supports the modern version of monitors (monitor_cond_since),
>   because it is based on ovsdb-cs.
> - can be chained, i.e. multiple relays can be connected
>   one to another in a row or in a tree-like form.
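For illustration, the chaining described in the last point might look like this; this is only a sketch, and the addresses, ports, and socket paths here are hypothetical, not taken from the series:

```shell
# First-level relay, connected directly to the (hypothetical) main
# clustered Sb DB servers; serves clients on TCP port 16642:
ovsdb-server --remote=ptcp:16642 \
    relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642

# Second-level relay, connected to the first-level relay instead of the
# main cluster, forming a tree of relays:
ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:10.0.0.4:16642
```

Reads and monitors terminate at the nearest relay, while writes are forwarded up the chain to the relay source.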
>
> Bringing all of the above functionality to the existing active-backup
> replication doesn't look right, as it would make it less reliable
> for the actual backup use case, and it would also be much
> harder from the implementation point of view, because the current
> replication code is not based on ovsdb-cs or idl, so all the required
> features would likely be duplicated, or replication would be fully
> re-written on top of ovsdb-cs with severe modifications of the former.
>
> Relay is somewhere in the middle between active-backup replication and
> the clustered model, taking a lot from both, and is therefore hard to
> implement on top of either of them.
>
> To run ovsdb-server in relay mode, the user simply needs to run:
>
>   ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>
>
> e.g.
>
>   ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642
>
> More details and examples are in the documentation in the last patch
> of the series.
>
> I actually tried to implement transaction forwarding on top of
> active-backup replication in v1 of this series, but it required
> a lot of tricky changes, including schema format changes, in order
> to bring the required information to the end clients, so I decided
> to fully rewrite the functionality in v2 with a different approach.
>
>
> Testing
> =======
>
> Some scale tests were performed with OVSDB Relays that mimic OVN
> workloads with ovn-kubernetes.
> Tests were performed with ovn-heater (https://github.com/dceara/ovn-heater)
> on the scenario ocp-120-density-heavy:
>
>   https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
>
> In short, the test gradually creates a lot of OVN resources and
> checks that the network is configured correctly (by pinging different
> namespaces).  The test includes 120 chassis (created by
> ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, 3 LBs
> with 15625 VIPs each, attached to all node LSes, etc.  The test was
> performed with monitor-all=true.
>
> Note 1:
> - Memory consumption is checked at the end of a test in the following
>   way: 1) check RSS, 2) compact the database, 3) check RSS again.
>   It was observed that ovn-controllers in this test are fairly slow,
>   and a backlog builds up on monitors because ovn-controllers are
>   not able to receive updates fast enough.  This contributes to the
>   RSS of the process, especially in combination with a glibc bug (glibc
>   doesn't free fastbins back to the system).  Memory trimming on
>   compaction is enabled in the test, so after compaction we can
>   see a more or less real value of the RSS at the end of the test
>   without backlog noise.  (Compaction on a relay in this case is
>   just a plain malloc_trim().)
>
> Note 2:
> - I didn't collect memory consumption (RSS) after compaction for the
>   test with 10 relays, because I got the idea only after the test
>   was finished and another one had already started, and a run takes
>   a significant amount of time.  So, values marked with a star (*)
>   are an approximation based on results from other tests, hence
>   might not be fully correct.
>
> Note 3:
> - 'Max. poll' is the maximum of the 'long poll intervals' logged by
>   ovsdb-server during the test.  Poll intervals that involved database
>   compaction (huge disk writes) are the same in all tests and are excluded
>   from the results.  (Sb DB size in the test is 256MB, fully
>   compacted.)  'Number of intervals' is just the number of logged
>   unreasonably long poll intervals.
>   Also note that ovsdb-server logs only compactions that took > 1s,
>   so poll intervals that involved compaction but took under 1s could not
>   be reliably excluded from the test results.
>   'central' - main Sb DB servers.
>   'relay'   - relay servers connected to central ones.
>   'before'/'after' - RSS before and after compaction + malloc_trim().
>   'time'    - the total time the process spent in the Running state.
>
>
> Baseline (3 main servers, 0 relays):
> ++++++++++++++++++++++++++++++++++++++++
>
>               RSS
> central   before     after    clients   time    Max. poll  Number of intervals
>           7552924   3828848     ~41    109:50     5882         1249
>           7342468   4109576     ~43    108:37     5717         1169
>           5886260   4109496     ~39     96:31     4990         1233
> ---------------------------------------------------------------------
>               20G       12G     126    314:58     5882         3651
>
> 3x3 (3 main servers, 3 relays):
> +++++++++++++++++++++++++++++++
>
>               RSS
> central   before     after    clients   time    Max. poll  Number of intervals
>           6228176   3542164    ~1-5     36:53     2174          358
>           5723920   3570616    ~1-5     24:03     2205          382
>           5825420   3490840    ~1-5     35:42     2214          309
> ---------------------------------------------------------------------
>             17.7G     10.6G      9      96:38     2214         1049
>
> relay     before     after    clients   time    Max. poll  Number of intervals
>           2174328    726576     37      69:44     5216          627
>           2122144    729640     32      63:52     4767          625
>           2824160    751384     51      89:09     5980          627
> ---------------------------------------------------------------------
>                7G      2.2G    120     222:45     5980         1879
>
> Total:
> =====================================================================
>             24.7G     12.8G    129     319:23     5980         2928
>
> 3x10 (3 main servers, 10 relays):
> +++++++++++++++++++++++++++++++++
>
>               RSS
> central   before     after    clients   time    Max. poll  Number of intervals
>           6190892      ---     ~1-6     42:43     2041          634
>           5687576      ---     ~1-5     27:09     2503          405
>           5958432      ---     ~1-7     40:44     2193          450
> ---------------------------------------------------------------------
>             17.8G     ~10G*     16     110:36     2503         1489
>
> relay     before     after    clients   time    Max. poll  Number of intervals
>           1331256      ---      9       22:58     1327          140
>           1218288      ---     13       28:28     1840          621
>           1507644      ---     19       41:44     2869          623
>           1257692      ---     12       27:40     1532          517
>           1125368      ---      9       22:23     1148          105
>           1380664      ---     16       35:04     2422          619
>           1087248      ---      6       18:18     1038            6
>           1277484      ---     14       34:02     2392          616
>           1209936      ---     10       25:31     1603          451
>           1293092      ---     12       29:03     2071          621
> ---------------------------------------------------------------------
>             12.6G     5-7G*   120     285:11     2869         4319
>
> Total:
> =====================================================================
>             30.4G   15-17G*   136     395:47     2869         5808
>
>
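The RSS measurement procedure from Note 1 (check RSS, compact, check RSS again) can be sketched roughly as follows.  The `ovsdb-server/compact` appctl command is real, but the database name, the single-instance assumption, and the helper names are hypothetical, for illustration only:

```shell
# Report a process's resident set size in kB, read from /proc.
rss_kb() { awk '/VmRSS/ {print $2}' "/proc/$1/status"; }

# Hypothetical helper following the three steps from Note 1.
check_memory() {
    pid=$(pidof ovsdb-server)                  # assumes a single instance
    echo "RSS before: $(rss_kb "$pid") kB"     # 1) check RSS
    ovs-appctl -t ovsdb-server \
        ovsdb-server/compact OVN_Southbound    # 2) compact the database
    echo "RSS after:  $(rss_kb "$pid") kB"     # 3) check RSS again
}
```

On a relay, which has no on-disk storage, step 2 amounts to just the malloc_trim() mentioned above, so the "after" value approximates the real working set without monitor-backlog noise.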
Thanks for running these and sharing the data.  It looks promising, and it
is not a hugely intrusive change to the code base, so it looks good to me.

Acked-by: Mark D. Gray <mark.d.g...@redhat.com>

> Conclusions from the test:
> ==========================
>
> 1. Relays relieve a lot of pressure from the main Sb DB servers.
>    In my testing, total CPU time on the main servers goes down from 314
>    to 96-110 minutes, which is 3 times lower.
>    During the test, the number of registered 'unreasonably long poll
>    interval's on the main servers goes down by 3-4 times.  At the same
>    time, the maximum duration of these intervals goes down by a factor
>    of 2.5.  The factor should also be higher with an increased number
>    of clients.
>
> 2. Since the number of clients is significantly lower, memory consumption
>    of the main Sb DB servers also goes down by ~12%.
>
> 3. For the 3x3 test, total memory consumed by all processes increased
>    only by 6%, and total CPU usage increased by 1.2%.  Poll intervals
>    on relay servers are comparable to poll intervals on main servers
>    with no relays, but poll intervals on the main servers are significantly
>    better (see conclusion #1).  In general, it seems that for this
>    test, running 3 relays next to the 3 main Sb DB servers significantly
>    increases cluster stability and responsiveness without a noticeable
>    increase in memory or CPU usage.
>
> 4. For the 3x10 test, total memory consumed by all processes increased
>    by ~50-70%*, and total CPU usage increased by 26% in comparison with the
>    baseline setup.  At the same time, poll intervals on both the main
>    and relay servers are lower by a factor of 2-4 (depending on the
>    particular server).  In general, the cluster with 10 relays is much more
>    stable and responsive, with reasonably low memory consumption and
>    CPU time overhead.
>
>
>
> Future work:
> - Add support for transaction history (it could just be inherited
>   from the transaction ids received from the relay source).  This
>   will allow clients to utilize monitor_cond_since while working
>   with a relay.
> - Possibly try to inherit min_index from the relay source to give
>   clients the ability to detect relays with stale data.
> - Probably, add support for both of the above to standalone databases,
>   so relays will be able to inherit not only from clustered ones.
>
>
> Version 3:
> - Fixed an issue with an incorrect schema equality check.
> - Fixed a transaction leak if inconsistent data is received from the
>   source.
> - Minor fixes for style, wording and typos.
>
> Version 2:
> - Dropped the implementation on top of active-backup replication.
> - Implemented the new 'relay' service model.
> - Updated documentation and wrote a separate topic with examples
>   and ascii-graphics.  That's why v2 seems larger.
>
> Ilya Maximets (9):
>   jsonrpc-server: Wake up jsonrpc session if there are completed
>     triggers.
>   ovsdb: storage: Allow setting the name for the unbacked storage.
>   ovsdb: table: Expose functions to execute operations on ovsdb tables.
>   ovsdb: row: Add support for xor-based row updates.
>   ovsdb: New ovsdb 'relay' service model.
>   ovsdb: relay: Add support for transaction forwarding.
>   ovsdb: relay: Reflect connection status in _Server database.
>   ovsdb: Make clients aware of relay service model.
>   docs: Add documentation for ovsdb relay mode.
>
>  Documentation/automake.mk            |   1 +
>  Documentation/ref/ovsdb.7.rst        |  62 ++++-
>  Documentation/topics/index.rst       |   1 +
>  Documentation/topics/ovsdb-relay.rst | 124 +++++++++
>  NEWS                                 |   3 +
>  lib/ovsdb-cs.c                       |  15 +-
>  ovsdb/_server.ovsschema              |   7 +-
>  ovsdb/_server.xml                    |  35 +--
>  ovsdb/automake.mk                    |   4 +
>  ovsdb/execution.c                    |  18 +-
>  ovsdb/file.c                         |   2 +-
>  ovsdb/jsonrpc-server.c               |   3 +-
>  ovsdb/ovsdb-client.c                 |   2 +-
>  ovsdb/ovsdb-server.1.in              |  27 +-
>  ovsdb/ovsdb-server.c                 | 105 +++++---
>  ovsdb/ovsdb.c                        |  11 +
>  ovsdb/ovsdb.h                        |   9 +-
>  ovsdb/relay.c                        | 385 +++++++++++++++++++++++++++
>  ovsdb/relay.h                        |  38 +++
>  ovsdb/replication.c                  |  83 +-----
>  ovsdb/row.c                          |  30 ++-
>  ovsdb/row.h                          |   6 +-
>  ovsdb/storage.c                      |  13 +-
>  ovsdb/storage.h                      |   2 +-
>  ovsdb/table.c                        |  70 +++++
>  ovsdb/table.h                        |  14 +
>  ovsdb/transaction-forward.c          | 182 +++++++++++++
>  ovsdb/transaction-forward.h          |  44 +++
>  ovsdb/trigger.c                      |  49 +++-
>  ovsdb/trigger.h                      |  41 +--
>  python/ovs/db/idl.py                 |  16 ++
>  tests/ovsdb-server.at                |  85 +++++-
>  tests/test-ovsdb.c                   |   6 +-
>  33 files changed, 1297 insertions(+), 196 deletions(-)
>  create mode 100644 Documentation/topics/ovsdb-relay.rst
>  create mode 100644 ovsdb/relay.c
>  create mode 100644 ovsdb/relay.h
>  create mode 100644 ovsdb/transaction-forward.c
>  create mode 100644 ovsdb/transaction-forward.h

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev