Replication can be used to scale out read-only access to the database,
but some clients are not strictly read-only; they are read-mostly.  One
of the main examples is ovn-controller, which mostly monitors updates
from the Southbound DB, but also needs to claim ports by sending
transactions that change some database tables.
Southbound database serves lots of connections: all connections from
ovn-controllers and some service connections from the cloud
infrastructure, e.g. some OpenStack agents that monitor updates.  At a
high scale and with a big database, ovsdb-server spends too much time
processing monitor updates, and it's required to move this load
somewhere else.  This patch set aims to introduce the functionality
required to scale out read-mostly connections by introducing a new
OVSDB 'relay' service model.

In this new service model, ovsdb-server connects to an existing OVSDB
server and maintains an in-memory copy of the database.  It serves
read-only transactions and monitor requests on its own, but forwards
write transactions to the relay source.

Key differences from the active-backup replication:
- Support for "write" transactions.
- No on-disk storage (probably, faster operation).
- Support for multiple remotes (connect to the clustered db).
- Doesn't try to keep the connection alive as long as possible, but
  instead reconnects quickly to other remotes to avoid missing updates.
- No need to know the complete database schema beforehand, only the
  schema name.
- Can be used along with other standalone and clustered databases by
  the same ovsdb-server process (doesn't turn the whole jsonrpc server
  into read-only mode).
- Supports the modern version of monitors (monitor_cond_since),
  because it is based on ovsdb-cs.
- Can be chained, i.e. multiple relays can be connected one to another
  in a row or in a tree-like form (see the sketch below).

Bringing all of the above functionality to the existing active-backup
replication doesn't look right, as it would make it less reliable for
the actual backup use case.  It would also be much harder from the
implementation point of view, because the current replication code is
not based on ovsdb-cs or the IDL, so all the required features would
likely be duplicated, or replication would have to be fully rewritten
on top of ovsdb-cs with severe modifications.  Relay sits somewhere in
the middle between active-backup replication and the clustered model,
taking a lot from both, and is therefore hard to implement on top of
either of them.

To run ovsdb-server in relay mode, a user simply needs to run:

  ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>

e.g.

  ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642

More details and examples are in the documentation in the last patch
of the series.
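A relay learns the schema from its source and keeps only an in-memory
copy, so adding more layers is just a matter of pointing one
ovsdb-server at another.  Below is a minimal sketch of a chained
setup; the socket paths, the ports and the comma-separated list of
cluster remotes are illustrative assumptions here, the documentation
patch in this series has the authoritative syntax:

  # First-level relay: follows the clustered Southbound DB directly
  # (assuming multiple remotes are given as a comma-separated list).
  ovsdb-server --remote=punix:/tmp/sb-relay-1.sock \
      relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642

  # Second-level relay: chained off the first relay instead of the cluster.
  ovsdb-server --remote=ptcp:16642 \
      relay:OVN_Southbound:unix:/tmp/sb-relay-1.sock

In practice, each instance would also need its own --pidfile and
--unixctl paths to avoid clashing with other ovsdb-server processes
on the same host.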
I actually tried to implement transaction forwarding on top of
active-backup replication in v1 of this series, but it required a lot
of tricky changes, including schema format changes, in order to bring
the required information to the end clients, so I decided to fully
rewrite the functionality in v2 with a different approach.

Testing
=======

Some scale tests were performed with OVSDB relays that mimic OVN
workloads with ovn-kubernetes.  Tests were performed with ovn-heater
(https://github.com/dceara/ovn-heater) on the ocp-120-density-heavy
scenario:
https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml

In short, the test gradually creates a lot of OVN resources and checks
that the network is configured correctly (by pinging different
namespaces).  The test includes 120 chassis (created by
ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, 3 LBs
with 15625 VIPs each attached to all node LSes, etc.  The test was
performed with monitor-all=true.

Note 1:
- Memory consumption is checked at the end of a test in the following
  way: 1) check RSS, 2) compact the database, 3) check RSS again.
  It's observed that ovn-controllers in this test are fairly slow and
  a backlog builds up on monitors, because ovn-controllers are not
  able to receive updates fast enough.  This contributes to the RSS of
  the process, especially in combination with a glibc bug (glibc
  doesn't free fastbins back to the system).  Memory trimming on
  compaction is enabled in the test, so after compaction we can see a
  more or less real value of the RSS at the end of the test without
  the backlog noise.  (Compaction on a relay in this case is just a
  plain malloc_trim().)
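As an illustration, the check above can be scripted roughly as
follows; the control socket and pidfile paths are placeholders for
whatever the deployment uses, and memory trimming is switched on so
that compaction actually returns freed memory to the system:

  SB_CTL=/var/run/ovn/ovnsb_db.ctl         # illustrative control socket path
  SB_PID=$(cat /var/run/ovn/ovnsb_db.pid)  # illustrative pidfile path

  # Return freed heap memory to the system on every compaction.
  ovs-appctl -t "$SB_CTL" ovsdb-server/memory-trim-on-compaction on

  grep VmRSS "/proc/$SB_PID/status"              # 1) RSS before
  ovs-appctl -t "$SB_CTL" ovsdb-server/compact   # 2) compact the database
  grep VmRSS "/proc/$SB_PID/status"              # 3) RSS after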
Note 2:
- I didn't collect memory consumption (RSS) after compaction for the
  test with 10 relays, because I got the idea only after that test had
  finished and another one had already started, and a run takes a
  significant amount of time.  So, values marked with a star (*) are
  an approximation based on results from other tests and hence might
  not be fully correct.

Note 3:
- 'Max. poll' is the maximum of the 'long poll intervals' logged by
  ovsdb-server during the test, in milliseconds.  Poll intervals that
  involved database compaction (huge disk writes) are the same in all
  tests and are excluded from the results.  (Sb DB size in the test is
  256MB, fully compacted.)  'Number of intervals' is just the number
  of logged unreasonably long poll intervals.  Also note that
  ovsdb-server logs only compactions that took > 1s, so poll intervals
  that involved compaction but took under 1s cannot be reliably
  excluded from the test results.

'central' - main Sb DB servers.
'relay'   - relay servers connected to the central ones.
'before'/'after' - RSS (in kB) before and after compaction + malloc_trim().
'time'    - total time (min:sec) the process spent in the Running state.

Baseline (3 main servers, 0 relays):
++++++++++++++++++++++++++++++++++++++++

                  RSS
 central    before     after   clients     time   Max. poll   Number of intervals
           7552924   3828848       ~41   109:50        5882           1249
           7342468   4109576       ~43   108:37        5717           1169
           5886260   4109496       ~39    96:31        4990           1233
 --------------------------------------------------------------------------
               20G       12G       126   314:58        5882           3651

3x3 (3 main servers, 3 relays):
+++++++++++++++++++++++++++++++

                  RSS
 central    before     after   clients     time   Max. poll   Number of intervals
           6228176   3542164      ~1-5    36:53        2174            358
           5723920   3570616      ~1-5    24:03        2205            382
           5825420   3490840      ~1-5    35:42        2214            309
 --------------------------------------------------------------------------
             17.7G     10.6G         9    96:38        2214           1049

                  RSS
 relay      before     after   clients     time   Max. poll   Number of intervals
           2174328    726576        37    69:44        5216            627
           2122144    729640        32    63:52        4767            625
           2824160    751384        51    89:09        5980            627
 --------------------------------------------------------------------------
                7G      2.2G       120   222:45        5980           1879

Total:
==========================================================================
             24.7G     12.8G       129   319:23        5980           2928

3x10 (3 main servers, 10 relays):
+++++++++++++++++++++++++++++++++

                  RSS
 central    before     after   clients     time   Max. poll   Number of intervals
           6190892       ---      ~1-6    42:43        2041            634
           5687576       ---      ~1-5    27:09        2503            405
           5958432       ---      ~1-7    40:44        2193            450
 --------------------------------------------------------------------------
             17.8G     ~10G*        16   110:36        2503           1489

                  RSS
 relay      before     after   clients     time   Max. poll   Number of intervals
           1331256       ---         9    22:58        1327            140
           1218288       ---        13    28:28        1840            621
           1507644       ---        19    41:44        2869            623
           1257692       ---        12    27:40        1532            517
           1125368       ---         9    22:23        1148            105
           1380664       ---        16    35:04        2422            619
           1087248       ---         6    18:18        1038              6
           1277484       ---        14    34:02        2392            616
           1209936       ---        10    25:31        1603            451
           1293092       ---        12    29:03        2071            621
 --------------------------------------------------------------------------
             12.6G     5-7G*       120   285:11        2869           4319

Total:
==========================================================================
             30.4G   15-17G*       136   395:47        2869           5808

Conclusions from the test:
==========================

1. Relays relieve a lot of pressure from the main Sb DB servers.  In
   my testing, the total CPU time on the main servers goes down from
   314 to 96-110 minutes, which is about 3 times lower.  During the
   test, the number of registered 'unreasonably long poll interval'
   messages on the main servers goes down by 3-4 times.  At the same
   time, the maximum duration of these intervals goes down by a factor
   of 2.5.  The factor should be even higher with an increased number
   of clients.

2. Since the number of clients is significantly lower, memory
   consumption of the main Sb DB servers also goes down by ~12%.

3. For the 3x3 test, the total memory consumed by all processes
   increased by only 6%, and the total CPU usage increased by 1.2%.
   Poll intervals on the relay servers are comparable to poll
   intervals on the main servers with no relays, but poll intervals on
   the main servers are significantly better (see conclusion #1).
   In general, it seems that for this test running 3 relays next to
   3 main Sb DB servers significantly increases cluster stability and
   responsiveness without a noticeable increase in memory or CPU
   usage.

4. For the 3x10 test, the total memory consumed by all processes
   increased by ~50-70%*, and the total CPU usage increased by 26%
   compared with the baseline setup.  At the same time, poll intervals
   on both the main and the relay servers are lower by a factor of 2-4
   (depending on the particular server).  In general, a cluster with
   10 relays is much more stable and responsive, with reasonably low
   memory consumption and CPU time overhead.

Future work:
- Add support for transaction history (it could just be inherited from
  the transaction ids received from the relay source).  This will
  allow clients to utilize monitor_cond_since while working with a
  relay.
- Possibly try to inherit min_index from the relay source to give
  clients the ability to detect relays with stale data.
- Probably, add support for both of the above to standalone databases,
  so relays will be able to inherit not only from clustered ones.

Version 3:
  - Fixed issue with incorrect schema equality check.
  - Fixed transaction leak if inconsistent data received from the
    source.
  - Minor fixes for style, wording and typos.

Version 2:
  - Dropped implementation on top of active-backup replication.
  - Implemented new 'relay' service model.
  - Updated documentation and wrote a separate topic with examples and
    ascii-graphics.  That's why v2 seems larger.

Ilya Maximets (9):
  jsonrpc-server: Wake up jsonrpc session if there are completed triggers.
  ovsdb: storage: Allow setting the name for the unbacked storage.
  ovsdb: table: Expose functions to execute operations on ovsdb tables.
  ovsdb: row: Add support for xor-based row updates.
  ovsdb: New ovsdb 'relay' service model.
  ovsdb: relay: Add support for transaction forwarding.
  ovsdb: relay: Reflect connection status in _Server database.
  ovsdb: Make clients aware of relay service model.
  docs: Add documentation for ovsdb relay mode.
 Documentation/automake.mk            |   1 +
 Documentation/ref/ovsdb.7.rst        |  62 ++++-
 Documentation/topics/index.rst       |   1 +
 Documentation/topics/ovsdb-relay.rst | 124 +++++++
 NEWS                                 |   3 +
 lib/ovsdb-cs.c                       |  15 +-
 ovsdb/_server.ovsschema              |   7 +-
 ovsdb/_server.xml                    |  35 +--
 ovsdb/automake.mk                    |   4 +
 ovsdb/execution.c                    |  18 +-
 ovsdb/file.c                         |   2 +-
 ovsdb/jsonrpc-server.c               |   3 +-
 ovsdb/ovsdb-client.c                 |   2 +-
 ovsdb/ovsdb-server.1.in              |  27 +-
 ovsdb/ovsdb-server.c                 | 105 +++++---
 ovsdb/ovsdb.c                        |  11 +
 ovsdb/ovsdb.h                        |   9 +-
 ovsdb/relay.c                        | 385 +++++++++++++++++++++++
 ovsdb/relay.h                        |  38 +++
 ovsdb/replication.c                  |  83 +-----
 ovsdb/row.c                          |  30 ++-
 ovsdb/row.h                          |   6 +-
 ovsdb/storage.c                      |  13 +-
 ovsdb/storage.h                      |   2 +-
 ovsdb/table.c                        |  70 +++++
 ovsdb/table.h                        |  14 +
 ovsdb/transaction-forward.c          | 182 +++++++++++
 ovsdb/transaction-forward.h          |  44 +++
 ovsdb/trigger.c                      |  49 +++-
 ovsdb/trigger.h                      |  41 +--
 python/ovs/db/idl.py                 |  16 ++
 tests/ovsdb-server.at                |  85 +++++-
 tests/test-ovsdb.c                   |   6 +-
 33 files changed, 1297 insertions(+), 196 deletions(-)
 create mode 100644 Documentation/topics/ovsdb-relay.rst
 create mode 100644 ovsdb/relay.c
 create mode 100644 ovsdb/relay.h
 create mode 100644 ovsdb/transaction-forward.c
 create mode 100644 ovsdb/transaction-forward.h

--
2.31.1

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
