Hi Ilya,

On 7/14/21 6:52 PM, Ilya Maximets wrote:
> On 7/14/21 3:50 PM, Ilya Maximets wrote:
>> Replication can be used to scale out read-only access to the
>> database.  But there are clients that are not read-only, but
>> read-mostly.  One of the main examples is ovn-controller, which
>> mostly monitors updates from the Southbound DB, but needs to claim
>> ports by sending transactions that change some database tables.
>>
>> The Southbound database serves lots of connections: all the
>> connections from ovn-controllers and some service connections from
>> the cloud infrastructure, e.g. some OpenStack agents monitoring
>> updates.  At high scale and with a big database, ovsdb-server spends
>> too much time processing monitor updates, and it's required to move
>> this load somewhere else.  This patch set aims to introduce the
>> functionality required to scale out read-mostly connections by
>> introducing a new OVSDB 'relay' service model.
>>
>> In this new service model, ovsdb-server connects to an existing
>> OVSDB server and maintains an in-memory copy of the database.  It
>> serves read-only transactions and monitor requests on its own, but
>> forwards write transactions to the relay source.
>>
>> Key differences from active-backup replication:
>> - Support for "write" transactions.
>> - No on-disk storage (probably faster operation).
>> - Support for multiple remotes (connect to a clustered db).
>> - Doesn't try to keep the connection alive as long as possible, but
>>   quickly reconnects to other remotes to avoid missing updates.
>> - No need to know the complete database schema beforehand, only the
>>   schema name.
>> - Can be used along with other standalone and clustered databases
>>   served by the same ovsdb-server process (doesn't turn the whole
>>   jsonrpc server into read-only mode).
>> - Supports the modern version of monitors (monitor_cond_since),
>>   because it is based on ovsdb-cs.
>> - Can be chained, i.e. multiple relays can be connected one to
>>   another in a row or in a tree-like form.
>>
>> Bringing all of the above functionality to the existing
>> active-backup replication doesn't look right, as it would make
>> replication less reliable for the actual backup use case.  It would
>> also be much harder from the implementation point of view, because
>> the current replication code is not based on ovsdb-cs or idl, so all
>> the required features would likely be duplicated, or replication
>> would have to be fully re-written on top of ovsdb-cs with severe
>> modifications of the former.
>>
>> Relay sits somewhere in the middle between active-backup replication
>> and the clustered model, taking a lot from both, and is therefore
>> hard to implement on top of either of them.
>>
>> To run ovsdb-server in relay mode, the user simply runs:
>>
>>   ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>
>>
>> e.g.
>>
>>   ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642
>>
>> More details and examples are in the documentation in the last patch
>> of the series.
>>
>> I actually tried to implement transaction forwarding on top of
>> active-backup replication in v1 of this series, but it required a
>> lot of tricky changes, including schema format changes in order to
>> bring the required information to the end clients, so I decided to
>> fully rewrite the functionality in v2 with a different approach.
>>
>>
>> Testing
>> =======
>>
>> Some scale tests were performed with OVSDB relays, mimicking OVN
>> workloads with ovn-kubernetes.
>> Tests were performed with ovn-heater
>> (https://github.com/dceara/ovn-heater) on the ocp-120-density-heavy
>> scenario:
>>
>> https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
>>
>> In short, the test gradually creates a lot of OVN resources and
>> checks that the network is configured correctly (by pinging
>> different namespaces).
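[Editor's illustration, not part of the cover letter: the chaining
mentioned above follows directly from the relay:<schema-name>:<remotes>
syntax shown earlier — a second-level relay simply uses a first-level
relay as its remote instead of the main cluster.  The ports, socket
paths, and addresses here are hypothetical examples.]

```shell
# First-level relay: connects to the main Southbound DB on port 6642
# and serves its own clients on TCP port 16642.
ovsdb-server --remote=ptcp:16642 relay:OVN_Southbound:tcp:127.0.0.1:6642

# Second-level relay: uses the first relay as its relay source.
ovsdb-server --remote=punix:db2.sock \
             relay:OVN_Southbound:tcp:127.0.0.1:16642
```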
>> The test includes 120 chassis (created by ovn-fake-multinode),
>> 31250 LSPs spread evenly across 120 LSes, 3 LBs with 15625 VIPs
>> each, attached to all node LSes, etc.  The test was performed with
>> monitor-all=true.
>>
>> Note 1:
>> - Memory consumption is checked at the end of a test in the
>>   following way: 1) check RSS, 2) compact the database, 3) check
>>   RSS again.
>>   It's observed that ovn-controllers in this test are fairly slow
>>   and backlog builds up on monitors, because ovn-controllers are
>>   not able to receive updates fast enough.  This contributes to the
>>   RSS of the process, especially in combination with a glibc bug
>>   (glibc doesn't free fastbins back to the system).  Memory
>>   trimming on compaction is enabled in the test, so after
>>   compaction we can see a more or less real value of the RSS at the
>>   end of the test without backlog noise.  (Compaction on a relay in
>>   this case is just a plain malloc_trim().)
>>
>> Note 2:
>> - I didn't collect memory consumption (RSS) after compaction for
>>   the test with 10 relays, because I got the idea only after the
>>   test was finished and another one had already started, and a run
>>   takes a significant amount of time.  So, values marked with a
>>   star (*) are an approximation based on results from other tests,
>>   hence might not be fully correct.
>>
>> Note 3:
>> - 'Max. poll' is the maximum of the 'long poll intervals' logged by
>>   ovsdb-server during the test.  Poll intervals that involved
>>   database compaction (huge disk writes) are the same in all tests
>>   and excluded from the results.  (The Sb DB size in the test is
>>   256MB, fully compacted.)  'Number of intervals' is just the
>>   number of logged unreasonably long poll intervals.
>>   Also note that ovsdb-server logs only compactions that took > 1s,
>>   so poll intervals that involved compaction but took under 1s
>>   cannot be reliably excluded from the test results.
>>   'central' - main Sb DB servers.
>>   'relay'   - relay servers connected to the central ones.
>>   'before'/'after' - RSS before and after compaction + malloc_trim().
>>   'time'    - total time the process spent in the Running state.
>>
>>
>> Baseline (3 main servers, 0 relays):
>> ++++++++++++++++++++++++++++++++++++++++
>>
>>                RSS
>> central   before    after    clients  time    Max. poll  Number of intervals
>>           7552924   3828848  ~41      109:50  5882       1249
>>           7342468   4109576  ~43      108:37  5717       1169
>>           5886260   4109496  ~39       96:31  4990       1233
>> ---------------------------------------------------------------------
>>           20G       12G      126      314:58  5882       3651
>>
>> 3x3 (3 main servers, 3 relays):
>> +++++++++++++++++++++++++++++++
>>
>>                RSS
>> central   before    after    clients  time    Max. poll  Number of intervals
>>           6228176   3542164  ~1-5      36:53  2174        358
>>           5723920   3570616  ~1-5      24:03  2205        382
>>           5825420   3490840  ~1-5      35:42  2214        309
>> ---------------------------------------------------------------------
>>           17.7G     10.6G    9         96:38  2214       1049
>>
>> relay     before    after    clients  time    Max. poll  Number of intervals
>>           2174328   726576   37        69:44  5216        627
>>           2122144   729640   32        63:52  4767        625
>>           2824160   751384   51        89:09  5980        627
>> ---------------------------------------------------------------------
>>           7G        2.2G     120      222:45  5980       1879
>>
>> Total:
>> =====================================================================
>>           24.7G     12.8G    129      319:23  5980       2928
>>
>> 3x10 (3 main servers, 10 relays):
>> +++++++++++++++++++++++++++++++++
>>
>>                RSS
>> central   before    after    clients  time    Max. poll  Number of intervals
>>           6190892   ---      ~1-6      42:43  2041        634
>>           5687576   ---      ~1-5      27:09  2503        405
>>           5958432   ---      ~1-7      40:44  2193        450
>> ---------------------------------------------------------------------
>>           17.8G     ~10G*    16       110:36  2503       1489
>>
>> relay     before    after    clients  time    Max. poll  Number of intervals
>>           1331256   ---       9        22:58  1327        140
>>           1218288   ---      13        28:28  1840        621
>>           1507644   ---      19        41:44  2869        623
>>           1257692   ---      12        27:40  1532        517
>>           1125368   ---       9        22:23  1148        105
>>           1380664   ---      16        35:04  2422        619
>>           1087248   ---       6        18:18  1038          6
>>           1277484   ---      14        34:02  2392        616
>>           1209936   ---      10        25:31  1603        451
>>           1293092   ---      12        29:03  2071        621
>> ---------------------------------------------------------------------
>>           12.6G     5-7G*    120      285:11  2869       4319
>>
>> Total:
>> =====================================================================
>>           30.4G     15-17G*  136      395:47  2869       5808
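[Editor's note: a quick back-of-the-envelope cross-check of the totals
in the tables above, recomputing the headline ratios discussed in the
conclusions from the 'time' and 'after' columns.  This script is my
own, not part of the series.]

```python
def minutes(mm_ss):
    """Convert a 'minutes:seconds' CPU-time string to minutes."""
    m, s = mm_ss.split(":")
    return int(m) + int(s) / 60

# Total CPU time of the central servers, from the table totals above.
baseline_central_cpu = minutes("314:58")
relay3_central_cpu = minutes("96:38")    # 3x3 setup
relay10_central_cpu = minutes("110:36")  # 3x10 setup

# Central CPU time drops roughly 3x with relays in place.
print(baseline_central_cpu / relay3_central_cpu)   # ~3.26
print(baseline_central_cpu / relay10_central_cpu)  # ~2.85

# 3x10 total RSS after compaction (estimated 15-17G) vs. the 12G
# baseline: a ~25-40% increase, matching the self-correction below.
print((15 / 12 - 1) * 100, (17 / 12 - 1) * 100)    # 25.0 41.666...

# Total CPU overhead of the 3x10 setup vs. baseline: ~26%.
total_3x10_cpu = minutes("395:47")
print((total_3x10_cpu / baseline_central_cpu - 1) * 100)  # ~25.7
```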
This is very cool, thanks for taking the time to share all this data!

>>
>> Conclusions from the test:
>> ==========================
>>
>> 1. Relays relieve a lot of pressure from the main Sb DB servers.
>>    In my testing, total CPU time on the main servers goes down from
>>    314 to 96-110 minutes, which is 3 times lower.
>>    During the test, the number of registered 'unreasonably long
>>    poll interval' warnings on the main servers goes down by 3-4
>>    times.  At the same time, the maximum duration of these
>>    intervals goes down by a factor of 2.5.  The factor should be
>>    higher with an increased number of clients.
>>
>> 2. Since the number of clients is significantly lower, memory
>>    consumption of the main Sb DB servers also goes down by ~12%.
>>
>> 3. For the 3x3 test, total memory consumed by all processes
>>    increased only by 6%, and total CPU usage increased by 1.2%.
>>    Poll intervals on relay servers are comparable to poll intervals
>>    on main servers with no relays, but poll intervals on the main
>>    servers are significantly better (see conclusion #1).  In
>>    general, it seems that for this test, running 3 relays next to
>>    the 3 main Sb DB servers significantly increases cluster
>>    stability and responsiveness without a noticeable increase in
>>    memory or CPU usage.
>>
>> 4. For the 3x10 test, total memory consumed by all processes
>>    increased by ~50-70%*, and total CPU usage increased by 26%
>>    compared with
>
> ~50-70%* should be ~25-40%*.  I miscalculated because I used 10G
> from the 3x3 test instead of 12G from the baseline.
>
>>    the baseline setup.  At the same time, poll intervals on both
>>    main and relay servers are lower by a factor of 2-4 (depending
>>    on the particular server).  In general, the cluster with 10
>>    relays is much more stable and responsive, with reasonably low
>>    memory consumption and CPU time overhead.

Nice!

>> Future work:
>> - Add support for transaction history (it could just be inherited
>>   from the transaction ids received from the relay source).  This
>>   will allow clients to utilize monitor_cond_since while working
>>   with a relay.
>> - Possibly try to inherit min_index from the relay source to give
>>   clients the ability to detect relays with stale data.
>> - Probably, add support for both of the above to standalone
>>   databases, so relays will be able to inherit not only from
>>   clustered ones.

Nit: I don't think this should block the series, but I think the above
should be added to ovsdb/TODO.rst in a follow-up patch.

I just acked the single patch I hadn't acked in v2 (7/9) and left a
minor comment on 5/9 (which can be fixed at apply time).

The series looks good to me.

Regards,
Dumitru
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
