Replication can be used to scale out read-only access to the database.
However, some clients are not read-only but read-mostly.  One of the
main examples is ovn-controller, which mostly monitors updates from
the Southbound DB, but also needs to claim ports by sending
transactions that change some database tables.

The Southbound database serves a lot of connections: all connections
from ovn-controllers and some service connections from the cloud
infrastructure, e.g. OpenStack agents monitoring updates.
At high scale and with a large database, ovsdb-server spends too much
time processing monitor updates, and this load needs to be moved
somewhere else.  This patch set introduces the functionality required
to scale out read-mostly connections by adding a new OVSDB 'relay'
service model.

In this new service model, ovsdb-server connects to an existing OVSDB
server and maintains an in-memory copy of the database.  It serves
read-only transactions and monitor requests on its own, but forwards
write transactions to the relay source.

Key differences from the active-backup replication:
- support for "write" transactions.
- no on-disk storage (probably resulting in faster operation).
- support for multiple remotes (connects to the clustered db).
- doesn't try to keep the connection alive for as long as possible,
  but reconnects to other remotes faster to avoid missing updates.
- no need to know the complete database schema beforehand,
  only the schema name.
- can be used along with other standalone and clustered databases
  served by the same ovsdb-server process (doesn't turn the whole
  jsonrpc server into read-only mode).
- supports the modern version of monitors (monitor_cond_since),
  because it is based on ovsdb-cs.
- can be chained, i.e. multiple relays can be connected one to another
  in a row or in a tree-like form (see the example right after this
  list).
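
For example, relays can be chained by pointing a second-level relay at
a first-level one, the same way the first-level relay points at the
main cluster.  A minimal sketch (addresses and ports below are purely
illustrative, not taken from the actual test setup):

  # First-level relay connected to the main clustered Sb DB.
  ovsdb-server --remote=ptcp:16642 \
      relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642

  # Second-level relay chained to the first-level one.
  ovsdb-server --remote=ptcp:26642 relay:OVN_Southbound:tcp:10.0.0.4:16642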

Bringing all of the above functionality to the existing active-backup
replication doesn't look right, as it would make it less reliable for
the actual backup use case.  It would also be much harder from an
implementation point of view, because the current replication code is
not based on ovsdb-cs or idl, so the required features would likely be
duplicated, or replication would have to be fully rewritten on top of
ovsdb-cs with severe modifications.

Relay sits somewhere in the middle between active-backup replication
and the clustered model, taking a lot from both, and is therefore hard
to implement on top of either of them.

To run ovsdb-server in relay mode, a user simply needs to run:

  ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>

e.g.

  ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642

More details and examples can be found in the documentation added by
the last patch of the series.
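
From a client's point of view, a relay looks like a regular OVSDB
server, so standard tools can be pointed at it directly.  A
hypothetical example against the relay started above (the
'external_ids' key used here is arbitrary, just for illustration):

  # Read-only operations and monitors are served by the relay itself.
  ovsdb-client dump unix:db.sock OVN_Southbound
  ovsdb-client monitor unix:db.sock OVN_Southbound SB_Global

  # Write transactions are transparently forwarded to the relay source.
  ovsdb-client transact unix:db.sock \
      '["OVN_Southbound",
        {"op": "update", "table": "SB_Global", "where": [],
         "row": {"external_ids": ["map", [["test", "relay"]]]}}]'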

I actually tried to implement transaction forwarding on top of
active-backup replication in v1 of this series, but it required
a lot of tricky changes, including schema format changes, in order
to bring the required information to the end clients, so I decided
to fully rewrite the functionality in v2 with a different approach.


 Testing
 =======

Some scale tests were performed with OVSDB relays, mimicking OVN
workloads with ovn-kubernetes.  Tests were performed with ovn-heater
(https://github.com/dceara/ovn-heater) on the ocp-120-density-heavy
scenario:

https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml

In short, the test gradually creates a lot of OVN resources and
checks that the network is configured correctly (by pinging different
namespaces).  The test includes 120 chassis (created by
ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, 3 LBs
with 15625 VIPs each attached to all node LSes, etc.  The test was
performed with monitor-all=true.

Note 1:
 - Memory consumption is checked at the end of a test in the following
   way: 1) check RSS, 2) compact the database, 3) check RSS again
   (a rough sketch of this procedure is shown below).
   It's observed that ovn-controllers in this test are fairly slow
   and a backlog builds up on monitors, because ovn-controllers are
   not able to receive updates fast enough.  This contributes to the
   RSS of the process, especially in combination with a glibc bug
   (glibc doesn't free fastbins back to the system).  Memory trimming
   on compaction is enabled in the test, so after compaction we can
   see a more or less real value of the RSS at the end of the test
   without the backlog noise.  (Compaction on a relay in this case is
   just a plain malloc_trim().)
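
   A rough sketch of that procedure, assuming a single ovsdb-server
   process and the default control target name (the exact socket/target
   name and paths are setup-specific):

     # 1) Check RSS before compaction.
     grep VmRSS /proc/$(pidof ovsdb-server)/status

     # 2) Compact the database; on a relay this boils down to a plain
     #    malloc_trim().  Memory trimming on compaction is enabled via
     #    the 'ovsdb-server/memory-trim-on-compaction on' appctl command.
     ovs-appctl -t ovsdb-server ovsdb-server/compact

     # 3) Check RSS again.
     grep VmRSS /proc/$(pidof ovsdb-server)/status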

Note 2:
 - I didn't collect memory consumption (RSS) after compaction for the
   test with 10 relays, because I got the idea only after that test
   had finished and another one had already started, and a run takes a
   significant amount of time.  So, values marked with a star (*) are
   an approximation based on results from other tests and hence might
   not be fully correct.

Note 3:
 - 'Max. poll' is the maximum of the 'unreasonably long poll interval'
   values logged by ovsdb-server during the test.  Poll intervals that
   involved database compaction (huge disk writes) are the same in all
   tests and are excluded from the results.  (The Sb DB size in the
   test is 256MB, fully compacted.)  'Number of intervals' is just the
   number of logged unreasonably long poll intervals.  (A rough way to
   extract these numbers from the logs is sketched below.)
   Also note that ovsdb-server logs only compactions that took > 1s,
   so poll intervals that involved compaction but took under 1s cannot
   be reliably excluded from the test results.
   'central' - main Sb DB servers.
   'relay'   - relay servers connected to the central ones.
   'before'/'after' - RSS before and after compaction + malloc_trim().
   'time' - total time the process spent in the Running state.
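
   For reference, a rough way to pull these numbers out of the logs
   (the message format is 'Unreasonably long <N>ms poll interval ...';
   the log file location is setup-specific):

     # Number of logged long poll intervals.
     grep -c 'Unreasonably long' ovsdb-server.log

     # Maximum logged poll interval, in ms.
     grep -o 'Unreasonably long [0-9]*ms' ovsdb-server.log \
         | grep -o '[0-9][0-9]*' | sort -n | tail -1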


Baseline (3 main servers, 0 relays):
++++++++++++++++++++++++++++++++++++++++

               RSS
central  before    after    clients  time     Max. poll   Number of intervals
         7552924   3828848   ~41     109:50   5882        1249
         7342468   4109576   ~43     108:37   5717        1169
         5886260   4109496   ~39      96:31   4990        1233
         ---------------------------------------------------------------------
             20G       12G   126     314:58   5882        3651

3x3 (3 main servers, 3 relays):
+++++++++++++++++++++++++++++++

                RSS
central  before    after    clients  time     Max. poll   Number of intervals
         6228176   3542164   ~1-5    36:53    2174        358
         5723920   3570616   ~1-5    24:03    2205        382
         5825420   3490840   ~1-5    35:42    2214        309
         ---------------------------------------------------------------------
           17.7G     10.6G      9    96:38    2214        1049

relay    before    after    clients  time     Max. poll   Number of intervals
         2174328    726576    37     69:44    5216        627
         2122144    729640    32     63:52    4767        625
         2824160    751384    51     89:09    5980        627
         ---------------------------------------------------------------------
              7G      2.2G    120   222:45    5980        1879

Total:   =====================================================================
           24.7G     12.8G    129    319:23   5980        2928

3x10 (3 main servers, 10 relays):
+++++++++++++++++++++++++++++++++

               RSS
central  before    after    clients  time    Max. poll   Number of intervals
         6190892    ---      ~1-6    42:43   2041         634
         5687576    ---      ~1-5    27:09   2503         405
         5958432    ---      ~1-7    40:44   2193         450
         ---------------------------------------------------------------------
           17.8G   ~10G*       16   110:36   2503         1489

relay    before    after    clients  time    Max. poll   Number of intervals
         1331256    ---       9      22:58   1327         140
         1218288    ---      13      28:28   1840         621
         1507644    ---      19      41:44   2869         623
         1257692    ---      12      27:40   1532         517
         1125368    ---       9      22:23   1148         105
         1380664    ---      16      35:04   2422         619
         1087248    ---       6      18:18   1038           6
         1277484    ---      14      34:02   2392         616
         1209936    ---      10      25:31   1603         451
         1293092    ---      12      29:03   2071         621
         ---------------------------------------------------------------------
           12.6G    5-7G*    120    285:11   2869         4319

Total:   =====================================================================
           30.4G    15-17G*  136    395:47   2869         5808


 Conclusions from the test:
 ==========================

1. Relays relieve a lot of pressure from the main Sb DB servers.
   In my testing, the total CPU time on the main servers goes down
   from 314 to 96-110 minutes, i.e. about 3 times lower.
   During the test, the number of registered 'unreasonably long poll
   intervals' on the main servers goes down 3-4 times.  At the same
   time, the maximum duration of these intervals goes down by a factor
   of 2.5.  This factor should be even higher with an increased number
   of clients.

2. Since the number of clients is significantly lower, memory
   consumption of the main Sb DB servers also goes down, by ~12%.

3. For the 3x3 test, the total memory consumed by all processes
   increased by only 6%, and total CPU usage increased by 1.2%.  Poll
   intervals on relay servers are comparable to poll intervals on main
   servers with no relays, but poll intervals on main servers are
   significantly better (see conclusion #1).  In general, it seems
   that for this test, running 3 relays next to the 3 main Sb DB
   servers significantly increases cluster stability and
   responsiveness without a noticeable increase in memory or CPU
   usage.

4. For the 3x10 test, the total memory consumed by all processes
   increased by ~50-70%*, and total CPU usage increased by 26%
   compared with the baseline setup.  At the same time, poll intervals
   on both main and relay servers are lower by a factor of 2-4
   (depending on the particular server).  In general, the cluster with
   10 relays is much more stable and responsive, with reasonably low
   memory consumption and CPU time overhead.



Future work:
- Add support for transaction history (it could simply be inherited
  from the transaction ids received from the relay source).  This
  will allow clients to utilize monitor_cond_since while working
  with a relay.
- Possibly try to inherit min_index from the relay source to give
  clients the ability to detect relays with stale data.
- Probably, add support for both of the above in standalone databases,
  so relays will be able to inherit them not only from clustered
  databases.


Version 3:
  - Fixed an issue with an incorrect schema equality check.
  - Fixed a transaction leak when inconsistent data is received from
    the source.
  - Minor fixes for style, wording and typos.

Version 2:
  - Dropped implementation on top of active-backup replication.
  - Implemented new 'relay' service model.
  - Updated the documentation and wrote a separate topic with examples
    and ascii graphics.  That's why v2 seems larger.

Ilya Maximets (9):
  jsonrpc-server: Wake up jsonrpc session if there are completed
    triggers.
  ovsdb: storage: Allow setting the name for the unbacked storage.
  ovsdb: table: Expose functions to execute operations on ovsdb tables.
  ovsdb: row: Add support for xor-based row updates.
  ovsdb: New ovsdb 'relay' service model.
  ovsdb: relay: Add support for transaction forwarding.
  ovsdb: relay: Reflect connection status in _Server database.
  ovsdb: Make clients aware of relay service model.
  docs: Add documentation for ovsdb relay mode.

 Documentation/automake.mk            |   1 +
 Documentation/ref/ovsdb.7.rst        |  62 ++++-
 Documentation/topics/index.rst       |   1 +
 Documentation/topics/ovsdb-relay.rst | 124 +++++++++
 NEWS                                 |   3 +
 lib/ovsdb-cs.c                       |  15 +-
 ovsdb/_server.ovsschema              |   7 +-
 ovsdb/_server.xml                    |  35 +--
 ovsdb/automake.mk                    |   4 +
 ovsdb/execution.c                    |  18 +-
 ovsdb/file.c                         |   2 +-
 ovsdb/jsonrpc-server.c               |   3 +-
 ovsdb/ovsdb-client.c                 |   2 +-
 ovsdb/ovsdb-server.1.in              |  27 +-
 ovsdb/ovsdb-server.c                 | 105 +++++---
 ovsdb/ovsdb.c                        |  11 +
 ovsdb/ovsdb.h                        |   9 +-
 ovsdb/relay.c                        | 385 +++++++++++++++++++++++++++
 ovsdb/relay.h                        |  38 +++
 ovsdb/replication.c                  |  83 +-----
 ovsdb/row.c                          |  30 ++-
 ovsdb/row.h                          |   6 +-
 ovsdb/storage.c                      |  13 +-
 ovsdb/storage.h                      |   2 +-
 ovsdb/table.c                        |  70 +++++
 ovsdb/table.h                        |  14 +
 ovsdb/transaction-forward.c          | 182 +++++++++++++
 ovsdb/transaction-forward.h          |  44 +++
 ovsdb/trigger.c                      |  49 +++-
 ovsdb/trigger.h                      |  41 +--
 python/ovs/db/idl.py                 |  16 ++
 tests/ovsdb-server.at                |  85 +++++-
 tests/test-ovsdb.c                   |   6 +-
 33 files changed, 1297 insertions(+), 196 deletions(-)
 create mode 100644 Documentation/topics/ovsdb-relay.rst
 create mode 100644 ovsdb/relay.c
 create mode 100644 ovsdb/relay.h
 create mode 100644 ovsdb/transaction-forward.c
 create mode 100644 ovsdb/transaction-forward.h

-- 
2.31.1
