Hi Ben/Numan:

So I went ahead to try out clustered db in scale env and results are not as
expected.
OVS branch used is master;
commit-id(2062840612904ff0873d46b2f4f226bc23f2296d)

Setup is uing 3 nodes.

Also disabled inactivity_probe,
ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:
192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:
192.168.220.103:6642" set SB . external_ids:inactivity_probe=0


1. With wait_up = true for 10k lports and 1k HVs which internally uses
wait-until, it was able to create around 1k lports and rally exited due db
connection error.
Cause of failure: database connection failed is @
https://raw.githubusercontent.com/noah8713/ovn-scale-test/e398ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html

Some data from sb db logs are as below
2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms timeout at
lib/reconnect.c:643 (90% CPU usage)
2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid
23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite
966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size grew 51%
in last 35031.7 seconds, from 24748 kB to 37332 kB
2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1
monitors:3 sessions:431
2018-04-05T06:32:27.956Z|00074|raft|WARN|ignoring vote request received as
leader
2018-04-05T06:32:27.958Z|00075|timeval|WARN|Unreasonably long 1278ms poll
interval (1044ms user, 232ms system)
2018-04-05T06:32:27.958Z|00076|timeval|WARN|faults: 61669 minor, 0 major
2018-04-05T06:32:27.958Z|00077|timeval|WARN|context switches: 0 voluntary,
3 involuntary
2018-04-05T06:32:27.958Z|00078|coverage|INFO|Event coverage, avg rate over
last: 5 seconds, last minute, last hour,  hash=1dd2af20:
2018-04-05T06:32:27.959Z|00079|coverage|INFO|hmap_pathological
 25.0/sec    24.633/sec        1.5550/sec   total: 19274
2018-04-05T06:32:27.959Z|00080|coverage|INFO|hmap_expand
4202.6/sec  4041.417/sec      578.9233/sec   total: 14158936
2018-04-05T06:32:27.959Z|00081|coverage|INFO|lockfile_lock
0.0/sec     0.000/sec        0.0000/sec   total: 1
2018-04-05T06:32:27.960Z|00082|coverage|INFO|poll_create_node
 50281.4/sec 47195.233/sec    29424.6886/sec   total: 1001238470
2018-04-05T06:32:27.960Z|00083|coverage|INFO|poll_zero_timeout
9.8/sec    12.167/sec        8.2392/sec   total: 312540
2018-04-05T06:32:27.960Z|00084|coverage|INFO|seq_change
 122.8/sec   115.967/sec       68.2761/sec   total: 2309640
2018-04-05T06:32:27.961Z|00085|coverage|INFO|pstream_open
 0.0/sec     0.000/sec        0.0000/sec   total: 4
2018-04-05T06:32:27.961Z|00086|coverage|INFO|stream_open
0.0/sec     0.000/sec        0.0000/sec   total: 9
2018-04-05T06:32:27.961Z|00087|coverage|INFO|util_xalloc
229362.2/sec 218086.767/sec    46494.5058/sec   total: 1344641131



some data from nb db logs:
2018-04-05T23:35:33.458Z|00020|raft|INFO|adding b7bd (b7bd at tcp:
192.168.220.103:6645) to cluster a36d succeeded (completed)
2018-04-05T23:53:06.360Z|00021|jsonrpc|WARN|tcp:192.168.10.10:57298:
receive error: Connection reset by peer
2018-04-05T23:53:06.360Z|00022|reconnect|WARN|tcp:192.168.10.10:57298:
connection dropped (Connection reset by peer)
2018-04-05T23:53:09.071Z|00023|jsonrpc|WARN|tcp:192.168.10.10:57348:
receive error: Connection reset by peer
2018-04-05T23:53:09.071Z|00024|reconnect|WARN|tcp:192.168.10.10:57348:
connection dropped (Connection reset by peer)
^2018-04-06T00:02:03.864Z|00076|raft|INFO|current entry eid
0b9eeff2-8377-4f6d-a907-cb4e98396d14 does not match prerequisite
277c09b4-96b3-43e9-b482-a9df29576988
2018-04-06T00:02:06.882Z|00077|poll_loop|INFO|Dropped 6187 log messages in
last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2018-04-06T00:02:06.882Z|00078|poll_loop|INFO|wakeup due to [POLLIN] on fd
31 (192.168.220.102:6645<->192.168.10.10:54782) at lib/stream-fd.c:157 (62%
CPU usage)
2018-04-06T00:02:09.130Z|00079|jsonrpc|WARN|tcp:192.168.10.10:33592:
receive error: Connection reset by peer
2018-04-06T00:02:09.131Z|00080|reconnect|WARN|tcp:192.168.10.10:33592:
connection dropped (Connection reset by peer)
2018-04-06T00:02:33.986Z|00081|poll_loop|INFO|Dropped 1509 log messages in
last 26 seconds (most recently, 24 seconds ago) due to excessive rate
2018-04-06T00:02:33.986Z|00082|poll_loop|INFO|wakeup due to [POLLIN] on fd
24 (FIFO pipe:[40290445]) at ovsdb/log.c:994 (51% CPU usage)
2018-04-06T00:02:33.986Z|00083|poll_loop|INFO|wakeup due to [POLLIN] on fd
29 (192.168.220.102:6645<->192.168.10.10:55738) at lib/stream-fd.c:157 (51%
CPU usage)



2018-04-06T00:13:48.141Z|00037|raft|INFO|rejecting term 4 < current term 5
received in vote_reply message from server b7bd
2018-04-06T00:14:21.138Z|00038|jsonrpc|WARN|tcp:192.168.10.10:33990:
receive error: Connection reset by peer
2018-04-06T00:14:21.138Z|00039|reconnect|WARN|tcp:192.168.10.10:33990:
connection dropped (Connection reset by peer)
2018-04-06T00:17:01.532Z|00040|raft|INFO|current entry eid
e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
d47dbdf1-0981-4077-baab-372b05aeb0bb
2018-04-06T00:17:01.532Z|00041|raft|INFO|current entry eid
e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
d47dbdf1-0981-4077-baab-372b05aeb0bb
2018-04-06T00:17:01.532Z|00042|raft|INFO|current entry eid
e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
d47dbdf1-0981-4077-baab-372b05aeb0bb
2018-04-06T00:17:01.533Z|00043|raft|INFO|current entry eid
e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
d47dbdf1-0981-4077-baab-372b05aeb0bb
2018-04-06T00:17:01.533Z|00044|raft|INFO|current entry eid
e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
d47dbdf1-0981-4077-baab-372b05aeb0bb
2018-04-06T00:19:32.039Z|00045|timeval|WARN|Unreasonably long 1373ms poll
interval (1340ms user, 24ms system)
2018-04-06T00:19:32.039Z|00046|timeval|WARN|faults: 319 minor, 0 major
2018-04-06T00:19:32.039Z|00047|timeval|WARN|disk: 0 reads, 368 writes
2018-04-06T00:19:32.039Z|00048|timeval|WARN|context switches: 5 voluntary,
19 involuntary
2018-04-06T00:19:32.039Z|00049|coverage|INFO|Event coverage, avg rate over
last: 5 seconds, last minute, last hour,  hash=1dd2af20:
2018-04-06T00:19:32.039Z|00050|coverage|INFO|hmap_pathological
0.0/sec     0.000/sec        3.4164/sec   total: 12306
2018-04-06T00:19:32.039Z|00051|coverage|INFO|hmap_expand
 73.6/sec    71.267/sec      729.9656/sec   total: 2628980
2018-04-06T00:19:32.039Z|00052|coverage|INFO|lockfile_lock
0.0/sec     0.000/sec        0.0003/sec   total: 1
2018-04-06T00:19:32.039Z|00053|coverage|INFO|poll_create_node
 129.6/sec   122.333/sec      866.7178/sec   total: 3120304
2018-04-06T00:19:32.039Z|00054|coverage|INFO|poll_zero_timeout
0.0/sec     0.267/sec       27.4550/sec   total: 98838
2018-04-06T00:19:32.039Z|00055|coverage|INFO|seq_change
10.8/sec    10.217/sec      135.4125/sec   total: 487498
2018-04-06T00:19:32.039Z|00056|coverage|INFO|pstream_open
 0.0/sec     0.000/sec        0.0011/sec   total: 4
2018-04-06T00:19:32.039Z|00057|coverage|INFO|stream_open
0.0/sec     0.000/sec        0.0019/sec   total: 7
2018-04-06T00:19:32.039Z|00058|coverage|INFO|util_xalloc
1542.2/sec  1488.750/sec    21346.2483/sec   total: 76872987
2018-04-06T00:19:32.039Z|00059|coverage|



To find that wait-until was only the bottleneck, I skipped wait-until and
test got completed in 2.4 hours but things did not go as expected. All the
chassis ovn-controllers crashed and ports actually did not get bind all the
way down to the compute as ovs sb db was running super hot

Below is the controller log from sample HV without wait-until.
2018-04-06T04:50:40.274Z|01703|poll_loop|INFO|Dropped 5903 log messages in
last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2018-04-06T04:50:40.274Z|01704|poll_loop|INFO|wakeup due to
[POLLIN][POLLHUP] on fd 13 (<->/root/sandbox-192.168.72.45/br-int.mgmt) at
../lib/stream-fd.c:157 (100% CPU usage)
2018-04-06T04:50:41.060Z|01705|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
clustered database server is disconnected from cluster; trying another
server
2018-04-06T04:50:41.066Z|01706|binding|INFO|Claiming lport
lport_a3060c_mlAOEq for this chassis.
2018-04-06T04:50:41.066Z|01707|binding|INFO|lport_a3060c_mlAOEq: Claiming
a2:1c:47:32:30:52 172.145.23.37
2018-04-06T04:50:41.066Z|01708|vconn_stream|ERR|send: Broken pipe
2018-04-06T04:50:41.066Z|01709|rconn|WARN|unix:/root/sandbox-192.168.72.45/br-int.mgmt:
connection dropped (Broken pipe)
2018-04-06T04:50:41.067Z|01710|rconn|INFO|unix:/root/sandbox-192.168.72.45/br-int.mgmt:
connected
2018-04-06T04:50:41.067Z|01711|rconn|WARN|unix:/root/sandbox-192.168.72.45/br-int.mgmt:
connection dropped (Broken pipe)
2018-04-06T04:50:41.067Z|01712|util|EMER|../lib/ofp-msgs.c:1062: assertion
version >= info->min_version && version <= info->max_version failed in
raw_instance_get()



Also northd sees many transaction errors without wait-until:
northd.log
2018-04-06T04:25:26.594Z|01076|ovsdb_idl|WARN|transaction error:
{"details":"Transaction causes multiple rows in \"Port_Binding\" table to
have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 104) for
index on columns \"datapath\" and \"tunnel_key\".  First row, with UUID
489f777f-d7f6-4b92-80b9-c1c64c083ea4, existed in the database before this
transaction and was not modified by the transaction.  Second row, with UUID
b59a2500-3da1-40ab-932c-5e2dd4acc3b4, was inserted by this
transaction.","error":"constraint violation"}
2018-04-06T04:25:27.202Z|01077|ovsdb_idl|WARN|transaction error:
{"details":"Transaction causes multiple rows in \"Port_Binding\" table to
have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 26) for
index on columns \"datapath\" and \"tunnel_key\".  First row, with UUID
0668b300-55f8-4bdd-b0a8-8038258d4502, was inserted by this transaction.
Second row, with UUID 63972d45-8845-4489-ac3e-916909a593c7, existed in the
database before this transaction and was not modified by the
transaction.","error":"constraint violation"}
2018-04-06T04:25:32.036Z|01078|ovsdb_idl|WARN|transaction error:
{"details":"Transaction causes multiple rows in \"Port_Binding\" table to
have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 224) for
index on columns \"datapath\" and \"tunnel_key\".  First row, with UUID
528dea47-f284-40d6-bc7e-a97a084f49eb, existed in the database before this
transaction and was not modified by the transaction.  Second row, with UUID
6fcc2500-fedb-4ee9-b54d-ea548dd36936, was inserted by this
transaction.","error":"constraint violation"}
2018-04-06T04:25:32.482Z|01079|ovsdb_idl|WARN|transaction error:
{"details":"Transaction causes multiple rows in \"Port_Binding\" table to
have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 23) for
index on columns \"datapath\" and \"tunnel_key\".  First row, with UUID
925eb001-adeb-4748-a55f-6c5812e70a7e, was inserted by this transaction.
Second row, with UUID 5ed7bbc5-b7f5-4d4e-8e99-d8c4adcb2b34, existed in the
database before this transaction and was not modified by the
transaction.","error":"constraint violation"}
2018-04-06T04:25:35.204Z|01080|ovsdb_idl|WARN|transaction error:
{"details":"Transaction causes multiple rows in \"Port_Binding\" table to
have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 18) for
index on columns \"datapath\" and \"tunnel_key\".  First row, with UUID
4b429b00-75ca-4e38-904d-80a11e6c0f38, was inserted by this transaction.
Second row, with UUID c2db03fe-18b6-4a2b-8db0-bf37ef93db92, existed in the
database before this transaction and was not modified by the
transaction.","error":"constraint violation"}
2018-04-06T04:25:35.933Z|01081|ovsdb_idl|INFO|tcp:192.168.220.101:6642:
clustered database server is not cluster leader; trying another server


To summerize:

   - ovn nb db has issues with wait-until and works fine without wait-until
   - ovn sb db causes controller to crash due to more HVs and just few
   hundred of ports with no solid root cause without wait-until. Even sbctl
   commands hangs as sb db is 100% on cpu.
   - ovn northd sees transaction error without wait-until



Will continue to troubleshoot and would love to get any
improvements/suggestions if any.


Regards,
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to