On Wed, Apr 29, 2020 at 12:17 PM Numan Siddique <[email protected]> wrote:
>
> On Wed, Apr 29, 2020 at 9:57 PM Dumitru Ceara <[email protected]> wrote:
>>
>> In some cases, if the NB/SB databases ovn-northd connects to are
>> inconsistent, ovn-northd might generate transactions that fail
>> continuously due to failed integrity checks on the SB database server.
>>
>> The first patch of the series addresses inconsistencies due to stale
>> Datapath_Binding records in the SB database.
>>
>> The second patch of the series addresses inconsistencies due to stale
>> tunnel_key values in various SB database table records.
>>
>> Reported-by: Dan Williams <[email protected]>
>> Reported-at: https://bugzilla.redhat.com/1828637
>> Signed-off-by: Dumitru Ceara <[email protected]>
>>
>> Dumitru Ceara (2):
>>   ovn-northd: Clear SB records depending on stale datapaths.
>>   ovn-northd: Fix tunnel_key allocation for SB records.
>
> Hi Dumitru,
>
> I did some testing in my ovn-fake-multinode setup. These are my observations.
>
> I created a logical switch sw0 with 4 logical ports. So the next tunnel key should be 5.
> I stopped ovn-northd and created a couple of port_binding entries manually using
> "ovn-sbctl create port_binding" with tunnel keys 5 and 6.
> I also created a logical port in sw0. Then I started ovn-northd. ovn-northd deletes the port binding
> entries added by me and creates the port_binding entry for the logical port with the tunnel_key=5
> in the same transaction.
>
> I think ovn-northd syncs the south db based on the contents of the north db.
>
> There's no harm in having your patches. But I'm not really sure if it resolves the issue we have observed.
>
> Just to brief everyone about the issue we are seeing, we see below logs in ovn-northd.
>
> *******
> 2020-04-16T23:02:33Z|00127|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (23eb9016-45f9-4158-be35-77b2713b9a0f and 7) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID e4f11a7b-09b6-454f-a125-34cc4b144ef6, had the following index values before the transaction: bdbb436e-f98c-4651-9b80-6e8b95044560 and 7. Second row, with UUID d37cc3f1-8633-440f-b145-8222a0d4723c, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
> ******
>
> And because of this constraint violation error, ovn-northd cannot further write to the sb db until it is restarted.
>
> In my opinion this can only happen if ovn-northd doesn't see the port binding row (which is actually present in the DB) in its IDL in-memory db.
> I suspect this could have happened when ovn-northd reconnects to the same master or connects to the new master and it doesn't get the proper
> updates.
>
> Maybe in this case, the IDL should request the db contents with txn id =0, so that it receives the complete dump of the db.
>
> Is it possible that ovn-northd sees a port binding with a tunnel key 'x' and still allocates the same tunnel id 'x' to a new logical port ?
> If so, then definitely your patches makes sense.
>
> @Han - Have you seen this issue in your deployments ? Do you have comments here ?
>
> Thanks
> Numan
>

I have never seen this issue before, but I am not sure whether it could
happen because of bugs. There is currently a bug fix under review:
https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/.
However, northd doesn't conditionally monitor the rows, so I am not sure
whether this is the root cause of the northd inconsistency issue
discussed here.
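
For illustration only, here is a minimal Python sketch of the scenario
in Numan's question: a client allocates a tunnel key from an in-memory
view that is missing a row, and the server rejects the transaction on
its unique (datapath, tunnel_key) index. This is a toy model, not OVN or
OVSDB code. Every name in it (Server, allocate_key, try_add_port, the
port names) is hypothetical, and the "lowest free key" policy is an
assumption made for the example, not a description of how ovn-northd
actually allocates keys.

```python
# Toy model of the failure mode; NOT OVN or OVSDB code.
# All class, function, and port names here are hypothetical.


class UniqueIndexError(Exception):
    """Stands in for the server's "constraint violation" error."""


class Server:
    """Toy database of Port_Binding-like rows, keyed by port name."""

    def __init__(self):
        self.rows = {}  # name -> (datapath, tunnel_key)

    def commit(self, name, datapath, tunnel_key):
        # Enforce uniqueness of the (datapath, tunnel_key) pair.
        for other, index in self.rows.items():
            if other != name and index == (datapath, tunnel_key):
                raise UniqueIndexError(f"{name} collides with {other} on {index}")
        self.rows[name] = (datapath, tunnel_key)


def allocate_key(view, datapath):
    """Return the lowest tunnel key unused on `datapath` in `view`.

    Assumed policy for illustration only.
    """
    used = {key for dp, key in view.values() if dp == datapath}
    key = 1
    while key in used:
        key += 1
    return key


def try_add_port(server, view, name, datapath):
    """Allocate a key from the client's view, then commit to the server."""
    key = allocate_key(view, datapath)
    try:
        server.commit(name, datapath, key)
    except UniqueIndexError:
        return False  # the view is unchanged, so a retry picks the same key
    view[name] = (datapath, key)
    return True


server = Server()
for i in range(1, 8):
    server.commit(f"lsp{i}", "sw0", i)  # keys 1..7 are taken on sw0

# A view that never received the row holding key 7, and a complete view.
stale_view = {n: r for n, r in server.rows.items() if n != "lsp7"}
fresh_view = dict(server.rows)

stale_ok = try_add_port(server, stale_view, "new-port", "sw0")
fresh_ok = try_add_port(server, fresh_view, "new-port", "sw0")
```

In this model the client with the stale view keeps choosing key 7 and
keeps failing until its view is refreshed, while the client with the
complete view picks key 8 and commits. The model says nothing about why
a view would be stale in the first place, which is the open question in
this thread.
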
I don't think we should change northd (or ovn-controller) to handle
inconsistency in ovsdb. Consistency should be expected from ovsdb, and
we should fix ovsdb/IDL when there is this kind of bug. Otherwise, there
might be too many places to fix or even redesign.

My understanding is that if the ovsdb IDL temporarily sees stale data,
the current northd/ovn-controller logic should be able to correct itself
once the data is up to date. Moreover, northd connects only to the
leader in clustered mode, which avoids the possibility of northd seeing
stale data (unless there is a bug).

To summarize, I think we need to find the root cause of the
inconsistency between IDL and server and fix it there, instead of
changing ovn-northd to accommodate the inconsistency. (Consistency is
the biggest advantage of OVSDB; it eases application implementation.)

Thanks,
Han
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
