On 4/23/20 8:22 PM, Ilya Maximets wrote:
> On 4/22/20 8:38 PM, Dumitru Ceara wrote:
>> Assume an ovsdb client connected to a database using OVSDB_MONITOR_V3
>> (i.e., the "monitor_cond_since" method) with the initial monitor
>> condition MC1.
>>
>> Assume the following two transactions are executed on the
>> ovsdb-server:
>> TXN1: "insert record R1 in table T1"
>> TXN2: "insert record R2 in table T2"
>>
>> If the client's monitor condition MC1 for table T2 matches R2, then
>> the client will receive the following update3 message:
>> method="update3", "insert record R2 in table T2", last-txn-id=TXN2
>>
>> At this point, if the presence of the new record R2 in the IDL
>> triggers the client to update its monitor condition to MC2 and add a
>> clause for table T1 which matches R1, a monitor_cond_change message
>> is sent to the server:
>> method="monitor_cond_change", "clauses from MC2"
>>
>> In normal operation the ovsdb-server will reply with a new update3
>> message of the form:
>> method="update3", "insert record R1 in table T1", last-txn-id=TXN2
>>
>> However, if the connection drops in the meantime, this last update
>> might get lost.
>>
>> During the reconnect, a new transaction that modifies the original
>> record R1 might be executed:
>> TXN3: "modify record R1 in table T1"
>>
>> When the client reconnects, it will try to perform a fast resync by
>> sending:
>> method="monitor_cond_since", "clauses from MC2", last-txn-id=TXN2
>>
>> Because TXN2 is still in the ovsdb-server transaction history, the
>> server replies with the changes from the most recent transactions
>> only, i.e., TXN3:
>> result="true", last-txn-id=TXN3, "modify record R1 in table T1"
>>
>> This causes the IDL on the client to end up in an inconsistent state,
>> because it has never seen the update that created R1.
>>
>> Such a scenario is described in:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1808580#c22
>>
>> To avoid hitting this issue, whenever a reconnect happens (either
>> due to an IDL retry or due to network issues), we clear db->last_id
>> in ovsdb_idl_restart_fsm() in any of the following cases:
>> - a monitor condition has been changed locally and the corresponding
>>   request was not sent yet (i.e., idl->data.cond_changed == true).
>> - a monitor_cond_change request is in flight.
>>
>> This ensures that updates of type "insert" that happened before the
>> last transaction known by the IDL but didn't match old monitor
>> conditions are sent upon reconnect if the monitor condition has
>> changed to include them in the meantime.
>>
>> CC: Han Zhou <[email protected]>
>> Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when connection reset.")
>> Signed-off-by: Dumitru Ceara <[email protected]>
>> ---
>>  lib/ovsdb-idl.c | 19 +++++++++++++++++++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>> index 1535ad7..67c4745 100644
>> --- a/lib/ovsdb-idl.c
>> +++ b/lib/ovsdb-idl.c
>> @@ -690,6 +690,25 @@ ovsdb_idl_send_request(struct ovsdb_idl *idl, struct jsonrpc_msg *request)
>>  static void
>>  ovsdb_idl_restart_fsm(struct ovsdb_idl *idl)
>>  {
>> +    /* If there's an outstanding request of type monitor_cond_change and
>> +     * we're in monitor_cond_since mode then we can't trust that all relevant
>> +     * updates from transaction idl->data.last_id have been received as we
>> +     * might have relaxed the monitor condition with our last request and
>> +     * might be missing previously not monitored records.
>> +     *
>> +     * Same reasoning applies for the case when a monitor condition has been
>> +     * changed locally but the monitor_cond_change request was not sent yet.
>> +     *
>> +     * In both cases, clear last_id to make sure that the next time
>> +     * monitor_cond_since is sent (i.e., after reconnect) we get the
>> +     * complete view of the database.
>> +     */
>> +    if (idl->data.cond_changed ||
>> +        (idl->request_id &&
>> +         idl->data.monitoring == OVSDB_IDL_MONITORING_COND_SINCE)) {
>> +        uuid_zero(&idl->data.last_id);
>
> As Han pointed out during the OVN IRC meeting, when there is a change
> to the SB, e.g. creating a new DP, it causes all HVs to change their
> conditions, and at the same time one of the SB servers might be
> disconnected, triggering this bug on a big number of nodes at once. If
> all of these nodes request the full database at the same time, this
> might be an issue.
>
> One possibility that came to mind is that we could store the 'last_id'
> of the previous successful cond_change and use it instead of 0 on
> re-connection if there was an in-flight cond_change. This way we could
> reduce the pressure on the SB server.
Thinking more about this, the idea should not work: with new conditions
we might need to receive changes that happened even before the last
successful cond_change, because we might relax conditions incrementally.
So, the current solution seems the most correct for now. I'd like to
hear some other thoughts on this issue with massive re-connections,
though, if any.

>
>> +    }
>> +
>>      ovsdb_idl_send_schema_request(idl, &idl->server);
>>      ovsdb_idl_transition(idl, IDL_S_SERVER_SCHEMA_REQUESTED);
>>      idl->data.monitoring = OVSDB_IDL_NOT_MONITORING;
>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
