On 4/23/20 8:22 PM, Ilya Maximets wrote:
> On 4/22/20 8:38 PM, Dumitru Ceara wrote:
>> Assuming an ovsdb client connected to a database using OVSDB_MONITOR_V3
>> (i.e., "monitor_cond_since" method) with the initial monitor condition
>> MC1.
>>
>> Assuming the following two transactions are executed on the
>> ovsdb-server:
>> TXN1: "insert record R1 in table T1"
>> TXN2: "insert record R2 in table T2"
>>
>> If the client's monitor condition MC1 for table T2 matches R2 then the
>> client will receive the following update3 message:
>> method="update3", "insert record R2 in table T2", last-txn-id=TXN2
>>
>> At this point, if the presence of the new record R2 in the IDL triggers
>> the client to update its monitor condition to MC2 and add a clause for
>> table T1 which matches R1, a monitor_cond_change message is sent to the
>> server:
>> method="monitor_cond_change", "clauses from MC2"
>>
>> In normal operation the ovsdb-server will reply with a new update3
>> message of the form:
>> method="update3", "insert record R1 in table T1", last-txn-id=TXN2
>>
>> However, if the connection drops in the meantime, this last update might
>> get lost.
>>
>> It might happen that during the reconnect a new transaction happens
>> that modifies the original record R1:
>> TXN3: "modify record R1 in table T1"
>>
>> When the client reconnects, it will try to perform a fast resync by
>> sending:
>> method="monitor_cond_since", "clauses from MC2", last-txn-id=TXN2
>>
>> Because TXN2 is still in the ovsdb-server transaction history, the
>> server replies with the changes from the most recent transactions only,
>> i.e., TXN3:
>> result="true", last-txn-id=TXN3, "modify record R1 in table T1"
>>
>> This causes the IDL on the client to end up in an inconsistent
>> state because it has never seen the update that created R1.
>>
>> Such a scenario is described in:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1808580#c22
>>
>> To avoid hitting this issue, whenever a reconnect happens (either
>> due to an IDL retry or due to network issues), we clear db->last_id
>> in ovsdb_idl_restart_fsm() in any of the following cases:
>> - a monitor condition has been changed locally and the corresponding
>>   request was not sent yet (i.e., idl->data.cond_changed == true).
>> - a monitor_cond_change request is in flight.
>>
>> This ensures that updates of type "insert" that happened before the last
>> transaction known by the IDL but didn't match old monitor conditions are
>> sent upon reconnect if the monitor condition has changed to include them
>> in the meantime.
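[Editor's note: a hedged sketch of the wire exchange the fix forces on reconnect. The message layout follows the monitor_cond_since extension as documented in ovsdb-server(7); the database name "OVN_Southbound" and the `<json-value>` placeholder are illustrative, and MC2/TXN3 refer to the scenario above.]

```
client -> server:
  method="monitor_cond_since"
  params=[ "OVN_Southbound", <json-value>, {clauses from MC2},
           "00000000-0000-0000-0000-000000000000" ]  <- cleared last_id

server -> client:
  result=[ false,    <- cannot resume from the all-zeros txn id
           TXN3,     <- current last transaction id
           <full contents of the database matching MC2> ]
```

Because the all-zeros transaction id is never found in the server's transaction history, the server falls back to sending the complete view matching MC2, which includes the insert of R1 the client would otherwise have missed.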
>>
>> CC: Han Zhou <[email protected]>
>> Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when connection reset.")
>> Signed-off-by: Dumitru Ceara <[email protected]>
>> ---
>>  lib/ovsdb-idl.c |   19 +++++++++++++++++++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>> index 1535ad7..67c4745 100644
>> --- a/lib/ovsdb-idl.c
>> +++ b/lib/ovsdb-idl.c
>> @@ -690,6 +690,25 @@ ovsdb_idl_send_request(struct ovsdb_idl *idl, struct jsonrpc_msg *request)
>>  static void
>>  ovsdb_idl_restart_fsm(struct ovsdb_idl *idl)
>>  {
>> +    /* If there's an outstanding request of type monitor_cond_change and
>> +     * we're in monitor_cond_since mode then we can't trust that all relevant
>> +     * updates from transaction idl->data.last_id have been received as we
>> +     * might have relaxed the monitor condition with our last request and
>> +     * might be missing previously not monitored records.
>> +     *
>> +     * Same reasoning applies for the case when a monitor condition has been
>> +     * changed locally but the monitor_cond_change request was not sent yet.
>> +     *
>> +     * In both cases, clear last_id to make sure that the next time
>> +     * monitor_cond_since is sent (i.e., after reconnect) we get the complete
>> +     * view of the database.
>> +     */
>> +    if (idl->data.cond_changed ||
>> +            (idl->request_id &&
>> +                idl->data.monitoring == OVSDB_IDL_MONITORING_COND_SINCE)) {
>> +        uuid_zero(&idl->data.last_id);
> 
> As Han pointed out during the OVN IRC meeting, when there is a change to the
> SB, e.g. creating a new DP, it causes all HVs to change their monitor
> conditions, and at the same time one of the SB servers might be disconnected,
> triggering this bug on a big number of nodes at once.  If all of these nodes
> request the full database at the same time, this might be an issue.
> 
> One possibility that came to mind is that we could store 'last_id' for the
> previous successful cond_change and use it instead of 0 on re-connection if
> there was an in-flight cond_change.  This way we could reduce the pressure
> on the SB server.

Thinking more about this, this idea should not work because with new conditions
we might need to receive changes that happened even before the last successful
cond_change, since we might relax conditions incrementally.

So, the current solution seems the most correct for now.  I'd like to hear some
other thoughts about this issue with massive re-connections, though, if any.

> 
>> +    }
>> +
>>      ovsdb_idl_send_schema_request(idl, &idl->server);
>>      ovsdb_idl_transition(idl, IDL_S_SERVER_SCHEMA_REQUESTED);
>>      idl->data.monitoring = OVSDB_IDL_NOT_MONITORING;
>>
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
