On 3/5/20 9:50 PM, Ben Pfaff wrote:
> On Tue, Mar 03, 2020 at 01:47:48PM +0100, Dumitru Ceara wrote:
>> CC-ing Andy and Ben.
>>
>> On 3/3/20 7:43 AM, Han Zhou wrote:
>>>
>>>
>>> On Mon, Mar 2, 2020 at 7:55 AM Dumitru Ceara <dce...@redhat.com
>>> <mailto:dce...@redhat.com>> wrote:
>>>>
>>>> On 2/29/20 12:00 AM, Dumitru Ceara wrote:
>>>>> If the ovsdb-server reply to "monitor_cond_since" requests has
>>>>> "found" == false then ovsdb_idl_db_parse_monitor_reply() calls
>>>>> ovsdb_idl_db_clear() which iterates through all tables and
>>>>> unconditionally sets table->cond_changed to false.
>>>>>
>>>>> However, if the client had already set a new condition for some of the
>>>>> tables, this new condition request will never be sent to ovsdb-server
>>>>> until the condition is reset to a different value. This is due to the
>>>>> check in ovsdb_idl_db_set_condition().
>>>>>
>>>>> In order to fix this we now also call ovsdb_idl_condition_clear() for
>>>>> each table condition in ovsdb_idl_db_clear(). This ensures that
>>>>> resetting the condition to the same value as before the
>>>>> "monitor_cond_since" reply will trigger a new request.
>>>>>
>>>>> One way to replicate the issue is described in the bugzilla reporting
>>>>> the bug, when ovn-controller is configured to use "ovn-monitor-all":
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6
>>>>>
>>>>> Reported-by: Dan Williams <d...@redhat.com <mailto:d...@redhat.com>>
>>>>> Reported-at: https://bugzilla.redhat.com/1808125
>>>>> CC: Han Zhou <hz...@ovn.org <mailto:hz...@ovn.org>>
>>>>> Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
>>> connection reset.")
>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com
>>> <mailto:dce...@redhat.com>>
>>>>
>>>> Hi Han,
>>>>
>>>> I was wondering if you could review this change because ovn-kubernetes
>>>> is trying to enable ovn-monitor-all for scalability reasons but their CI
>>>> fails due to the bug fixed by the patch below.
>>>>
>>>> Thanks,
>>>> Dumitru
>>>>
>>>>> ---
>>>>>  lib/ovsdb-idl.c | 6 +++++-
>>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>>>>> index 190143f..64721ca 100644
>>>>> --- a/lib/ovsdb-idl.c
>>>>> +++ b/lib/ovsdb-idl.c
>>>>> @@ -610,7 +610,11 @@ ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
>>>>>          struct ovsdb_idl_table *table = &db->tables[i];
>>>>>          struct ovsdb_idl_row *row, *next_row;
>>>>>
>>>>> -        table->cond_changed = false;
>>>>> +        if (table->cond_changed) {
>>>>> +            table->cond_changed = false;
>>>>> +            ovsdb_idl_condition_clear(&table->condition);
>>>>> +        }
>>>>> +
>>>>>          if (hmap_is_empty(&table->rows)) {
>>>>>              continue;
>>>>>          }
>>>>>
>>>>
>>>
>>> Hi Dumitru>
>>> If I understand the problem, it is when the client updated monitor
>>> condition and at the same time it received monitor_cond_since reply from
>>> server for it's previous conditional monitor request, and the handling
>>> of the reply overwrite the flag table->cond_changed from true to false,
>>> and because of this, the new condition is not sent to server.
>>>
>> Hi Han,
>>
>> Yes, that's what's happening.
>>
>>> However, I don't understand the fix. It seems that the fix should be
>>> "don't overwrite the flag", instead of clearing the conditions as well.
>>> Also, by simply looking at this code I don't understand why the flag
>>> needs to be set to false when clearing data in IDL. What would be the
>>> problem if we keep it as what it is? If you understand the details,
>>> could you explain more about the fix?
>>>
>>> In addition, I didn't change this logic when implementing
>>> monitor_cond_since in 403a6a0cb003 ("ovsdb-idl: Fast resync from server
>>> when connection reset."). In that commit it only reduced the chance of
>>> calling ovsdb_idl_db_clear() as an optimization. Before that commit,
>>> ovsdb_idl_db_parse_monitor_reply() always calls ovsdb_idl_db_clear() and
>>> it would result in same problem.
>> You're right, my bad, I messed up the "git blame", sorry about that. It
>> was actually introduced by:
>>
>> https://github.com/openvswitch/ovs/commit/5351980b047f4dd40be7a59a1e4b910df21eca0a
>>
>> I'll fix up the commit message in V2 once we decide on the best approach.
>>
>> Hi Andy, Ben,
>>
>> As far as I understand, whenever ovsdb_idl_db_clear() is called during
>> reconnect, any buffered but unsent conditions should be cleared because
>> they will be anyway resent with the following monitoring condition
>> update message after reconnect has completed.
>>
>> However, clearing the "table->cond_changed" flag is not enough because
>> when the IDL client (ovn-controller in my case) will try to set the same
>> condition as before (e.g., "true" if ovn-monitor-all=true)
>> ovsdb_idl_db_set_condition() will return early:
>>
>> https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L1518
>>
>> Therefore in the cases when ovsdb_idl_db_clear() is called, we should
>> also clear the existing conditions on the DB tables.
>>
>> As Han suggested, we could also revert the changes from 5351980b047f
>> ("ovsdb-idl: Avoid sending redundant conditional monitoring updates")
>> and we wouldn't hit the issue anymore. I'm not sure however if this has
>> any other major implications.
>>
>> What do you think?
> 
> I honestly don't understand what's going on here well enough at the
> moment.  There's a lot of context.  If necessary, I can try to untangle
> it, but I'd rather leave it to the two of you if you can figure it out.
> 

Hi Han,

I just sent a v2 of the patch. The code changes are the same as in v1,
just that I updated the "Fixes" tag to reflect the correct commit that
introduced the issue:

https://patchwork.ozlabs.org/patch/1252278/

It would be great if you could help review this new version of the patch.

Thanks,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to