On 17/08/2021 13:52, Ilya Maximets wrote:
On 8/17/21 1:27 PM, Anton Ivanov wrote:
Hi Ilya, hi list,

I ran some detailed experiments and there is an issue with all forms of 
"skipping" and/or reordering processing.

If the session list is skipped or reordered (I tried "fast-forwarding" the list 
to a new head position after hitting a time constraint), ovsdb fails to issue the 
response to some transactions when running the cluster test suite.

At present I am unable to get to the root cause.

The issue does not exist if processing bails out of the session loop and is 
re-run IN FULL (as in the earliest versions of the patch).
That is weird.  I'm not sure how the re-ordering is different from
the 're-run in full' here.  The only thing that is different is the
actual order in which sessions are processed, because we're still
re-running all of them in full for as long as time allows.

Statement of fact: all varieties of skipping resulted in this from time to 
time. I went through the code about 20 times yesterday and I cannot figure out 
what causes it.

The only thing which worked at the end was to reorder the list as follows (this 
is in version 8) and re-run sessions in full:

1. Cut the elements from the one after the point where processing was 
interrupted through to the tail. Create a new list from these elements (all 
of them unprocessed). The "next" element is now at the head position.

2. Push the processed elements onto the end of this new list.

3. Replace the original list with this rearranged list.

There is no skipping at session level - they are all re-run again immediately 
after giving raft_run a chance to work, just in different order - first the 
unprocessed ones, then the ones that were processed prior to the interruption.
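
The three steps above amount to rotating the list around the interruption 
point. Below is a minimal sketch of that rotation, assuming a circular doubly 
linked list with a sentinel head. The names struct node, list_init, 
list_push_back and list_rotate_after are hypothetical and chosen for 
illustration only; this is not the OVS list API and not the code from the 
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal circular doubly linked list; 'head' is a sentinel. */
struct node {
    struct node *prev, *next;
};

static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

static void list_push_back(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Rotates the list so that the element after 'last_done' becomes the first
 * one.  The already processed elements (old first .. last_done) end up at
 * the tail in their original relative order.  Pass 'head' itself as
 * 'last_done' if nothing was processed. */
static void list_rotate_after(struct node *head, struct node *last_done)
{
    assert(head != NULL && last_done != NULL);

    struct node *first = head->next;     /* first processed element */
    struct node *next = last_done->next; /* first unprocessed element */
    struct node *tail = head->prev;      /* last unprocessed element */

    if (last_done == head || next == head) {
        return; /* nothing or everything was processed: keep the order */
    }

    /* Steps 1 and 3: the unprocessed run [next .. tail] becomes the front. */
    head->next = next;
    next->prev = head;

    /* Step 2: the processed run [first .. last_done] goes after the tail. */
    tail->next = first;
    first->prev = tail;
    last_done->next = head;
    head->prev = last_done;
}
```

With sessions A, B, C, D and an interruption after B, the list becomes 
C, D, A, B. No element is dropped, so every session is still visited on the 
next pass.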

I also tried a few other approaches, e.g. rearranging the list order on each 
iteration. They can also be made to work provided that there is a full re-run 
and no skipping.

A.


I am going to re-issue the patch without any skipping whatsoever (either at 
the remote or at the session level), because that works and improves raft (and 
overall ovn) stability.

While there may be some starvation of the sessions towards the end of the 
session list, it should be a second order effect, because re-processing 
sessions which have just been processed generates only a minimal amount of 
changes.

Skipping (if any) will be a later optimization after I get to the bottom of 
this and figure out why monitor updates are not followed by the transaction 
response.
This doesn't sound good to me.  It's pretty easy to spam the
ovsdb-server with monitor requests or condition changes.  Each of
these requires a walk across the whole database.  And if the database
is big enough, other sessions will never be served due to one
faulty/malicious connection.  It's also possible that we
have a few thousand connections and processing all of them
legitimately takes a lot of time.  This will be a problem
if the rate of database changes is relatively high and constant.

Best regards, Ilya Maximets.

--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
