On 1/27/22 19:09, Han Zhou wrote:
> On Thu, Jan 27, 2022 at 3:29 AM Dumitru Ceara <[email protected]> wrote:
>>
>> On 1/26/22 01:22, Han Zhou wrote:
>>>
>>> On Tue, Jan 11, 2022 at 8:38 AM Dumitru Ceara <[email protected]> wrote:
>>>>
>>>> When reconnecting, if there are condition changes already sent to the
>>>> server but not yet acked, reset the db's 'last-id', essentially
>>>> clearing the local cache after reconnect.
>>>>
>>>> This is needed because the client cannot easily differentiate between
>>>> the following cases:
>>>>   a. either the server already processed the requested monitor
>>>>      condition change but the FSM was restarted before the client
>>>>      was notified.  In this case the client should clear its local
>>>>      cache because it's out of sync with the monitor view on the
>>>>      server side.
>>>>   b. OR the server hasn't processed the requested monitor condition
>>>>      change yet.
>>>>
>>>> Conditions changing at the same time as a reconnection are rare, so
>>>> the performance impact of this patch should be minimal.
>>>>
>>>> Also, the tests are updated to cover the fact that we cannot control
>>>> which of the two scenarios ("a" and "b" above) is hit during the
>>>> test.
>>>>
>>>> Reported-by: Maxime Coquelin <[email protected]>
>>>> Signed-off-by: Dumitru Ceara <[email protected]>
>>>
>>> Thanks Dumitru. Could you share more details about how this problem
>>> is reproduced? Or are there test case failures because of this?
>>
>> Hi Han,
>>
>> Thanks for looking into this.
>>
>> Yes, Maxime reported that the test "2241: simple idl,
>> monitor_cond_since, cluster disconnect - C - tcp" would occasionally
>> fail without this patch.
>>
>>> I still didn't understand how the problem would occur. When a client
>>> reconnects to the server, it should trigger the jsonrpc session
>>> recreation on the server side, so the monitor data for the session
>>> should be recreated, and so there shouldn't be an "out of sync" view
>>> on the server side. Did I misunderstand?
>>
>> Going through the test logs we see (simplified and edited, skipping
>> the less relevant parts):
>>
>> 0. initial contents of the "simple" table on the server side:
>>      "row": {"i": 1, "r": 1.0, "b": true}
>>      "row": {"i": 2, "r": 1.0, "b": true}
>>
>> 1. the client connects for the first time, gets the schema, monitors
>>    the _Server database.
>>
>> 2. the client requests a monitor_cond_since, id=3:
>>    method="monitor_cond_since",
>>    params=[
>>      "idltest",
>>      ["monid","idltest"],
>>      {..."simple":[{"where":[false],"columns":["b","ba","i","ia","r","ra","s","sa","u","ua"]}],...},
>>      last_id=0  // last seen transaction id
>>    ], id=3
>>
>> 3. the server doesn't need to send any updates (nothing matches
>>    "false") and replies, id=3:
>>    result=[
>>      false, // requested "last_id" not found in history, will cause
>>             // the client to flush in-memory contents
>>      "13220f46-f13f-4f30-ab19-0af58699e0aa", // last known transaction
>>                                              // ID, the client stores
>>                                              // this locally in last_id
>>      {}], id=3
>>
>> 4. the client changes its conditions and sends a monitor_cond_change,
>>    id=5:
>>    params=[
>>      ["monid","idltest"],["monid","idltest"],
>>      {"simple":[{"where":[["i","==",2]]}]}], id=5
>>
>> ---
>> Client side view:
>>   - simple table: empty
>>   - last_id:    "13220f46-f13f-4f30-ab19-0af58699e0aa"
>>   - acked-cond: "simple": false
>>   - req-cond:   "simple": {"where":[["i","==",2]]}
>>   - new-cond:   <none>
>> ---
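>>
>> (Aside: the "Client side view" blocks here mirror how ovsdb-cs tracks
>> conditions per table.  A simplified sketch of the bookkeeping, adapted
>> from lib/ovsdb-cs.c -- comments mine, details approximate:
>>
>>     struct ovsdb_cs_db_table {
>>         struct json *ack_cond; /* Last condition acked by the server. */
>>         struct json *req_cond; /* Condition sent to the server but not
>>                                 * yet acked. */
>>         struct json *new_cond; /* Latest condition set by the client,
>>                                 * not yet sent. */
>>     };
>>
>> plus a per-db 'last_id' holding the last transaction id seen.)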
>> 5. this new condition matches one row in table "simple" so the server
>>    sends an update which includes that row:
>>    method="update3",
>>    params=[
>>      ["monid","idltest"],
>>      "13220f46-f13f-4f30-ab19-0af58699e0aa",
>>      {"simple":{"fd500d06-8d0c-4149-9c69-2914538e07e5":{"insert":{"b":true,"r":1,"i":2}}}}
>>    ]
>>
>> 6. the client inserts the row in its in-memory copy of the database,
>>    last_id becomes "13220f46-f13f-4f30-ab19-0af58699e0aa".  However,
>>    the condition change is not yet acked by the server!
>> ---
>> Client side view:
>>   - simple table:
>>     - row: {"b":true,"r":1,"i":2}
>>   - last_id:    "13220f46-f13f-4f30-ab19-0af58699e0aa"
>>   - acked-cond: "simple": false
>>   - req-cond:   "simple": {"where":[["i","==",2]]}
>>   - new-cond:   <none>
>> ---
>>
>> 7. the client changes its conditions again but doesn't send a
>>    monitor_cond_change because there's already one in flight, not yet
>>    acked.
>> ---
>> Client side view:
>>   - simple table:
>>     - row: {"b":true,"r":1,"i":2}
>>   - last_id:    "13220f46-f13f-4f30-ab19-0af58699e0aa"
>>   - acked-cond: "simple": false
>>   - req-cond:   "simple": {"where":[["i","==",2]]}
>>   - new-cond:   "simple": {"where":[["i","==",1]]}
>> ---
>>
>> 8. the connection goes down, ovsdb_cs_db_sync_condition() is called
>>    from ovsdb_cs_restart_fsm():
>> ---
>> Client side view:
>>   - simple table:
>>     - row: {"b":true,"r":1,"i":2}
>>   - last_id:    "13220f46-f13f-4f30-ab19-0af58699e0aa"
>>   - acked-cond: "simple": false
>>   - req-cond:   <none>
>>   - new-cond:   "simple": {"where":[["i","==",1]]}
>> ---
>>
>> 9. the client reconnects, gets the schema, monitors the _Server
>>    database, and a new monitor is set up on the server side.
>>
>> 10. the client requests a monitor_cond_since, id=9:
>>     method="monitor_cond_since",
>>     params=[
>>       "idltest",
>>       ["monid","idltest"],
>>       {..."simple":[{"where":[false],"columns":["b","ba","i","ia","r","ra","s","sa","u","ua"]}],...},
>>                     // last acked condition!
>>       "13220f46-f13f-4f30-ab19-0af58699e0aa" // last seen transaction id
>>     ], id=9
>>
>> 11. the server processes the condition change, but it changed to
>>     "false", so it shouldn't send any updates to the client: in the
>>     server's view the client didn't receive any rows yet, so there's
>>     nothing to delete.  However, the last_id _is_ known by the server,
>>     so it replies with:
>>     result=[
>>       true, // last_id was found so the client will NOT flush its
>>             // in-memory db contents!
>>       "13220f46-f13f-4f30-ab19-0af58699e0aa", // last_id
>>       {}], id=9
>>
>> 12. At this point the client side view is inconsistent: the row doesn't
>>     match the acked condition and there's no other in-flight condition
>>     change request:
>> ---
>> Client side view (INCONSISTENT):
>>   - simple table:
>>     - row: {"b":true,"r":1,"i":2}
>>   - last_id:    "13220f46-f13f-4f30-ab19-0af58699e0aa"
>>   - acked-cond: "simple": false
>>   - req-cond:   <none>
>>   - new-cond:   "simple": {"where":[["i","==",1]]}
>> ---
>>
>> 13. the client starts a transaction to update the row with i==1 and
>>     change "r" to 2, id=11:
>>     method="transact",
>>     params=[
>>       "idltest",
>>       {"where":[["i","==",1]],"row":{"r":2},"op":"update","table":"simple"}
>>     ], id=11
>>
>> 14. the server processes the transaction, and the server-side contents
>>     become:
>>       "row": {"i": 1, "r": 2.0, "b": true}
>>       "row": {"i": 2, "r": 1.0, "b": true}
>>
>>     The updated row doesn't match the client's condition (false) so
>>     the server only sends the txn result:
>>     reply, result=[{"count":1}], id=11
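>>
>> (Aside: the crux is in step 11.  On a monitor_cond_since reply the
>> client only flushes its cache when the server reports that last_id was
>> not found; a condensed sketch of that logic -- the helper name is made
>> up, this is not the actual ovsdb-cs code structure:
>>
>>     static void
>>     handle_monitor_cond_since_reply(struct ovsdb_cs_db *db, bool found,
>>                                     const struct uuid *last_id)
>>     {
>>         if (!found) {
>>             /* Server doesn't know our last_id: flush the cache and
>>              * expect a full set of updates. */
>>             flush_local_cache(db); /* hypothetical helper */
>>         }
>>         /* If 'found', the cache is kept, even though its rows may
>>          * have been received under a condition the server never
>>          * acked -- exactly the inconsistency described above. */
>>         db->last_id = *last_id;
>>     }
>> )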
>> 15. the client sends the new conditions (which were stored in
>>     new-cond) in a monitor_cond_change, id=12:
>>     method="monitor_cond_change",
>>     params=[
>>       ["monid","idltest"],["monid","idltest"],
>>       {"simple":[{"where":[["i","==",1]]}]}],
>>     id=12
>>
>> 16. the server processes the new condition; now the second row matches
>>     it, so the server sends the update3 message:
>>     method="update3",
>>     params=[
>>       ["monid","idltest"],
>>       "ca9454d0-c013-440a-9c56-1dd2a4693bf7",
>>       {"simple":{"f97bfbe1-1911-4d6f-89a1-b211cf3bb5dc":{"insert":{"b":true,"r":2,"i":1}}}}
>>     ]
>>
>> 17. the server acks the condition change:
>>     reply, result={}, id=12
>> ---
>> Client side view (INCONSISTENT):
>>   - simple table:
>>     - row: {"b":true,"r":1,"i":2}
>>     - row: {"b":true,"r":2,"i":1}
>>   - last_id:    "ca9454d0-c013-440a-9c56-1dd2a4693bf7"
>>   - acked-cond: "simple": {"where":[["i","==",1]]}
>>   - req-cond:   <none>
>>   - new-cond:   <none>
>> ---
>>
>> The test fails here because it (correctly) expects the client to only
>> have the second row in its in-memory view of the database.
>>
>> I'm not sure of a better way to summarize this but I'm open to
>> suggestions about how to rephrase the commit message.
>>
> Thanks Dumitru for the details! Now I understand the problem.
> So it is indeed an inconsistency between the client's monitor condition
> and its local data.  The client thinks it has data for monitor
> condition A (the last acked condition), but the data already reflects
> monitor condition B (the unacked condition).  Then any subsequent
> operations can lead to inconsistent results, and of course to a view
> inconsistent with the server side when the server creates a new
> session.
>
> The problem is in fact that the data updates for the condition change
> and the ACK message for the condition change are not atomic.  If we
> could combine these two messages into one, then we wouldn't have this
> problem, right?  That would require a protocol change, and I don't see
> any obvious reason to keep these as separate messages except, probably,
> for convenience.
Well, in normal operation mode there is eventual consistency, which is
fine.  But in the corner case of reconnecting while conditions are being
changed, this breaks.  And it breaks only if last_id is still known on
the server side.

I'm not sure how easy it would be to make the protocol change without
breaking existing deployments.  Moreover, as mentioned in the commit log,
I don't think the scenario described above is a very common one, and the
performance impact of the fix should be minimal.
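For reference, the gist of the fix, as a simplified sketch rather than
the literal patch (names follow lib/ovsdb-cs.c but details are
approximate):

    static void
    ovsdb_cs_db_sync_condition(struct ovsdb_cs_db *db)
    {
        struct ovsdb_cs_db_table *table;

        HMAP_FOR_EACH (table, hmap_node, &db->tables) {
            if (table->req_cond) {
                /* A condition change was in flight when the connection
                 * went down and we can't know whether the server
                 * applied it.  Reset last_id so that the next
                 * monitor_cond_since request carries an all-zero id;
                 * the server will reply with found == false and the
                 * client will flush its local cache. */
                memset(&db->last_id, 0, sizeof db->last_id);
            }
            /* ... existing logic merging req_cond/new_cond ... */
        }
    }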
> I am ok with this patch if we don't want to fix the protocol now.
> Acked-by: Han Zhou <[email protected]>

Thanks!

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev