Konstantin Orlov created IGNITE-25363:
-----------------------------------------

             Summary: Sql. Delayed NODE_LEFT event processing may cause query to hang
                 Key: IGNITE-25363
                 URL: https://issues.apache.org/jira/browse/IGNITE-25363
             Project: Ignite
          Issue Type: Bug
          Components: sql ai3
            Reporter: Konstantin Orlov


This problem is highlighted by the test 
{{org.apache.ignite.internal.runner.app.ItDataSchemaSyncTest#checkSchemasCorrectlyRestore}},
 which sometimes fails on TC with a timeout. The sequence of events is as follows:
 # Given: a cluster of 3 nodes; the distribution zone spans all of them.
 # Node 1 is restarted.
 # Notification of the 
{{org.apache.ignite.internal.network.TopologyEventHandler#onDisappeared}} 
handlers is delayed on node 2 (due to metastorage lag or any other reason).
 # A query is started from node 1.
 # The root fragment is processed locally, and {{QueryBatchRequest}} arrives at node 2 before 
{{QueryStartRequest}}. This step is crucial because it puts an incomplete future 
into the mailbox registry 
({{org.apache.ignite.internal.sql.engine.exec.MailboxRegistryImpl#locals}}).
 # The {{TopologyEventHandler}}s are notified on node 2. This causes the 
{{onNodeLeft}} handler to be chained to the future from the previous step.
 # {{QueryStartRequest}} arrives at node 2. The query fragment is created and 
immediately closed by the {{onNodeLeft}} handler.
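
The last three steps can be replayed in isolation with a plain {{CompletableFuture}}. This is only a minimal sketch of the ordering problem: {{Fragment}}, {{replay}} and the class name are hypothetical stand-ins, not Ignite classes, and the real registry and handlers are more involved.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class DelayedNodeLeftRace {
    /** Hypothetical stand-in for a query fragment; not an Ignite class. */
    static final class Fragment {
        final AtomicBoolean closed = new AtomicBoolean();

        void close() {
            closed.set(true);
        }
    }

    /** Replays the last three steps and reports whether the fragment ended up closed. */
    static boolean replay() {
        // QueryBatchRequest arrives first, so the registry holds a future
        // that is not completed yet.
        CompletableFuture<Fragment> pending = new CompletableFuture<>();

        // The delayed NODE_LEFT notification arrives and chains a close action
        // to that pending future.
        pending.thenAccept(Fragment::close);

        // QueryStartRequest arrives: the fragment is created and the future is
        // completed, which runs the chained close action right away.
        Fragment fragment = new Fragment();
        pending.complete(fragment);

        return fragment.closed.get();
    }

    public static void main(String[] args) {
        System.out.println("fragment closed by stale handler: " + replay());
    }
}
```

Because the close action is attached before the future completes, it fires as soon as the fragment is registered, even though the query itself was started after the node had already come back.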

The problem is that the {{onNodeLeft}} handler is applied to a query that was started on a 
topology which already takes the node restart into account. Such outdated 
events must be ignored.
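
One possible way to ignore outdated events is sketched below. It assumes, purely for illustration, that both the NODE_LEFT event and the query carry a monotonically increasing topology version; the ticket does not say how staleness should be detected, so {{leftAtVersion}}, {{queryTopologyVersion}} and all class and method names here are hypothetical.

```java
import java.util.concurrent.CompletableFuture;

public class StaleNodeLeftGuard {
    /** Hypothetical fragment that remembers the topology version its query was started on. */
    static final class Fragment {
        final long queryTopologyVersion;
        volatile boolean closed;

        Fragment(long queryTopologyVersion) {
            this.queryTopologyVersion = queryTopologyVersion;
        }
    }

    /** Closes the fragment only if the node left at or after the query's topology version. */
    static void onNodeLeft(Fragment fragment, long leftAtVersion) {
        if (leftAtVersion < fragment.queryTopologyVersion) {
            // Outdated event: the query was started on a topology that
            // already accounts for this node leaving (and rejoining).
            return;
        }
        fragment.closed = true;
    }

    /** Replays the late-handler scenario and reports whether the fragment was closed. */
    static boolean closedAfterReplay(long leftAtVersion, long queryVersion) {
        // Pending future, as if QueryBatchRequest had arrived first.
        CompletableFuture<Fragment> pending = new CompletableFuture<>();
        // Delayed NODE_LEFT handler is chained, now with the version check.
        pending.thenAccept(f -> onNodeLeft(f, leftAtVersion));
        // QueryStartRequest arrives and completes the future.
        Fragment fragment = new Fragment(queryVersion);
        pending.complete(fragment);
        return fragment.closed;
    }

    public static void main(String[] args) {
        System.out.println("stale event closes fragment: " + closedAfterReplay(5, 7));
        System.out.println("fresh event closes fragment: " + closedAfterReplay(9, 7));
    }
}
```

In this sketch the event from version 5 is dropped for a query started on version 7, while an event from version 9 still closes the fragment, so genuine node failures during the query are not masked.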



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
