Konstantin Orlov created IGNITE-25363:
-----------------------------------------
Summary: Sql. Delayed NODE_LEFT event processing may cause query to hang
Key: IGNITE-25363
URL: https://issues.apache.org/jira/browse/IGNITE-25363
Project: Ignite
Issue Type: Bug
Components: sql ai3
Reporter: Konstantin Orlov

This problem is highlighted by the test {{org.apache.ignite.internal.runner.app.ItDataSchemaSyncTest#checkSchemasCorrectlyRestore}}, which sometimes fails on TC with a timeout.

The sequence of events is as follows:
# Given: a cluster of 3 nodes; the distribution zone spans all of these nodes.
# Node 1 is restarted.
# Notification of the {{org.apache.ignite.internal.network.TopologyEventHandler#onDisappeared}} handlers is delayed on node 2 (due to metastorage lag or some other reason).
# A query is started from node 1.
# The root fragment is processed locally, and {{QueryBatchRequest}} arrives at node 2 before {{QueryStartRequest}}. This step is crucial because it puts an uncompleted future into the mailbox registry ({{org.apache.ignite.internal.sql.engine.exec.MailboxRegistryImpl#locals}}).
# The {{TopologyEventHandler}} handlers are notified on node 2. This step causes the {{onNodeLeft}} handler to be chained to the future from the previous step.
# {{QueryStartRequest}} arrives at node 2. The query fragment is created and immediately closed by the {{onNodeLeft}} handler.

The problem is that the {{onNodeLeft}} handler is applied to a query that was started on a topology which already takes the node restart into account. We have to ignore such outdated events.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
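
The idea of ignoring outdated events could be sketched roughly as below. This is only an illustration under stated assumptions, not Ignite's actual code or the eventual fix: the class {{StaleNodeLeftFilter}}, its methods, and the notion that both a query and a node-left event carry a comparable logical topology version are all hypothetical. The sketch cancels a query only when the node left after the topology version the query was started on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; none of these names exist in Ignite. */
final class StaleNodeLeftFilter {
    /** Assumed bookkeeping: the initiator node and the topology version the query was started on. */
    static final class RunningQuery {
        final String initiatorNode;
        final long topologyVersion;

        RunningQuery(String initiatorNode, long topologyVersion) {
            this.initiatorNode = initiatorNode;
            this.topologyVersion = topologyVersion;
        }
    }

    private final Map<String, RunningQuery> running = new ConcurrentHashMap<>();

    /** Remembers a query together with the topology version it was started on. */
    void register(String queryId, String initiatorNode, long topologyVersion) {
        running.put(queryId, new RunningQuery(initiatorNode, topologyVersion));
    }

    /**
     * Handles a (possibly delayed) node-left notification.
     * Returns the ids of the queries that were cancelled by this event.
     */
    List<String> onNodeLeft(String nodeName, long leftAtVersion) {
        List<String> cancelled = new ArrayList<>();

        for (Map.Entry<String, RunningQuery> e : running.entrySet()) {
            RunningQuery q = e.getValue();

            boolean involvesNode = q.initiatorNode.equals(nodeName);
            // The event is outdated if the query was started on a topology
            // that already reflects the node leaving (and coming back).
            boolean eventIsStale = leftAtVersion <= q.topologyVersion;

            if (involvesNode && !eventIsStale) {
                cancelled.add(e.getKey());
            }
        }

        cancelled.forEach(running::remove);
        return cancelled;
    }
}
```

In the scenario above, if node 1 left at an assumed version 2 and the query was started on version 3 (after the restart), the delayed event is ignored; a later departure at version 4 would still cancel the query.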