Aleksandr Chesnokov created IGNITE-27962:
--------------------------------------------

             Summary: IgniteLock may hang forever in busy-wait during node stop
                 Key: IGNITE-27962
                 URL: https://issues.apache.org/jira/browse/IGNITE-27962
             Project: Ignite
          Issue Type: Bug
    Affects Versions: 2.17
            Reporter: Aleksandr Chesnokov


*Summary*
Ignite 2.17.0: Distributed reentrant lock may hang forever during a Kubernetes 
rolling upgrade when {{failoverSafe=true}}.

*Environment*
 * Apache Ignite 2.17.0
 * Kubernetes deployment
 * Backend pods = Ignite server nodes
 * Frontend pods = thick clients
 * Rolling upgrade (server nodes restarted one-by-one)

*Problem*
During a rolling upgrade, a thick client may hang indefinitely while calling:
 * {{lock()}}
 * {{unlock()}}
 * {{tryLock(timeout)}}

This happens when a server node is stopped at a specific moment during lock 
acquisition.

*Expected behavior*
When a server node leaves the cluster, the client should either:
 * recover automatically, or
 * fail fast with an exception.

{{tryLock(timeout)}} should respect the timeout.

*Actual behavior*
The client thread enters a busy-wait loop inside {{GridCacheLockImpl}} and 
never returns from the call. Ignite does not recover from this state on its own.

All other threads trying to acquire the same lock also become blocked, leading 
to full system degradation. The client must be restarted.
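Until the root cause is fixed, callers can at least protect themselves from an unbounded hang. Below is a minimal client-side guard sketch in plain Java (stdlib only); the {{Runnable}} is a stand-in for the real {{IgniteLock.lock()}} call, and the wrapper is an assumption about how an application might use it, not part of the Ignite API:

```java
import java.util.concurrent.*;

public class BoundedLockCall {
    // Runs a potentially-hanging blocking call under a hard deadline.
    // 'blockingCall' stands in for e.g. igniteLock.lock().
    static boolean callWithDeadline(Runnable blockingCall, long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        Future<?> f = ex.submit(blockingCall);
        try {
            f.get(timeout, unit);
            return true;                  // call completed in time
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } catch (TimeoutException e) {
            f.cancel(true);               // interrupt the stuck worker
            return false;                 // caller can fail fast / retry
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate the observed hang: a call that never returns on its own.
        Runnable hangs = () -> {
            while (!Thread.currentThread().isInterrupted()) { } // busy-wait
        };
        boolean ok = callWithDeadline(hangs, 200, TimeUnit.MILLISECONDS);
        System.out.println(ok ? "acquired" : "timed out");
    }
}
```

Note this only frees the *calling* thread: if the real busy-wait inside {{GridCacheLockImpl}} ignores interruption, the worker thread may still leak until the client restarts.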

*Suspected root cause*
Lock acquisition flow:
 # Node A calls {{ignite.reentrantLock(..., failoverSafe=true).lock()}}.
 # Node A commits a pessimistic transaction to acquire the lock.
 # Node A enters a busy-wait loop waiting for an “ack” message.
 # The ack is delivered via a continuous query update (observed as 
{{TOPIC_CONTINUOUS}}).
 # If Node B (responsible for sending this update) is stopped via 
{{Ignition.stop(..., cancel=true)}} before sending the message, the ack is 
never emitted.
 # Node A remains in the busy-wait loop forever.

The same issue may occur during {{unlock()}}.
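The failure in step 6 is the classic unbounded spin: waiting for an ack flag with no deadline. The sketch below (stdlib Java; the flag and method names are illustrative, not Ignite internals) contrasts that pattern with the deadline-bounded variant that a {{tryLock(timeout)}} path would need:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class AckWait {
    // Illustrative stand-in for the ack delivered over TOPIC_CONTINUOUS.
    static final AtomicBoolean ackReceived = new AtomicBoolean(false);

    // Unbounded variant: if the ack sender has died, this spins forever.
    static void awaitAck() {
        while (!ackReceived.get()) {
            Thread.onSpinWait();
        }
    }

    // Bounded variant: returns false once the deadline passes, so a
    // tryLock-style caller can give up instead of hanging.
    static boolean awaitAck(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!ackReceived.get()) {
            if (System.nanoTime() >= deadline)
                return false;
            Thread.onSpinWait();
        }
        return true;
    }

    public static void main(String[] args) {
        // Ack never arrives (sender node stopped): bounded wait gives up.
        System.out.println(awaitAck(100, TimeUnit.MILLISECONDS)); // prints false
    }
}
```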

*Additional notes*
 * {{failoverSafe=true}} does not prevent the issue.
 * Happens with both {{ShutdownPolicy.IMMEDIATE}} and {{GRACEFUL}}.
 * Cluster uses Kubernetes headless service for discovery.

*Impact*
Critical. Causes indefinite hang and complete degradation of lock-related 
operations.

*Source*
Reported on the Ignite user mailing list, 24 Feb 2026 
https://lists.apache.org/thread/tyz91fskkt9klmpyn1jn249myvpzt8l0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
