Aleksandr Chesnokov created IGNITE-27962:
--------------------------------------------
Summary: IgniteLock may hang forever in busy-wait during node stop
Key: IGNITE-27962
URL: https://issues.apache.org/jira/browse/IGNITE-27962
Project: Ignite
Issue Type: Bug
Affects Versions: 2.17
Reporter: Aleksandr Chesnokov
*Summary*
Ignite 2.17.0: a distributed reentrant lock may hang forever during a Kubernetes
rolling upgrade when {{failoverSafe=true}}.
*Environment*
* Apache Ignite 2.17.0
* Kubernetes deployment
* Backend pods = Ignite server nodes
* Frontend pods = thick clients
* Rolling upgrade (server nodes restarted one-by-one)
*Problem*
During a rolling upgrade, a thick client may hang indefinitely while calling:
* {{lock()}}
* {{unlock()}}
* {{tryLock(timeout)}}
This happens when a server node is stopped at a specific moment during lock
acquisition.
*Expected behavior*
When a server node leaves the cluster, the client should either:
* recover automatically, or
* fail fast with an exception.
{{tryLock(timeout)}} should respect the timeout.
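Since {{IgniteLock}} implements {{java.util.concurrent.locks.Lock}}, the timeout contract expected here can be illustrated with a plain JDK {{ReentrantLock}} as a stand-in (a self-contained sketch; the class and method names below are illustrative, not Ignite code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockTimeoutDemo {
    // The contract tryLock(timeout) is expected to honor: return true or
    // false within roughly the given timeout, never hang indefinitely.
    public static boolean tryAcquire(ReentrantLock lock, long ms) throws InterruptedException {
        return lock.tryLock(ms, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock();
        Thread holder = new Thread(() -> {
            lock.lock();                       // simulate another owner of the lock
            try { Thread.sleep(1_000); } catch (InterruptedException ignored) {}
            lock.unlock();
        });
        holder.start();
        Thread.sleep(100);                     // let the holder acquire first
        long start = System.nanoTime();
        boolean got = tryAcquire(lock, 200);   // times out instead of hanging
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(got + " after ~" + elapsedMs + " ms");
        holder.join();
    }
}
```

The bug report is that the Ignite implementation violates exactly this contract when the ack-sending node leaves the cluster mid-acquisition.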
*Actual behavior*
The client thread enters a busy-wait loop inside {{GridCacheLockImpl}} and
never resumes.
Ignite does not recover from this state.
All other threads trying to acquire the same lock also become blocked, leading
to full system degradation. The client must be restarted.
*Suspected root cause*
Lock acquisition flow:
# Node A calls {{ignite.reentrantLock(..., failoverSafe=true).lock()}}.
# Node A commits a pessimistic transaction to acquire the lock.
# Node A enters a busy-wait loop waiting for an “ack” message.
# The ack is delivered via a continuous query update (observed as
{{TOPIC_CONTINUOUS}}).
# If Node B (responsible for sending this update) is stopped via
{{Ignition.stop(..., cancel=true)}} before sending the message, the ack is
never emitted.
# Node A remains in the busy-wait loop forever.
The same issue may occur during {{unlock()}}.
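The hang pattern described above can be modeled with a minimal, self-contained sketch (all names are illustrative, not {{GridCacheLockImpl}} internals): an unbounded spin on an "ack" flag never exits if the sender node stops before emitting the ack, whereas a deadline-bounded wait fails fast:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class BusyWaitSketch {
    // Simplified model of the suspected flow: the acquiring thread spins
    // until an "ack" flag is set by an update from another node.
    // If that node is stopped before sending the ack, this never returns.
    static boolean awaitAckUnbounded(AtomicBoolean ack) {
        while (!ack.get())
            Thread.onSpinWait();
        return true;
    }

    // Deadline-bounded variant: gives up after timeoutMs, which is the
    // fail-fast behavior the report argues the client should have.
    static boolean awaitAckBounded(AtomicBoolean ack, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!ack.get()) {
            if (System.nanoTime() >= deadline)
                return false;              // fail fast instead of hanging
            Thread.onSpinWait();
        }
        return true;
    }

    public static void main(String[] args) {
        // Simulate the failure: the ack is never delivered.
        AtomicBoolean ackNeverSent = new AtomicBoolean(false);
        System.out.println(awaitAckBounded(ackNeverSent, 100)); // prints false
    }
}
```

Calling {{awaitAckUnbounded}} with an ack that is never set reproduces the observed symptom: a thread pinned in a spin loop that no topology event ever releases.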
*Additional notes*
* {{failoverSafe=true}} does not prevent the issue.
* Happens with both {{ShutdownPolicy.IMMEDIATE}} and {{GRACEFUL}}.
* Cluster uses Kubernetes headless service for discovery.
*Impact*
Critical. Causes indefinite hang and complete degradation of lock-related
operations.
*Source*
Reported on the Ignite user mailing list, 24 Feb 2026
https://lists.apache.org/thread/tyz91fskkt9klmpyn1jn249myvpzt8l0
--
This message was sent by Atlassian Jira
(v8.20.10#820010)