[
https://issues.apache.org/jira/browse/IGNITE-27962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061029#comment-18061029
]
Aleksandr Chesnokov commented on IGNITE-27962:
----------------------------------------------
Link from reporter
> IgniteLock may hang forever in busy-wait during node stop
> ---------------------------------------------------------
>
> Key: IGNITE-27962
> URL: https://issues.apache.org/jira/browse/IGNITE-27962
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.17
> Reporter: Aleksandr Chesnokov
> Priority: Major
> Attachments: Deadlock_Scenario_Filtered (1).txt
>
>
> *Summary*
> Ignite 2.17.0: Distributed reentrant lock may hang forever during Kubernetes
> rolling upgrade when {{failoverSafe=true}}.
> *Environment*
> * Apache Ignite 2.17.0
> * Kubernetes deployment
> * Backend pods = Ignite server nodes
> * Frontend pods = thick clients
> * Rolling upgrade (server nodes restarted one-by-one)
> *Problem*
> During a rolling upgrade, a thick client may hang indefinitely while calling:
> * {{lock()}}
> * {{unlock()}}
> * {{tryLock(timeout)}}
> This happens when a server node is stopped at a specific moment during lock
> acquisition.
> *Expected behavior*
> When a server node leaves the cluster, the client should either:
> * recover automatically, or
> * fail fast with an exception.
> {{tryLock(timeout)}} should respect the timeout.
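Until the expected behavior above is available, a caller can at least bound its own wait. The following is a minimal sketch of a client-side watchdog, not part of Ignite: `LockWatchdog`, `callUnderLock`, and `hardLimitMs` are hypothetical names, and the code relies only on the `java.util.concurrent.locks.Lock` interface, which `IgniteLock` implements. The lock, the work, and the unlock all run on one worker thread so that lock ownership stays with that thread. The caller gives up after the hard limit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Lock;

/** Hypothetical helper for illustration; not an Ignite API. */
public final class LockWatchdog {
    private LockWatchdog() {}

    /**
     * Runs lock/work/unlock on a single worker thread and fails the caller
     * with TimeoutException if the whole sequence exceeds hardLimitMs.
     */
    public static <T> T callUnderLock(Lock lock, Callable<T> work, long hardLimitMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "lock-watchdog-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> result = worker.submit(() -> {
            lock.lock();
            try {
                return work.call();
            } finally {
                lock.unlock();
            }
        });
        try {
            return result.get(hardLimitMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker, which helps
            // only if the blocked call reacts to interruption.
            result.cancel(true);
            throw e;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

This only lets the calling thread fail fast. If the worker is stuck in a loop that ignores interruption, that thread stays stuck and the lock itself is not repaired, so the sketch is a way to detect the hang rather than a fix for it.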
> *Actual behavior*
> The client thread enters a busy-wait loop inside {{GridCacheLockImpl}} and
> never resumes.
> Ignite does not recover from this state.
> All other threads trying to acquire the same lock also become blocked,
> leading to full system degradation. The client must be restarted.
> *Suspected root cause*
> Lock acquisition flow:
> # Node A calls {{ignite.reentrantLock(..., failoverSafe=true).lock()}}.
> # Node A commits a pessimistic transaction to acquire the lock.
> # Node A enters a busy-wait loop waiting for an “ack” message.
> # The ack is delivered via a continuous query update (observed as
> {{TOPIC_CONTINUOUS}}).
> # If Node B (responsible for sending this update) is stopped via
> {{Ignition.stop(..., cancel=true)}} before sending the message, the ack is
> never emitted.
> # Node A remains in the busy-wait loop forever.
> The same issue may occur during {{unlock()}}.
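The wait described in the steps above can be modelled in a few lines. This is an illustrative sketch only and not the {{GridCacheLockImpl}} source: `AckWaitModel`, `Outcome`, `ackArrived`, and `senderAlive` are hypothetical names. It shows a wait loop that exits on any of three conditions (ack received, sender gone, deadline passed), whereas the loop described in the report appears to exit only on the first.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Illustrative model of an ack wait; not Ignite code. */
public final class AckWaitModel {
    private AckWaitModel() {}

    public enum Outcome { ACKED, SENDER_LEFT, TIMED_OUT }

    /**
     * Spins until the ack arrives, the sending node is reported gone,
     * or timeoutMs elapses, whichever happens first.
     */
    public static Outcome awaitAck(BooleanSupplier ackArrived,
                                   BooleanSupplier senderAlive,
                                   long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (ackArrived.getAsBoolean()) {
                return Outcome.ACKED;
            }
            if (!senderAlive.getAsBoolean()) {
                // The node that owed us the ack left: stop waiting for it.
                return Outcome.SENDER_LEFT;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Outcome.TIMED_OUT;
            }
            Thread.yield(); // let other threads run between polls
        }
    }
}
```

With only the first check in place, a sender that stops before emitting the ack leaves the loop with no exit, which matches the hang described in step 6.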
> *Additional notes*
> * {{failoverSafe=true}} does not prevent the issue.
> * Happens with both {{ShutdownPolicy.IMMEDIATE}} and {{GRACEFUL}}.
> * Cluster uses Kubernetes headless service for discovery.
> *Impact*
> Critical. Causes indefinite hang and complete degradation of lock-related
> operations.
> *Source*
> Reported on the Ignite user mailing list, 24 Feb 2026
> https://lists.apache.org/thread/tyz91fskkt9klmpyn1jn249myvpzt8l0