Lokesh Khurana created PHOENIX-7870:
---------------------------------------

             Summary: GetClusterRoleRecordUtil: per-HA-group poller futures + url1/url2 alternation
                 Key: PHOENIX-7870
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7870
             Project: Phoenix
          Issue Type: Sub-task
            Reporter: Lokesh Khurana
            Assignee: Lokesh Khurana


GetClusterRoleRecordUtil has two related bugs in its non-active poller logic.

  Bug 1: Cross-group cancel collision via shared static pollerFuture

  The class declares a single static volatile ScheduledFuture<?> pollerFuture
  field that every call to schedulePoller overwrites, regardless of the HA
  group name. The companion schedulerMap is correctly keyed by HA group name,
  but the future itself is not. When two HA groups poll concurrently, the
  second group's schedulePoller overwrites pollerFuture with its own future.
  When the first group later reaches its cancel-on-active branch, it calls
  pollerFuture.cancel(false) and cancels the second group's future instead of
  its own. The first group's poller is left orphaned: still running on the
  scheduler, but no longer tracked, so it can never be cancelled cleanly. The
  group whose poller was cancelled by mistake stops refreshing its CRR
  (cluster role record) cache, and the client keeps routing to the last-known
  active cluster even after the operator promotes a new active.
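
  One possible shape for a fix is sketched below: track one future per HA
  group name instead of a single shared field. This is an illustration only,
  not the actual GetClusterRoleRecordUtil code. The class PerGroupPollers and
  its methods cancelPoller and isPolling are hypothetical names, and the
  sketch assumes a single shared scheduler rather than the per-group
  schedulerMap described above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; names and structure are assumptions, not Phoenix code. */
final class PerGroupPollers {
    // One future per HA group name, replacing a single shared static field.
    private final ConcurrentMap<String, ScheduledFuture<?>> pollerFutures =
        new ConcurrentHashMap<>();
    // Assumption: one shared scheduler for all groups, to keep the sketch small.
    private final ScheduledExecutorService scheduler;

    PerGroupPollers(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    /** Starts (or restarts) the poller for one HA group without touching other groups. */
    void schedulePoller(String haGroupName, Runnable poll, long periodMs) {
        pollerFutures.compute(haGroupName, (name, existing) -> {
            if (existing != null) {
                existing.cancel(false); // replace only this group's previous poller
            }
            return scheduler.scheduleWithFixedDelay(poll, 0, periodMs, TimeUnit.MILLISECONDS);
        });
    }

    /** Cancels only the named group's poller; returns false if none was tracked. */
    boolean cancelPoller(String haGroupName) {
        ScheduledFuture<?> future = pollerFutures.remove(haGroupName);
        return future != null && future.cancel(false);
    }

    /** True while the named group has a tracked, non-cancelled poller. */
    boolean isPolling(String haGroupName) {
        ScheduledFuture<?> future = pollerFutures.get(haGroupName);
        return future != null && !future.isCancelled();
    }
}
```

  With this layout, the cancel-on-active branch for one group would call
  cancelPoller with its own group name, so a concurrent poller for another
  group is neither cancelled nor orphaned.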

  Bug 2: Poller pins to a single URL with no alternation or peer fallback

  schedulePoller accepts a single url parameter and the polling lambda closes
  over it. Every tick calls getClusterRoleRecord(url, ...) against the same
  URL. There is no alternation between url1 and url2, and no fallback to the
  other URL on SQLException. If the cluster behind the bound URL becomes
  unreachable after the poller starts, every tick throws and the poller never
  recovers: no peer-side check happens, even when the peer cluster is healthy
  and would correctly report the new role.
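
  A minimal sketch of the alternation and fallback described in the summary
  follows. It is a hedged illustration, not the actual patch. AlternatingPoller,
  RecordFetcher and pollOnce are hypothetical names, and RecordFetcher stands
  in for getClusterRoleRecord(url, ...), whose full signature is not shown in
  this issue.

```java
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: alternate url1/url2 per tick, fall back to the peer on SQLException. */
final class AlternatingPoller<T> {
    /** Stand-in for getClusterRoleRecord(url, ...); an assumption for this sketch. */
    interface RecordFetcher<T> {
        T fetch(String url) throws SQLException;
    }

    private final String url1;
    private final String url2;
    private final RecordFetcher<T> fetcher;
    private final AtomicLong tick = new AtomicLong();

    AlternatingPoller(String url1, String url2, RecordFetcher<T> fetcher) {
        this.url1 = url1;
        this.url2 = url2;
        this.fetcher = fetcher;
    }

    /** One tick: try the URL whose turn it is, then the peer if the first attempt throws. */
    T pollOnce() throws SQLException {
        boolean url1First = tick.getAndIncrement() % 2 == 0;
        String first = url1First ? url1 : url2;
        String peer = url1First ? url2 : url1;
        try {
            return fetcher.fetch(first);
        } catch (SQLException firstFailure) {
            try {
                return fetcher.fetch(peer); // peer-side check within the same tick
            } catch (SQLException peerFailure) {
                peerFailure.addSuppressed(firstFailure); // keep both causes for diagnosis
                throw peerFailure;
            }
        }
    }
}
```

  Under these assumptions, an unreachable cluster behind one URL costs at most
  one failed attempt per tick, and the poller still sees the role reported by
  the healthy peer.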



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
