LZD-PratyushBhatt opened a new issue, #3063:
URL: https://github.com/apache/helix/issues/3063

   ### Describe the bug
   Helix participants can enter a "zombie" state where they appear healthy in 
ZooKeeper but are functionally disconnected from cluster operations, unable to 
process state transition messages. This occurs due to a critical flaw in the 
legacy compatibility wrapper IZkStateListenerI0ItecImpl that completely ignores 
session ID parameters when processing ZooKeeper session events.
   When the single-threaded ZkEventThread becomes blocked (e.g., during 
long-running FINALIZE operations), multiple SyncConnected events accumulate in 
the event queue. Once processing resumes, these events are handled in FIFO 
order, but the legacy wrapper discards the original session ID from each queued 
event and falls back to using the current active session ID via getSessionId(). 
This causes session events originally queued for older sessions to be processed 
using the context of the most recent session, violating the intended event 
processing timeline.
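
   The session ID loss described above can be sketched with a small, self-contained simulation. All names here (`SessionIdLossSketch`, `wrapLegacy`, `replayBacklog`) are hypothetical stand-ins and not Helix classes; the sketch only assumes what the description states, namely that the wrapper drops the queued session ID and the legacy callback reads the currently active one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-ins for illustration only; these are not Helix classes.
class SessionIdLossSketch {
    // Session-aware callback: receives the session ID the event was queued for.
    interface SessionAwareListener {
        void handleNewSession(String sessionId);
    }

    // Legacy callback: has no session ID parameter at all.
    interface LegacyListener {
        void handleNewSession();
    }

    // Mimics the wrapper described in the issue: the sessionId argument is dropped.
    static SessionAwareListener wrapLegacy(LegacyListener legacy) {
        return sessionId -> legacy.handleNewSession();
    }

    // Queues one event per session while the event thread is "blocked", then
    // drains the backlog in FIFO order once the last session is already active.
    // Returns the session ID each event was actually processed with.
    static List<String> replayBacklog(List<String> sessions) {
        List<String> seen = new ArrayList<>();
        String[] current = new String[1]; // stands in for getSessionId()
        LegacyListener legacy = () -> seen.add(current[0]);
        SessionAwareListener wrapped = wrapLegacy(legacy);

        Queue<String> backlog = new ArrayDeque<>();
        for (String s : sessions) {
            backlog.add(s);  // event queued for session s
            current[0] = s;  // connection moves on to this session
        }
        while (!backlog.isEmpty()) {
            wrapped.handleNewSession(backlog.poll());
        }
        return seen;
    }
}
```

   In this sketch, a backlog queued for three different sessions is processed three times with the last session's ID, which is the timeline violation the issue describes.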
   The result is catastrophic: the first misprocessed event successfully 
creates a LiveInstance for the current session, but all subsequent events in 
the backlog attempt to create LiveInstances for the same current session and 
fail with "already has a live-instance" exceptions. Each failed attempt 
partially resets the participant's message handlers (setting _ready = false) 
but never completes the re-initialization process, leaving the participant in a 
broken state where it cannot process any state transition messages despite 
appearing active to the Helix controller.
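
   The cascade above can be modelled in a few lines. This is a simplified sketch, not the Helix implementation: `ZombieParticipantSketch` and its members are hypothetical, and the assumption that the handlers are reset before the LiveInstance is created is taken from the description rather than from the code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the failure cascade; names are illustrative, not Helix APIs.
class ZombieParticipantSketch {
    private final Set<String> liveInstanceSessions = new HashSet<>();
    private boolean ready = false; // stands in for the handlers' _ready flag
    private int failures = 0;

    // Models handling one session event with whatever session ID is passed in.
    void handleNewSession(String sessionId) {
        ready = false; // handlers are reset first (assumed ordering)
        if (!liveInstanceSessions.add(sessionId)) {
            // Models the "already has a live-instance" failure:
            // re-initialization is abandoned and ready stays false.
            failures++;
            return;
        }
        ready = true; // re-initialization completed
    }

    boolean isReady() { return ready; }

    int getFailures() { return failures; }

    boolean hasLiveInstance(String sessionId) {
        return liveInstanceSessions.contains(sessionId);
    }

    // Replays a backlog of n events that all end up using the current session ID.
    static ZombieParticipantSketch replay(int n, String currentSession) {
        ZombieParticipantSketch p = new ZombieParticipantSketch();
        for (int i = 0; i < n; i++) {
            p.handleNewSession(currentSession);
        }
        return p;
    }
}
```

   With a backlog of more than one event, the model ends with a LiveInstance present for the current session but `isReady()` returning false, which corresponds to the "zombie" state: visible in ZooKeeper, unable to handle messages.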
   
   ### To Reproduce
   Create a DedicatedZkClient and pass any fake session ID to handleNewSession. 
The wrapper discards that session ID, uses the session ID of the currently 
active ZooKeeper connection instead, and does all processing based on it. 
Trigger another handleNewSession event with a fake session ID: this time it 
fails to create the LiveInstance, and the handlers are left in the reset state.
   
   ### Expected behavior
   Session events should be processed with their original session context 
regardless of timing delays. The handleNewSession() method should receive and 
use the correct session ID that was active when the event was originally 
queued, ensuring proper FIFO event processing and maintaining participant 
message handling capabilities.
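
   As a rough illustration of the expected behavior, the sketch below keeps each queued event's own session ID and compares it with the current session. The class and method names are hypothetical, and the policy of skipping events whose session is no longer current is an assumption made for the example; the issue itself only asks that the original session ID reach handleNewSession().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of session-aware processing; not the Helix fix itself.
class SessionAwareSketch {
    // Processes queued events in FIFO order, each with its original session ID.
    // Returns the session IDs that were actually acted on.
    static List<String> process(List<String> queuedSessionIds, String currentSession) {
        List<String> handled = new ArrayList<>();
        for (String queued : queuedSessionIds) {
            // Assumption: an event for a session that is no longer current is stale
            // and can be skipped instead of being replayed against the new session.
            if (!queued.equals(currentSession)) {
                continue;
            }
            handled.add(queued);
        }
        return handled;
    }
}
```

   Under this assumption, a backlog queued for sessions s1, s2 and s3 results in a single LiveInstance creation attempt for s3 once s3 is the active session, so no duplicate attempts occur.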
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

