skishtapuram-loyaltymethods opened a new issue, #3049:
URL: https://github.com/apache/helix/issues/3049

   ### Problem
   
   We are using Apache Helix to distribute a resource across multiple services, 
all registered under the same Helix cluster name. The system is deployed 
on AWS ECS Fargate using Docker containers.
   
   Each service uses the OnlineOffline state model, with 1 replica per 
partition and the FULL_AUTO rebalance mode.
   
   When all services start, Helix assigns partitions as expected and all 
services consume their assigned partitions. The problem occurs when a service 
is restarted due to an AWS Spot termination, which appears to forcefully kill 
the container, bypassing the logic we have to handle SIGTERM.
   
   When a service is restarted this way, the partitions previously assigned 
to the now-stopped node are not reassigned to the remaining live nodes. As a 
result, those partitions remain idle, even though other nodes are available and 
registered as live instances.
   
   I have tried both Apache Helix 0.9.9 and the latest 1.4.3, and the problem 
persists in both versions. Also, when a node is removed from the cluster, 
there are no logs from the controller instance indicating that it is 
attempting to rebalance the partitions.
   
   #### Additional Context After ZooKeeper Inspection
   * /LIVEINSTANCES correctly lists only the actively running nodes.
   * /EXTERNALVIEW, however, still shows the stopped node as ONLINE for the 
affected partition, and the partition is not reassigned to a live node.
   * This mismatch persists even after new nodes are registered as live 
instances.
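
   The mismatch described above can be expressed as a small check. This is only 
a sketch, assuming the contents of /LIVEINSTANCES and /EXTERNALVIEW have already 
been read into plain Java collections. The class name, node names, partition 
names, and snapshot data are hypothetical, and no Helix or ZooKeeper API is used:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Helix.
final class StaleAssignmentCheck {

    /**
     * Returns, per partition, the instances that the external view reports as
     * ONLINE although they are absent from the live-instance set.
     *
     * @param liveInstances names found under /LIVEINSTANCES
     * @param externalView  partition -> (instance -> state), as seen in /EXTERNALVIEW
     */
    static Map<String, List<String>> findStale(
            Set<String> liveInstances,
            Map<String, Map<String, String>> externalView) {
        Map<String, List<String>> stale = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> partition : externalView.entrySet()) {
            for (Map.Entry<String, String> replica : partition.getValue().entrySet()) {
                boolean online = "ONLINE".equals(replica.getValue());
                if (online && !liveInstances.contains(replica.getKey())) {
                    stale.computeIfAbsent(partition.getKey(), k -> new ArrayList<>())
                         .add(replica.getKey());
                }
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Made-up snapshot: node-b was killed but is still ONLINE in the external view.
        Set<String> live = Set.of("node-a", "node-c");
        Map<String, Map<String, String>> view = Map.of(
                "res_0", Map.of("node-a", "ONLINE"),
                "res_1", Map.of("node-b", "ONLINE"));
        System.out.println(findStale(live, view)); // prints {res_1=[node-b]}
    }
}
```

   In our case this check would keep returning a non-empty result for the 
affected partition, even after new nodes appear in the live-instance list.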
   
   ### Expected behavior
   Helix should detect that the node is no longer live and automatically 
reassign its partitions to other available nodes in the cluster.
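
   To make the expectation concrete, here is a rough sketch of the outcome I 
would expect with 1 replica per partition. It is not Helix's rebalancer 
algorithm; the least-loaded placement rule, the class name, and the data are 
assumptions made purely for illustration:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of the expected result, not Helix code.
final class ExpectedReassignment {

    /**
     * Moves every partition owned by a non-live node to the live node that
     * currently holds the fewest partitions (ties go to the first name in
     * sorted order). Partitions on live nodes are left untouched.
     */
    static Map<String, String> reassign(Map<String, String> owners, Set<String> liveNodes) {
        if (liveNodes.isEmpty()) {
            throw new IllegalArgumentException("no live nodes to assign to");
        }
        Map<String, String> result = new TreeMap<>(owners);

        // Count how many partitions each live node already holds.
        Map<String, Integer> load = new TreeMap<>();
        for (String node : liveNodes) {
            load.put(node, 0);
        }
        for (String owner : owners.values()) {
            load.computeIfPresent(owner, (node, count) -> count + 1);
        }

        for (Map.Entry<String, String> entry : result.entrySet()) {
            if (!liveNodes.contains(entry.getValue())) {
                String target = load.entrySet().stream()
                        .min(Map.Entry.comparingByValue())
                        .get()
                        .getKey();
                entry.setValue(target);
                load.merge(target, 1, Integer::sum);
            }
        }
        return result;
    }
}
```

   For example, with owners `{p0=node-a, p1=node-a, p2=node-b}` and live nodes 
`node-a` and `node-c`, partition `p2` would move from the stopped `node-b` to 
`node-c`, while `p0` and `p1` stay where they are.
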
   ### Additional context
   
   I have been able to work around the issue temporarily by manually 
restarting all services registered under that Helix cluster. After the manual 
restart, all partitions are reassigned correctly.
   
   I will be active here and will respond to replies as soon as possible. 
I am happy to provide any additional information or logs that would help 
with debugging this issue.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@helix.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


