skishtapuram-loyaltymethods opened a new issue, #3049: URL: https://github.com/apache/helix/issues/3049
### Problem

We are using Apache Helix to distribute a resource across multiple services, all registered under the same Helix cluster name. The system is deployed on AWS ECS Fargate using Docker containers. Each service uses the OnlineOffline state model, with 1 replica per partition and the FULL_AUTO rebalance mode.

When all services start, Helix assigns partitions as expected and every service consumes its assigned partitions. The problem occurs when a service is restarted due to an AWS Spot termination, which appears to forcefully kill the container and bypass the logic we have for handling SIGTERM. After such a restart, the partitions previously assigned to the now-stopped node are not reassigned to the live nodes. As a result, those partitions remain idle, even though other nodes are available and registered as live instances.

I have tried both Apache Helix 0.9.9 and the latest 1.4.3, and the problem persists in both versions. Also, when a node is removed from the cluster, there are no logs from the controller instance indicating that it is attempting to rebalance the partitions.

#### Additional Context After ZooKeeper Inspection

* /LIVEINSTANCES correctly lists only the actively running nodes.
* /EXTERNALVIEW still shows the stopped node as ONLINE for the affected partition, and the partition is not reassigned to a live node.
* This mismatch persists even after new nodes are registered as live instances.

### Expected behavior

Helix should detect that the node is no longer live and automatically reassign its partitions to other available nodes in the cluster.

### Additional context

I have been able to temporarily resolve the issue by manually restarting all services registered under that Helix cluster. After the manual restart, all partitions are correctly reassigned.
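
To make the mismatch between /LIVEINSTANCES and /EXTERNALVIEW concrete, here is a minimal sketch of the check I am doing by hand. It uses plain Java collections only, not the Helix API, and assumes the live instance names and the external view (partition to instance to state) have already been read from ZooKeeper. The class name, method name, node names and partition names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Helix: flags partitions whose ONLINE
// replica points at an instance that is no longer a live instance.
final class StaleAssignmentCheck {

    // liveInstances: instance names found under /LIVEINSTANCES.
    // externalView:  partition -> (instance -> state), as shown in /EXTERNALVIEW.
    // Returns partition -> list of instances that are ONLINE but not live.
    static Map<String, List<String>> findStaleAssignments(
            Set<String> liveInstances,
            Map<String, Map<String, String>> externalView) {
        Map<String, List<String>> stale = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> partition : externalView.entrySet()) {
            for (Map.Entry<String, String> replica : partition.getValue().entrySet()) {
                boolean online = "ONLINE".equals(replica.getValue());
                if (online && !liveInstances.contains(replica.getKey())) {
                    stale.computeIfAbsent(partition.getKey(), k -> new ArrayList<>())
                         .add(replica.getKey());
                }
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Made-up data: node-b was killed, node-c joined afterwards.
        Set<String> live = Set.of("node-a", "node-c");
        Map<String, Map<String, String>> view = Map.of(
                "resource_0", Map.of("node-a", "ONLINE"),
                "resource_1", Map.of("node-b", "ONLINE"));
        System.out.println(findStaleAssignments(live, view)); // prints {resource_1=[node-b]}
    }
}
```

In our case this kind of check keeps returning a non-empty result for the affected partition until all services are restarted manually.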
I will be active here and will respond to all your replies as soon as possible, and I am happy to provide any additional information or logs that would help with debugging this issue.