supratimdeka opened a new pull request #852: HDDS-1454. GC other system pause events can trigger pipeline destroy for all the nodes in the cluster. Contributed by Supratim Deka
URL: https://github.com/apache/hadoop/pull/852

https://issues.apache.org/jira/browse/HDDS-1454

Problem: In a MiniOzoneChaosCluster run, it was observed that GC pauses, or any other pauses in SCM, can cause SCM to mark all the datanodes as stale. This triggers multiple pipeline destroys and renders the system unusable.

Solution: Added a timestamp check in NodeStateManager. If the heartbeat task detects a long scheduling delay since the last time it ran, it skips the health checks and node state transitions for the current iteration.

Test: The unit test simulates a JVM pause by pausing the iterations of the health check task. Once the health check task is "unpaused", the system is in a condition similar to the aftermath of a JVM pause. The test asserts that no node that is sending heartbeats transitions to Stale or Dead after such a long scheduling delay.
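
For readers who want a concrete picture of the timestamp check, here is a minimal sketch in Java. It is not the code from the pull request: the class PauseAwareHealthCheck, its fields, and the thresholds are hypothetical names chosen for illustration, and the clock is passed in as a parameter so that a pause can be simulated deterministically.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a pause-aware health check; not the actual NodeStateManager. */
class PauseAwareHealthCheck {
    enum State { HEALTHY, STALE }

    private final long intervalMs;    // expected gap between two runs of the task
    private final long staleAfterMs;  // heartbeat age beyond which a node is stale
    private final long maxDelayMs;    // tolerated scheduling delay (assumed threshold)
    private long lastRunMs;           // timestamp of the previous run
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();
    private final Map<String, State> states = new HashMap<>();

    PauseAwareHealthCheck(long intervalMs, long staleAfterMs, long maxDelayMs, long startMs) {
        this.intervalMs = intervalMs;
        this.staleAfterMs = staleAfterMs;
        this.maxDelayMs = maxDelayMs;
        this.lastRunMs = startMs;
    }

    /** Records a heartbeat from a node at the given time. */
    void heartbeat(String node, long nowMs) {
        lastHeartbeatMs.put(node, nowMs);
        states.putIfAbsent(node, State.HEALTHY);
    }

    State stateOf(String node) {
        return states.get(node);
    }

    /**
     * Runs one iteration of the health check.
     * Returns false if the iteration was skipped because the task itself
     * ran much later than scheduled (for example after a GC pause).
     */
    boolean runOnce(long nowMs) {
        long gapMs = nowMs - lastRunMs;
        lastRunMs = nowMs;
        if (gapMs > intervalMs + maxDelayMs) {
            // The checker was paused, so heartbeat ages are unreliable:
            // skip all state transitions in this iteration.
            return false;
        }
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            long ageMs = nowMs - e.getValue();
            states.put(e.getKey(), ageMs > staleAfterMs ? State.STALE : State.HEALTHY);
        }
        return true;
    }
}
```

In this sketch a skipped iteration still updates the last-run timestamp, so the following iteration is evaluated normally. That gives nodes one regular interval to deliver the heartbeats that queued up during the pause. Whether the actual patch behaves the same way is not stated in this message.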
