supratimdeka opened a new pull request #852: HDDS-1454. GC other system pause 
events can trigger pipeline destroy for all the nodes in the cluster. 
Contributed by Supratim Deka
URL: https://github.com/apache/hadoop/pull/852
 
 
   https://issues.apache.org/jira/browse/HDDS-1454
   
   Problem:
   In a MiniOzoneChaosCluster run, it was observed that a GC pause, or any other 
pause in SCM, can cause SCM to mark all the datanodes as stale. This triggers 
the destruction of multiple pipelines and renders the system unusable.
   
   Solution:
   Added a timestamp check in NodeStateManager. If the heartbeat task detects a 
long scheduling delay since the last time it ran, then the task skips doing 
health checks and node state transitions in the current iteration.
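
   The check described above can be sketched roughly as follows. This is a 
minimal illustration, not the code from this pull request: the class 
PauseAwareHealthCheck, its fields, and the thresholds are hypothetical, and 
time is passed in explicitly (milliseconds since the task was created) so that 
a pause is easy to simulate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and thresholds are assumptions, not NodeStateManager's API.
class PauseAwareHealthCheck {
    enum State { HEALTHY, STALE }

    private final long intervalMs;    // expected gap between two runs of the task
    private final long staleAfterMs;  // heartbeat age beyond which a node is stale
    private final long maxDelayMs;    // assumed tolerance for scheduling delay
    private long lastRunMs = 0;       // time of the previous run (starts at 0)
    private int skippedRuns = 0;
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();
    private final Map<String, State> states = new HashMap<>();

    PauseAwareHealthCheck(long intervalMs, long staleAfterMs, long maxDelayMs) {
        this.intervalMs = intervalMs;
        this.staleAfterMs = staleAfterMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Record a heartbeat received from a node at time nowMs.
    void heartbeat(String node, long nowMs) {
        lastHeartbeatMs.put(node, nowMs);
        states.putIfAbsent(node, State.HEALTHY);
    }

    // One iteration of the health check task.
    void runHealthCheck(long nowMs) {
        long gap = nowMs - lastRunMs;
        lastRunMs = nowMs;
        // A gap much longer than the schedule interval suggests this process
        // was paused, so heartbeat ages are unreliable: skip this iteration.
        if (gap > intervalMs + maxDelayMs) {
            skippedRuns++;
            return;
        }
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            boolean stale = nowMs - e.getValue() > staleAfterMs;
            states.put(e.getKey(), stale ? State.STALE : State.HEALTHY);
        }
    }

    State stateOf(String node) { return states.get(node); }

    int skippedRuns() { return skippedRuns; }
}
```

   Because the timestamp of the previous run is updated even when an iteration 
is skipped, the next on-time iteration evaluates the nodes normally, which 
gives them one interval to deliver fresh heartbeats after the pause.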
   
   Test:
   The unit test simulates a JVM pause by pausing the iterations of the health 
check task. Once the health check task is "unpaused", the system is in a 
condition similar to the one that follows a JVM pause. The test asserts that no 
node that is sending heartbeats transitions to Stale or Dead after such a long 
scheduling delay.
