[
https://issues.apache.org/jira/browse/HBASE-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703768#comment-13703768
]
Enis Soztutar commented on HBASE-8897:
--------------------------------------
Patch looks good overall.
- My only concern is that if the server is suspended, and somehow CM fails to
resume it, or the user aborts CM, the regionserver will stay suspended until
the process is manually killed by the user. Not sure what the best way to fix
this would be. Maybe send a resume signal in restoreClusterStatus() to every RS?
- In resumeRS() you resume, but wait for ONE_MIN instead of TIMEOUT. Is there
a reason? Restarting after resume should be fine.
- The resulting behavior after suspending for 1 min will depend on the zk
timeout. Should we do two different actions, one suspending for zkTimeout/2 and
the other for zkTimeout + 5sec? These will test GC pauses and zk timeouts
separately.
- Changes for waitForRegionServerToStart() look good. I guess they are needed
to make sure that a suspended RS is not thought to be alive.
- Can bring this to 0.94 as well.
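
To make the two suggestions above concrete, here is a rough sketch of the two
suspend durations and of a best-effort resume pass that could run during
cleanup. It is not taken from the patch: SuspendSketch, Resumer, and resumeAll
are hypothetical names, and the real code would go through whatever the
cluster manager already exposes.

```java
import java.util.ArrayList;
import java.util.List;

public class SuspendSketch {
    /** Hypothetical stand-in for whatever sends the resume signal; not the HBase API. */
    interface Resumer {
        void resume(String server) throws Exception;
    }

    /** Pause shorter than the zk session timeout: should behave like a long GC pause. */
    static long gcPauseMillis(long zkTimeoutMillis) {
        return zkTimeoutMillis / 2;
    }

    /** Pause longer than the zk session timeout: the session should expire. */
    static long sessionExpiryMillis(long zkTimeoutMillis) {
        return zkTimeoutMillis + 5_000L;
    }

    /**
     * Best-effort resume of every server, so that none stays suspended if the
     * monkey failed or was aborted. Returns the servers whose resume failed.
     */
    static List<String> resumeAll(List<String> servers, Resumer resumer) {
        List<String> failed = new ArrayList<>();
        for (String server : servers) {
            try {
                resumer.resume(server);
            } catch (Exception e) {
                // Keep going: one unreachable host should not block the rest.
                failed.add(server);
            }
        }
        return failed;
    }
}
```

Resuming a process that is not suspended should be harmless, which is why the
pass does not try to track which servers were actually suspended.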
> Add Suspend and Resume to Chaos Monkey
> --------------------------------------
>
> Key: HBASE-8897
> URL: https://issues.apache.org/jira/browse/HBASE-8897
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 0.98.0, 0.95.1
> Reporter: Elliott Clark
> Assignee: Elliott Clark
> Attachments: HBASE-8897-0.patch
>
>
> We should add suspend and resume to chaos monkey. They are already exposed
> through HBaseClusterManager.