[
https://issues.apache.org/jira/browse/HBASE-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530313#comment-13530313
]
nkeywal commented on HBASE-7327:
--------------------------------
I have some doubts about TestMasterFailover.
On a master failover, the code looks at what is in ZK for each region. If the
regionserver is down, it forces a reassign; if not, it puts the region in the
RIT (regions in transition) list.
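To make the decision above concrete, here is a minimal sketch in Java. It is
not the actual master code: the class FailoverSketch, the Action enum and the
decide() method are hypothetical names, and the set of live servers is assumed
to be already known to the new master.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch only; these names do not exist in HBase.
final class FailoverSketch {
    enum Action { FORCE_REASSIGN, ADD_TO_RIT }

    // Decide what the new master does with a region whose state it found
    // in ZK. hostingServer is the regionserver recorded for that region.
    static Action decide(String hostingServer, Set<String> liveServers) {
        if (liveServers.contains(hostingServer)) {
            // Server is up: only track the region as "in transition".
            return Action.ADD_TO_RIT;
        }
        // Server is down: nobody will finish the transition, so reassign.
        return Action.FORCE_REASSIGN;
    }

    public static void main(String[] args) {
        Set<String> live = Collections.singleton("rs1");
        System.out.println(decide("rs1", live)); // region on a live server
        System.out.println(decide("rs2", live)); // region on a dead server
    }
}
```

In the ADD_TO_RIT branch nothing actively drives the region forward, which is
why the timeout ends up doing the work in the tests discussed below.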
Many tests in TestMasterFailover put a given state in ZK but keep the
regionserver up. In that case, it is actually the timeout that manages the
region status. The tests are fast only because the timeout is set to a few
seconds. But we should have tests with a real failover, covering the standard
cases, and they should be fast without setting the timeout to 2 seconds or so.
So this test:
- shows a specific usage of the timeout: acting as a garbage collector
when we put ourselves in an unexpected situation
- doesn't prove that we actually recover quickly on a master
failover, because the very short timeout hides the problem.
As an example, it seems that if the master fails just after creating an offline
znode (before contacting the region server), we need the timeout to recover the
region (i.e. 10 minutes). If confirmed (I will recheck tomorrow), it would be a
bug (not that simple to fix, actually), but we don't see it because of this
short timeout.
And so, I'm thinking about:
- refactoring the tests to express the cases that can occur during a master
failover (including a region server crash, but maybe that one already exists)
- keeping the timeout, but only as a safety net, without doing anything if the
region is assigned to a live region server. We may need extra cases here; I
need to study the code more.
- maybe adding extra code for when we identify a region that has been opening
for too long on a live server: calling the server to check its status, release
it or something similar. To be discussed :-)
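The last two points could be sketched as follows. This is only an illustration
of the proposal, not existing code: TimeoutSketch, TimeoutAction and
onTimeout() are hypothetical names, and reassigning when the server is dead is
my assumption about what the "safety net" case would do.

```java
// Hypothetical sketch only; these names do not exist in HBase.
final class TimeoutSketch {
    enum TimeoutAction { REASSIGN, NO_OP, CHECK_SERVER_STATUS }

    // Called when the assignment timeout expires for a region.
    // serverAlive: the regionserver holding the region is still up.
    // probeLiveServers: enables the optional "call the server" behaviour.
    static TimeoutAction onTimeout(boolean serverAlive, boolean probeLiveServers) {
        if (!serverAlive) {
            // Assumption: the timeout still acts when the server is gone.
            return TimeoutAction.REASSIGN;
        }
        // Live server: do nothing by default, or ask it for the region status.
        return probeLiveServers ? TimeoutAction.CHECK_SERVER_STATUS
                                : TimeoutAction.NO_OP;
    }

    public static void main(String[] args) {
        System.out.println(onTimeout(false, false)); // dead server
        System.out.println(onTimeout(true, false));  // live server, no probe
        System.out.println(onTimeout(true, true));   // live server, with probe
    }
}
```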
> Assignment Timeouts: Remove the code from the master
> ----------------------------------------------------
>
> Key: HBASE-7327
> URL: https://issues.apache.org/jira/browse/HBASE-7327
> Project: HBase
> Issue Type: Improvement
> Components: master
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
> Attachments: 7327.v1.uncomplete.patch
>
>
> As per HBASE-7247...