[ 
https://issues.apache.org/jira/browse/HBASE-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530313#comment-13530313
 ] 

nkeywal commented on HBASE-7327:
--------------------------------

I have some doubts about TestMasterFailover.
On a master failover, the code looks at what is in ZK and, if the regionserver 
is down, forces a reassign; otherwise, it puts the region in the RIT 
(regions-in-transition) list.
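
To make the decision described above concrete, here is a minimal sketch. It is 
not the actual master code: the class, enum and method names are hypothetical, 
and region and server names are plain strings for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the failover decision; none of these names are HBase APIs.
final class FailoverSketch {
    enum Action { FORCE_REASSIGN, ADD_TO_RIT }

    // Decide what a new master does with one region it finds in ZK:
    // a dead hosting server means a forced reassign, a live one means
    // the region is only tracked in the RIT list.
    static Action decide(String hostingServer, Set<String> liveServers) {
        return liveServers.contains(hostingServer)
            ? Action.ADD_TO_RIT
            : Action.FORCE_REASSIGN;
    }

    // Apply the decision to every region -> server entry read from ZK.
    static Map<String, Action> processZkRegions(Map<String, String> regionToServer,
                                                Set<String> liveServers) {
        Map<String, Action> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : regionToServer.entrySet()) {
            result.put(e.getKey(), decide(e.getValue(), liveServers));
        }
        return result;
    }
}
```

In this sketch, a region that lands in ADD_TO_RIT on a live server is then left 
to the timeout, which is the behaviour questioned below.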

Many tests in TestMasterFailover put a given state in ZK, but keep the 
regionserver up. This way, it's actually the timeout that is managing the 
region status. It's fast because the timeout is set to a few seconds. But we 
should have a test with a real failover, with standard cases, and they should 
be fast without setting a timeout to 2 seconds or so.

So:
- this test shows a specific usage of the timeout: acting as a garbage 
collector when we put ourselves in an unexpected situation
- it doesn't prove that we actually recover quickly when we have a master 
failover, because the very short timeout hides the problem.

As an example, it seems that if the master fails just after creating an offline 
znode (before contacting the region server), we need the timeout to recover the 
region (i.e. 10 minutes). If confirmed (I will recheck tomorrow), it would be a 
bug (not that simple to fix, actually), but we don't see it because of the 
short timeout used in the tests.


And so, I'm thinking about:
- refactoring the tests to express the situations that can occur during a 
master failover (including a region server crash, but maybe that exists already)
- keeping the timeout, but only as a safety net, without doing anything if the 
region is allocated to a live region server. Maybe we will need extra cases 
here; I need to study the code more.
- maybe adding extra code if we identify a region that has been opening for too 
long on a live server: calling the server to check the region's status, 
releasing the region, or something similar. To be discussed :-)
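
As a starting point for that discussion, the last two points could be sketched 
as follows. This is only an illustration under assumptions: the names are 
hypothetical, reassigning when the server is dead is my reading of "safety 
net", and probing the live server is the optional extra step, not existing 
behaviour.

```java
// Hypothetical sketch of the proposed timeout policy; not existing HBase code.
final class TimeoutPolicySketch {
    enum Action { NONE, ASK_SERVER_FOR_STATUS, REASSIGN }

    // Decide what to do with a region in transition.
    // serverAlive:        whether the server the region is allocated to is live
    // millisInTransition: how long the region has been in transition
    // timeoutMillis:      the assignment timeout
    // probeLiveServers:   whether the optional status check is enabled (assumption)
    static Action onCheck(boolean serverAlive, long millisInTransition,
                          long timeoutMillis, boolean probeLiveServers) {
        if (millisInTransition < timeoutMillis) {
            return Action.NONE; // timeout not reached yet
        }
        if (!serverAlive) {
            return Action.REASSIGN; // safety net: nobody will finish the open
        }
        // Live server: never reassign on timeout alone; at most ask for status.
        return probeLiveServers ? Action.ASK_SERVER_FOR_STATUS : Action.NONE;
    }
}
```

The point of the design is that the timeout alone never moves a region away 
from a live server, so tests can no longer rely on a 2-second timeout to pass.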

                
> Assignment Timeouts: Remove the code from the master
> ----------------------------------------------------
>
>                 Key: HBASE-7327
>                 URL: https://issues.apache.org/jira/browse/HBASE-7327
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 7327.v1.uncomplete.patch
>
>
> As per HBASE-7247...
