[ 
https://issues.apache.org/jira/browse/HBASE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546169#comment-13546169
 ] 

Jimmy Xiang commented on HBASE-7407:
------------------------------------

We have a DoNotRetryIOException.  Does this mean all other exceptions are 
retriable? Do we need a PleaseRetryException?
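
To make the question concrete, here is a minimal sketch of the policy it implies: treat every IOException as retriable unless it is a DoNotRetryIOException. The DoNotRetryIOException class below is a local stand-in so the snippet compiles on its own, and RetrySketch, isRetriable and callWithRetries are hypothetical names, not HBase APIs.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class RetrySketch {
    /** Local stand-in for HBase's DoNotRetryIOException (assumption: it is an IOException subtype). */
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String msg) { super(msg); }
    }

    /** Hypothetical policy: an IOException is retriable unless explicitly marked otherwise. */
    static boolean isRetriable(Exception e) {
        return e instanceof IOException && !(e instanceof DoNotRetryIOException);
    }

    /** Runs the call up to maxAttempts times, rethrowing at once on a non-retriable failure. */
    static <T> T callWithRetries(Callable<T> call, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!isRetriable(e)) {
                    throw e; // e.g. DoNotRetryIOException or a runtime exception
                }
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed with retriable exceptions
    }
}
```

Under this policy a separate PleaseRetryException would only be needed if the default were the opposite, that is, if exceptions were non-retriable unless marked.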

Why do we need another lock? The caller of processRegionsInTransition should 
already have the lock, right?

I thought about the issue again, and I am fine with the other changes. As for 
the assignment manager change, I think it is better to keep the original logic, 
which is much simpler. One enhancement we can make to the original logic is to 
time out those region transitions earlier, so that the timeout monitor can 
reassign them sooner if needed.
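
A rough sketch of what timing out a transition earlier could look like, assuming the monitor decides purely on elapsed time since the transition started. TimeoutSketch and its methods are hypothetical and do not mirror the actual AssignmentManager or timeout monitor code; the idea is only to backdate the start time so the next monitor pass picks the region up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TimeoutSketch {
    // region name -> time (ms) at which its transition started
    private final Map<String, Long> transitionStart = new ConcurrentHashMap<>();
    private final long timeoutMs;

    TimeoutSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void startTransition(String region, long nowMs) {
        transitionStart.put(region, nowMs);
    }

    /** Backdates the start time so the region counts as timed out at the next check. */
    void expireEarly(String region, long nowMs) {
        transitionStart.computeIfPresent(region, (r, started) -> nowMs - timeoutMs);
    }

    /** What a periodic monitor pass would collect for reassignment. */
    List<String> timedOut(long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : transitionStart.entrySet()) {
            if (nowMs - e.getValue() >= timeoutMs) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With this shape the recovery path only marks regions; the reassignment itself stays in the monitor, which keeps the original logic intact.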

One thing I need to point out is that this method is only called during 
failure recovery (when the master and some region servers died together).

Adding [~ram_krish]. Ram, can you take a look?

                
> TestMasterFailover under tests some cases and over tests some others
> --------------------------------------------------------------------
>
>                 Key: HBASE-7407
>                 URL: https://issues.apache.org/jira/browse/HBASE-7407
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment, test
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 7407.v1.patch, 7407.v2.patch, 7407.v3.patch
>
>
> The tests are done with these settings:
>     conf.setInt("hbase.master.assignment.timeoutmonitor.period", 2000);
>     conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 4000);
> As a result:
> 1) some tests seem to work, but in real life, the recovery would take 5 
> minutes or more, as in production they're always higher. So we don't see the 
> real issues.
> 2) The tests include specific cases that should not happen in production. It 
> works because the timeout catches everything, but these scenarios do not need 
> to be optimized, as they cannot happen. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
