[jira] [Commented] (HBASE-13526) TestRegionServerReportForDuty can be flaky: hang or timeout

Ted Yu (JIRA) Tue, 21 Apr 2015 16:54:07 -0700

    [ https://issues.apache.org/jira/browse/HBASE-13526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506050#comment-14506050 ]


Ted Yu commented on HBASE-13526:
--------------------------------

lgtm

> TestRegionServerReportForDuty can be flaky: hang or timeout
> -----------------------------------------------------------
>
>                 Key: HBASE-13526
>                 URL: https://issues.apache.org/jira/browse/HBASE-13526
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.0, 1.1.0, 0.98.12
>            Reporter: Jerry He
>            Assignee: Jerry He
>            Priority: Minor
>         Attachments: HBASE-13526.patch
>
>
> This test case is from HBASE-13317.
> The test uses a custom region server to simulate reportForDuty in a master 
> failover case.  This custom RS would start, then the primary master would 
> fail, then the custom RS would reportForDuty to the second master after 
> master failover.
> The test will occasionally hang or time out.
> The root cause is that during first master initialization, the master would 
> assign meta (and create and assign the namespace table). It is possible that 
> meta is assigned to the custom RS, which has started (placed an rs node in 
> ZK), but will not really check in and come online. Then the master will go 
> through multiple re-assignments, which can be lengthy and cause trouble.
> There are a couple of issues I see in the master assignment code:
> 1. Master puts all the region servers obtained from the ZK rs node into the 
> online server list, including those that have not checked in via RPC.  And we 
> will assign meta or other regions based on the whole list.
> 2. When one assign plan fails, we don't exclude the failed server when 
> picking the next destination, which may prolong the assignment process.
> I will provide a patch to fix the test case.  The other issues mentioned are 
> up for discussion.
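
The two assignment issues listed in the description can be illustrated with a
minimal sketch. This is not HBase's assignment code: the class
DestinationPicker, its methods, and the use of plain strings for server names
are all hypothetical, chosen only to show the idea of (1) considering only
servers that have checked in via RPC and (2) excluding servers where an earlier
assign plan already failed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper for illustration only; not part of HBase.
public final class DestinationPicker {
    private DestinationPicker() {}

    /**
     * Returns the servers eligible to host a region.
     *
     * @param zkServers servers that have an rs node in ZK
     * @param checkedIn servers that have also reported for duty via RPC
     * @param failed    servers where a previous assign plan already failed
     */
    public static List<String> eligible(List<String> zkServers,
                                        Set<String> checkedIn,
                                        Set<String> failed) {
        List<String> result = new ArrayList<>();
        for (String server : zkServers) {
            // Issue 1: skip servers that are in ZK but never checked in.
            // Issue 2: skip servers that already failed for this region.
            if (checkedIn.contains(server) && !failed.contains(server)) {
                result.add(server);
            }
        }
        return result;
    }

    /** Picks the first eligible server, or empty if none is available. */
    public static Optional<String> pick(List<String> zkServers,
                                        Set<String> checkedIn,
                                        Set<String> failed) {
        List<String> candidates = eligible(zkServers, checkedIn, failed);
        return candidates.isEmpty()
                ? Optional.empty()
                : Optional.of(candidates.get(0));
    }
}
```

With a ZK list of rs1, rs2 and the custom RS, where only rs1 and rs2 have
checked in and rs1 has already failed once, the sketch would pick rs2 and never
consider the custom RS.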



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
