[ https://issues.apache.org/jira/browse/HBASE-13526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506050#comment-14506050 ]
Ted Yu commented on HBASE-13526:
--------------------------------
lgtm
> TestRegionServerReportForDuty can be flaky: hang or timeout
> -----------------------------------------------------------
>
> Key: HBASE-13526
> URL: https://issues.apache.org/jira/browse/HBASE-13526
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 2.0.0, 1.1.0, 0.98.12
> Reporter: Jerry He
> Assignee: Jerry He
> Priority: Minor
> Attachments: HBASE-13526.patch
>
>
> This test case is from HBASE-13317.
> The test uses a custom region server to simulate reportForDuty in a master
> failover case. This custom RS would start, then the primary master would
> fail, then the custom RS would reportForDuty to the second master after
> master failover.
> The test occasionally hangs or times out.
> The root cause is that during the first master's initialization, the master
> assigns meta (and creates and assigns the namespace table). Meta may be
> assigned to the custom RS, which has started (it has placed an rs node in
> ZK) but will not actually check in and come online. The master then goes
> through multiple re-assignments, which can be lengthy and cause trouble.
> There are a couple of issues I see in the master assignment code:
> 1. The master puts all the region servers obtained from the ZK rs node into
> the online server list, including those that have not checked in via RPC. We
> then assign meta or other regions based on the whole list.
> 2. When one assignment plan fails, we don't exclude the failed server when
> picking the next destination, which may prolong the assignment process.
> I will provide a patch to fix the test case. The other issues mentioned are
> up for discussion.
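
For illustration only, the two assignment issues listed in the description
could be addressed along these lines. This is a rough sketch, not the HBase
implementation and not the attached patch: the class AssignmentSketch, the
method pickDestination, and the server names are all hypothetical. It
assumes a destination is acceptable only if the server is registered in ZK,
has checked in via RPC, and has not already failed an assignment attempt for
the region.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; none of these names come from the HBase code base.
class AssignmentSketch {

    /**
     * Returns the first server that is registered in ZK, has checked in via
     * RPC, and is not in the set of servers that already failed an attempt.
     */
    static Optional<String> pickDestination(List<String> zkServers,
                                            Set<String> checkedIn,
                                            Set<String> failedServers) {
        return zkServers.stream()
                .filter(checkedIn::contains)               // issue 1: skip servers that never checked in
                .filter(s -> !failedServers.contains(s))   // issue 2: skip servers that already failed
                .findFirst();
    }

    public static void main(String[] args) {
        // "customRS" has an rs node in ZK but never reports for duty.
        List<String> zkServers = List.of("customRS", "rs1", "rs2");
        Set<String> checkedIn = Set.of("rs1", "rs2");
        Set<String> failed = new HashSet<>();

        Optional<String> first = pickDestination(zkServers, checkedIn, failed);
        System.out.println("first attempt: " + first.orElse("none"));

        // Pretend the first plan failed; the retry must choose another server.
        first.ifPresent(failed::add);
        Optional<String> retry = pickDestination(zkServers, checkedIn, failed);
        System.out.println("retry: " + retry.orElse("none"));
    }
}
```

With the sample data above, the custom RS is never chosen, and the retry
moves on to a different server instead of repeating the failed plan.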