[
https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack resolved HBASE-17653.
---------------------------
Resolution: Fixed
Hadoop Flags: Reviewed
Fix Version/s: 2.0.0
Pushed to master. Thanks for the review, [~toffer].
> HBASE-17624 rsgroup synchronizations will (distributed) deadlock
> ----------------------------------------------------------------
>
> Key: HBASE-17653
> URL: https://issues.apache.org/jira/browse/HBASE-17653
> Project: HBase
> Issue Type: Bug
> Components: rsgroup
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0
>
> Attachments: HBASE-17653.master.001.patch,
> HBASE-17653.master.002.patch, HBASE-17653.master.003.patch
>
>
> Follow-on from HBASE-17624. HBASE-17624 made it so only one thread at a time
> has access to the rsgroup administrator. At the tail of HBASE-17624, [~toffer]
> describes a scenario under which we may end up in a (distributed) deadlock.
> Let me repeat [~toffer]'s comment...
> {code}
> Both read/write access can't be single threaded. Consider the situation:
> 1. move_rsgroup_servers is called
> 2. while #1 is happening rsgroup region is in transition (rpc thread in #1
> holds monitor lock)
> 3. while #2 is happening meta is in transition.
> The balancer, trying to figure out a plan for the meta region, tries to get
> the monitor lock but can't. The rpc thread won't release the monitor lock
> since the rsgroup region never gets assigned. The rsgroup region never gets
> assigned because it can't update meta with its new state.
> There's a good chance this can be reproduced just by moving both the rsgroup
> and meta regions onto the same RS and calling move_rsgroup_servers on the
> same RS. A bunch of different actors will query for group affiliation, so we
> can't have writes block reads.
> ....
> In the code prior to this patch the getter methods that retrieve group
> information (getRSGroup, ofTable, OfServer, etc) don't require the monitor
> lock so the deadlock cycle is broken.
> ....
> The methods that do mutations and updates to zk and hbase:rsgroup are
> synchronized appropriately. Point me to where the incoherence is?
> {code}
> This issue is about testing/fixing/restoring rsgroup access. Will be back.
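The point in the quoted comment, that getters must not take the monitor lock held by a mutation, can be illustrated with a minimal sketch. This is not the HBase rsgroup code: the class RSGroupSnapshotSketch, its methods, and the persist callback are hypothetical names invented for illustration. It assumes group membership is kept as an immutable snapshot behind a volatile field, so reads proceed while a synchronized write is stalled.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names here are NOT the real HBase rsgroup API.
class RSGroupSnapshotSketch {
  // Immutable snapshot of server -> group, replaced wholesale by writers.
  private volatile Map<String, String> serverToGroup = Collections.emptyMap();

  // Read path: no monitor lock, so it never waits on a stalled writer.
  public String getGroupOfServer(String server) {
    return serverToGroup.getOrDefault(server, "default");
  }

  // Write path: synchronized so mutations are serialized. The persist callback
  // stands in for the (possibly slow) update to zk and hbase:rsgroup.
  public synchronized void moveServer(String server, String group, Runnable persist) {
    Map<String, String> copy = new HashMap<>(serverToGroup);
    copy.put(server, group);
    persist.run();                                     // may block for a long time
    serverToGroup = Collections.unmodifiableMap(copy); // publish the new snapshot
  }

  public static void main(String[] args) throws InterruptedException {
    RSGroupSnapshotSketch groups = new RSGroupSnapshotSketch();
    CountDownLatch writerHoldsLock = new CountDownLatch(1);
    CountDownLatch readerDone = new CountDownLatch(1);

    // The writer holds the monitor and cannot finish until a reader has run,
    // mimicking "rsgroup region can't be assigned until meta is updated".
    Thread writer = new Thread(() -> groups.moveServer("rs1", "groupA", () -> {
      writerHoldsLock.countDown();
      try {
        readerDone.await(5, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }));
    writer.start();

    writerHoldsLock.await();
    // If this getter were synchronized, it would stall behind the writer,
    // which is itself waiting for this read to complete.
    String seen = groups.getGroupOfServer("rs1");
    readerDone.countDown();
    writer.join();

    System.out.println(seen + " -> " + groups.getGroupOfServer("rs1"));
    // prints "default -> groupA"
  }
}
```

The reader sees the old snapshot ("default") while the write is in flight and the new one after it is published. That is the trade-off of this approach: reads may be briefly stale, but they cannot become part of a lock cycle.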