[ 
https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870919#comment-15870919
 ] 

stack commented on HBASE-17653:
-------------------------------

[~toffer] Please review. Sorry for making you work.

> HBASE-17624 rsgroup synchronizations will (distributed) deadlock
> ----------------------------------------------------------------
>
>                 Key: HBASE-17653
>                 URL: https://issues.apache.org/jira/browse/HBASE-17653
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>            Reporter: stack
>            Assignee: stack
>         Attachments: HBASE-17653.master.001.patch
>
>
> Follow-on from HBASE-17624. HBASE-17624 made it so one thread only has access 
> to the rsgroup administrator. In tail of HBASE-17624 [~toffer] describes 
> scenario under which we  may end up in a deadlock (distributed). Let me 
> repeat [~toffer] comment...
> {code}
> Both read/write access can't be single threaded. Consider the situation:
> 1. move_rsgroup_servers is called
> 2. while #1 is happening rsgroup region is in transition (rpc thread in #1 
> holds monitor lock)
> 3. while #2 is happening meta is in transition.
> Balancer tries to figure out plan for meta region tries to get monitor lock 
> but can't. rpc thread task won't release monitor lock since rsgroup region 
> never gets assigned. rsgroup region never gets assigned because it can't 
> update meta with new state.
> There's a good chance this can be reproduce just by moving both rsgroup and 
> meta region onto the same RS and call move_rsgoup_servers on the same RS.
> A bunch different actors will query from group affiliation so we can't have 
> writes block reads.
> ....
> In the code prior to this patch the getter methods that retrieve group 
> information (getRSGroup, ofTable, OfServer, etc) don't require the monitor 
> lock so the deadlock cycle is broken.
> ....
> The methods that does mutations and updates to zk and hbase:rsgroup are 
> synchronized appropriately. Point me to where the incoherence is?
> {code}
> This issue is about testing/fixing/restoring rsgroup access. Will be back.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to