[ https://issues.apache.org/jira/browse/HBASE-20657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505780#comment-16505780 ]
Xiaolin Ha commented on HBASE-20657:
------------------------------------

I have run into a case where two concurrent ModifyTableProcedures left regions stuck in transition (RIT). After the first procedure had closed a region through its subprocedure UnassignProcedure, the second procedure's subprocedure MoveRegionProcedure threw an exception when it checked whether the region was online. What's more, the second procedure's MoveRegionProcedure goes to the head of the TableQueue, because it is a subprocedure and ModifyTableProcedure's holdLock returns false. As a result, the first ModifyTableProcedure is interrupted by the second one, and the second ModifyTableProcedure is stuck at the online check.

Since ModifyTableProcedure's holdLock returns false and the procedure is not exclusive with region operation procedures, is ModifyTableProcedure easily interrupted by region procedures, such as the subprocedures of another ModifyTableProcedure, and left stuck?

> Retrying RPC call for ModifyTableProcedure may get stuck
> --------------------------------------------------------
>
>                 Key: HBASE-20657
>                 URL: https://issues.apache.org/jira/browse/HBASE-20657
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, proc-v2
>   Affects Versions: 2.0.0
>            Reporter: Sergey Soldatov
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.1
>
>         Attachments: HBASE-20657-1-branch-2.patch,
> HBASE-20657-2-branch-2.patch, HBASE-20657-3-branch-2.patch,
> HBASE-20657-testcase-branch2.patch
>
>
> Env: 2 masters, 1 RS.
> Steps to reproduce: The active master is killed while a
> ModifyTableProcedure is executing.
> If the table has enough regions, some of them may already be closed when
> the secondary master becomes active. Once the client retries the call to
> the new active master, a new ModifyTableProcedure is created and gets stuck
> during MODIFY_TABLE_REOPEN_ALL_REGIONS state handling. That happens because:
> 1. When retrying from the client side, we call modifyTableAsync, which
> creates a procedure with a new nonce key:
> {noformat}
> ModifyTableRequest request = RequestConverter.buildModifyTableRequest(
>     td.getTableName(), td, ng.getNonceGroup(), ng.newNonce());
> {noformat}
> So on the server side it is treated as a new procedure and starts
> executing immediately.
> 2. When processing MODIFY_TABLE_REOPEN_ALL_REGIONS, we create a
> MoveRegionProcedure for each region, but it checks whether the region is
> online (and it is not), so it fails immediately, forcing the procedure to
> restart.
> [~an...@apache.org] saw a similar case where two concurrent ModifyTable
> procedures were running and got stuck in a similar way.
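
Step 1 of the quoted description turns on how a nonce key identifies a request. The sketch below is a simplified model, not HBase code: NonceRetrySketch, NonceKey and submit are hypothetical names, and the in-memory map stands in for whatever the master actually uses to track nonces. It only illustrates why a retry that calls ng.newNonce() looks like a brand-new procedure, while a retry that reused the original nonce would be matched to the existing one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a master-side procedure registry; not HBase code.
final class NonceRetrySketch {

    /** A (nonceGroup, nonce) pair identifying one logical client request. */
    static final class NonceKey {
        final long group;
        final long nonce;

        NonceKey(long group, long nonce) {
            this.group = group;
            this.nonce = nonce;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof NonceKey)) {
                return false;
            }
            NonceKey k = (NonceKey) o;
            return group == k.group && nonce == k.nonce;
        }

        @Override
        public int hashCode() {
            return Objects.hash(group, nonce);
        }
    }

    private final Map<NonceKey, Long> procIdByNonce = new HashMap<>();
    private long nextProcId = 1;

    /** Returns the existing procedure ID for a known nonce key, else registers a new one. */
    long submit(NonceKey key) {
        Long existing = procIdByNonce.get(key);
        if (existing != null) {
            return existing; // same nonce key: treated as a retry of the same request
        }
        long id = nextProcId++;
        procIdByNonce.put(key, id); // unseen nonce key: treated as a new procedure
        return id;
    }

    public static void main(String[] args) {
        NonceRetrySketch master = new NonceRetrySketch();
        long first = master.submit(new NonceKey(7L, 100L));
        long retrySameNonce = master.submit(new NonceKey(7L, 100L));
        long retryNewNonce = master.submit(new NonceKey(7L, 101L));
        // prints "1 1 2": only the retry with a fresh nonce creates a second procedure
        System.out.println(first + " " + retrySameNonce + " " + retryNewNonce);
    }
}
```

In this model the third call corresponds to the client retry described in the issue: because the nonce differs, nothing ties it to the procedure that is already in flight.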