[ 
https://issues.apache.org/jira/browse/HBASE-20657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498727#comment-16498727
 ] 

Sergey Soldatov commented on HBASE-20657:
-----------------------------------------

Let's consider the case that [[email protected]] found. Attaching the test case 
to reproduce the problem.
What happens:
1. The first ModifyTableProcedure (MTP) gets the exclusive lock, performs all 
meta modifications, and creates subprocedures to reopen all regions. All of 
those subprocedures go into the TableQueue. After that, it releases the lock. 
2. The second MTP gets the exclusive lock, performs all meta modifications, and 
tries to create subprocedures. Since the reopen creates a 
MoveRegionProcedure, we fail immediately in the constructor of 
MoveRegionProcedure, because some regions may be closed at that moment by 
the first MTP. 
3. Since the lock was not released in (2), the scheduler will continuously try 
to execute the second MTP, failing over and over again. 
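
The loop in (2) and (3) can be modeled with a small standalone sketch. This is 
not HBase code: the class, the exception and both methods are hypothetical 
stand-ins, and the sketch assumes that a region closed by the first MTP stays 
closed for as long as the second MTP keeps retrying. The retry count is bounded 
here only so that the example terminates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the stuck reopen step; none of these names are HBase APIs. */
final class StuckModifySketch {

    /** Hypothetical failure raised when a sub-procedure is planned for an offline region. */
    static final class RegionOfflineException extends RuntimeException {
        RegionOfflineException(String region) {
            super("region is not online: " + region);
        }
    }

    /** Stands in for the online check done while constructing the move sub-procedure. */
    static void planMove(String region, Map<String, Boolean> online) {
        if (!online.getOrDefault(region, false)) {
            throw new RegionOfflineException(region);
        }
    }

    /**
     * Retries the reopen step up to maxAttempts times without changing region state,
     * mirroring a scheduler that re-executes a procedure that still holds the lock.
     * Returns the number of failed attempts.
     */
    static int runReopenStep(Map<String, Boolean> online, int maxAttempts) {
        int failures = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                for (String region : online.keySet()) {
                    planMove(region, online);
                }
                return failures; // every sub-procedure could be created
            } catch (RegionOfflineException e) {
                failures++; // nothing brings the region back online, so the next attempt fails too
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, Boolean> online = new LinkedHashMap<>();
        online.put("region-a", true);
        online.put("region-b", false); // assumed to have been closed by the first MTP
        System.out.println("failed attempts: " + runReopenStep(online, 5));
    }
}
```

In this model every attempt fails for as long as one region is offline, which 
is the behaviour described in (3).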
[~stack]  that's the problem I've tried to describe in the discussion in 
HBASE-20202. There are several directions (not ways, because none of them 
actually work) to solve it:
1. Remove the call to openRegion in the MoveRegionProcedure constructor. We 
would still have the problem that, while the subprocedures of the first MTP 
execute, the metadata is updated with new values from the second MTP, and the 
region opens would start failing (as in the example, where the report is that 
the compression is incorrect).
2. Make the MTP hold the lock. That doesn't work either. Even a single MTP 
call gets stuck if there is more than one worker for proc-v2. The reason is 
MasterProcedureScheduler#doPoll:
https://github.com/apache/hbase/blob/74ef118e9e2246c09280ebb7eb6552ef91bdd094/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java#L200
Since MTP creates 3 levels of subprocedures, due to the concurrent execution 
the queue may have a mix of subprocedures that have different parents, and all 
workers may remove that table queue from the fairqueue at the same time, so 
there will be nothing for them to execute and some regions would get stuck in 
RiT state. I'm not sure, but there is a chance that this is a side effect of 
HBASE-20000, so FYI [~Apache9] 

Another question is related to (1) from the description. Is it expected that, 
on a retry from the client, we generate a new nonce key for the same 
procedure? 
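
To make the question concrete, here is a minimal sketch of nonce-based 
deduplication. It is not the HBase implementation: the class and its submit 
method are hypothetical, and the sketch assumes that the side receiving the 
retry can still see the nonce keys of earlier submissions. Under that 
assumption, a retry that reuses the nonce maps to the existing procedure, while 
a retry with a fresh nonce looks like a new procedure.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of nonce-based deduplication; not HBase code. */
final class NonceDedupSketch {
    /** Maps "nonceGroup:nonce" to the id of the procedure created for that key. */
    private final Map<String, Long> procIdByNonceKey = new HashMap<>();
    private long nextProcId = 1;

    /** Returns the existing procedure id for a known nonce key, else registers a new one. */
    long submit(long nonceGroup, long nonce) {
        String key = nonceGroup + ":" + nonce;
        Long existing = procIdByNonceKey.get(key);
        if (existing != null) {
            return existing; // same nonce key: treated as a retry of the same procedure
        }
        long id = nextProcId++;
        procIdByNonceKey.put(key, id);
        return id; // unknown nonce key: treated as a brand new procedure
    }

    public static void main(String[] args) {
        NonceDedupSketch master = new NonceDedupSketch();
        long first = master.submit(7L, 100L);
        long retrySameNonce = master.submit(7L, 100L);
        long retryNewNonce = master.submit(7L, 101L);
        System.out.println("same nonce reuses the procedure: " + (first == retrySameNonce));
        System.out.println("new nonce creates another procedure: " + (first != retryNewNonce));
    }
}
```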

Any thoughts, suggestions or directions are much appreciated. 



Also FYI [[email protected]], [~elserj]

> Retrying RPC call for ModifyTableProcedure may get stuck
> --------------------------------------------------------
>
>                 Key: HBASE-20657
>                 URL: https://issues.apache.org/jira/browse/HBASE-20657
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, proc-v2
>    Affects Versions: 2.0.0
>            Reporter: Sergey Soldatov
>            Assignee: Sergey Soldatov
>            Priority: Major
>         Attachments: HBASE-20657-testcase-branch2.patch
>
>
> Env: 2 masters, 1 RS. 
> Steps to reproduce: Active master is killed while ModifyTableProcedure is 
> executed. 
> If the table has enough regions, it may happen that when the secondary master 
> becomes active some of the regions are closed, so once the client retries the 
> call to the new active master, a new ModifyTableProcedure is created and gets 
> stuck during MODIFY_TABLE_REOPEN_ALL_REGIONS state handling. That happens 
> because:
> 1. When we retry from the client side, we call modifyTableAsync, which 
> creates a procedure with a new nonce key:
> {noformat}
>          ModifyTableRequest request = RequestConverter.buildModifyTableRequest(
>             td.getTableName(), td, ng.getNonceGroup(), ng.newNonce());
> {noformat}
>  So on the server side, it's considered a new procedure and starts 
> executing immediately.
> 2. When we process MODIFY_TABLE_REOPEN_ALL_REGIONS, we create a 
> MoveRegionProcedure for each region, but it checks whether the region is 
> online (and it's not), so it fails immediately, forcing the procedure to 
> restart.
> [[email protected]] saw a similar case when two concurrent ModifyTable 
> procedures were running and got stuck in a similar way. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
