[ 
https://issues.apache.org/jira/browse/HBASE-20846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540955#comment-16540955
 ] 

Duo Zhang commented on HBASE-20846:
-----------------------------------

{quote}
Do we need a new boolean? What cases do we need to handle on load post-crash? 
To my mind, the procedure that holds a lock for the life of the procedure is 
the only case that needs re-institution?
{quote}

Yes, this is the truth for now, holdLock is true, and wasExecuted is true, then 
we give it the lock when loading procedures. But there is still a problem, as 
we will only persist the wasExecuted state into the procedure store after we 
finish at least one step of the procedure. So it is possible that, a procedure 
has acquired the lock, and done some modifications, and then we crashed. On 
restart, we will not give the lock to it, and another procedure may jump it. We 
have already done one step, and we expect that no one can interrupt us, so 
obviously this behavior is broken.

And when reviewing the table related procedures, I think we should make 
holdLock depend on the state of the procedure. For table related procedure, 
ideally we should acquire the exclusive lock all the time, until we schedule 
the reopen regions procedure. If not there will be problem, think of two 
ModifyTableProcedures run at the same time and both want to modify the table 
descriptor and layout... For now, this is done by a very tricky way - we always 
put the already being executed table procedure to the front of the queue, even 
in the yield method. This is not a good design as it will be easily broken in 
the future development.

> Table's shared lock is not held by sub-procedures after master restart
> ----------------------------------------------------------------------
>
>                 Key: HBASE-20846
>                 URL: https://issues.apache.org/jira/browse/HBASE-20846
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>             Fix For: 3.0.0, 2.0.2, 2.1.1
>
>         Attachments: HBASE-20846.branch-2.0.002.patch, 
> HBASE-20846.branch-2.0.patch
>
>
> Found this one when investigating ModifyTableProcedure got stuck while there 
> was a MoveRegionProcedure going on after master restart.
> Though this issue can be solved by HBASE-20752. But I discovered something 
> else.
> Before a MoveRegionProcedure can execute, it will hold the table's shared 
> lock. so,, when a UnassignProcedure was spwaned, it will not check the 
> table's shared lock since it is sure that its parent(MoveRegionProcedure) has 
> aquired the table's lock.
> {code:java}
> // If there is parent procedure, it would have already taken xlock, so no 
> need to take
>       // shared lock here. Otherwise, take shared lock.
>       if (!procedure.hasParent()
>           && waitTableQueueSharedLock(procedure, table) == null) {
>           return true;
>       }
> {code}
> But, it is not the case when Master was restarted. The child 
> procedure(UnassignProcedure) will be executed first after restart. Though it 
> has a parent(MoveRegionProcedure), but apprently the parent didn't hold the 
> table's lock.
> So, since it began to execute without hold the table's shared lock. A 
> ModifyTableProcedure can aquire the table's exclusive lock and execute at the 
> same time. Which is not possible if the master was not restarted.
> This will cause a stuck before HBASE-20752. But since HBASE-20752 has fixed, 
> I wrote a simple UT to repo this case.
> I think we don't have to check the parent for table's shared lock. It is a 
> shared lock, right? I think we can acquire it every time we need it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to