[
https://issues.apache.org/jira/browse/HBASE-20846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540955#comment-16540955
]
Duo Zhang commented on HBASE-20846:
-----------------------------------
{quote}
Do we need a new boolean? What cases do we need to handle on load post-crash?
To my mind, the procedure that holds a lock for the life of the procedure is
the only case that needs re-institution?
{quote}
Yes, this is the truth for now, holdLock is true, and wasExecuted is true, then
we give it the lock when loading procedures. But there is still a problem, as
we will only persist the wasExecuted state into the procedure store after we
finish at least one step of the procedure. So it is possible that, a procedure
has acquired the lock, and done some modifications, and then we crashed. On
restart, we will not give the lock to it, and another procedure may jump it. We
have already done one step, and we expect that no one can interrupt us, so
obviously this behavior is broken.
And when reviewing the table related procedures, I think we should make
holdLock depend on the state of the procedure. For table related procedure,
ideally we should acquire the exclusive lock all the time, until we schedule
the reopen regions procedure. If not there will be problem, think of two
ModifyTableProcedures run at the same time and both want to modify the table
descriptor and layout... For now, this is done by a very tricky way - we always
put the already being executed table procedure to the front of the queue, even
in the yield method. This is not a good design as it will be easily broken in
the future development.
> Table's shared lock is not held by sub-procedures after master restart
> ----------------------------------------------------------------------
>
> Key: HBASE-20846
> URL: https://issues.apache.org/jira/browse/HBASE-20846
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.1.0
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Fix For: 3.0.0, 2.0.2, 2.1.1
>
> Attachments: HBASE-20846.branch-2.0.002.patch,
> HBASE-20846.branch-2.0.patch
>
>
> Found this one when investigating ModifyTableProcedure got stuck while there
> was a MoveRegionProcedure going on after master restart.
> Though this issue can be solved by HBASE-20752. But I discovered something
> else.
> Before a MoveRegionProcedure can execute, it will hold the table's shared
> lock. so,, when a UnassignProcedure was spwaned, it will not check the
> table's shared lock since it is sure that its parent(MoveRegionProcedure) has
> aquired the table's lock.
> {code:java}
> // If there is parent procedure, it would have already taken xlock, so no
> need to take
> // shared lock here. Otherwise, take shared lock.
> if (!procedure.hasParent()
> && waitTableQueueSharedLock(procedure, table) == null) {
> return true;
> }
> {code}
> But, it is not the case when Master was restarted. The child
> procedure(UnassignProcedure) will be executed first after restart. Though it
> has a parent(MoveRegionProcedure), but apprently the parent didn't hold the
> table's lock.
> So, since it began to execute without hold the table's shared lock. A
> ModifyTableProcedure can aquire the table's exclusive lock and execute at the
> same time. Which is not possible if the master was not restarted.
> This will cause a stuck before HBASE-20752. But since HBASE-20752 has fixed,
> I wrote a simple UT to repo this case.
> I think we don't have to check the parent for table's shared lock. It is a
> shared lock, right? I think we can acquire it every time we need it.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)