[ 
https://issues.apache.org/jira/browse/HBASE-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580575#comment-16580575
 ] 

Duo Zhang commented on HBASE-21051:
-----------------------------------

I think ReopenTableRegionProcedure should not always hold the lock, and also 
should handle the possible NPE cases if the region is gone during the reopening.
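
The stack trace in the quoted description ends in ConcurrentHashMap.get, which throws NullPointerException when it is handed a null key. Below is a minimal sketch of guarding against a region that vanished mid-reopen. It is not HBase code: RegionDirectory and all of its methods are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the master's region bookkeeping; not HBase code.
final class RegionDirectory {
  // region name -> hosting server, only for regions that still exist
  private final Map<String, String> regionToServer = new ConcurrentHashMap<>();
  // server name -> server state
  private final Map<String, String> serverToState = new ConcurrentHashMap<>();

  void online(String region, String server) {
    regionToServer.put(region, server);
    serverToState.put(server, "ONLINE");
  }

  // A finished split removes the parent region's location.
  void removeSplitParent(String region) {
    regionToServer.remove(region);
  }

  // Unguarded: for a vanished region the inner get returns null, and
  // ConcurrentHashMap.get(null) then throws NullPointerException.
  String serverStateUnguarded(String region) {
    return serverToState.get(regionToServer.get(region));
  }

  // Guarded: an empty Optional tells the caller to skip the region.
  Optional<String> serverStateGuarded(String region) {
    String server = regionToServer.get(region);
    if (server == null) {
      return Optional.empty();
    }
    return Optional.ofNullable(serverToState.get(server));
  }
}
```

With this shape, a reopen loop could skip any region whose guarded lookup comes back empty instead of failing the whole procedure.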

But for other procedures, such as ModifyTableProcedure, we should make the 
return value of holdLock depend on the state of the procedure. For 
ModifyTableProcedure, I think we need to hold the exclusive lock all the time 
until we reach the last state, where we need to reopen all the regions.
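
A rough sketch of what a state-dependent holdLock could look like, written as a self-contained toy rather than against the real procedure classes. ModifyTableSketch, its State enum, and the simplified state names are assumptions for illustration; the actual ModifyTableProcedure has more states.

```java
// Toy model, not HBase code: every name below is hypothetical.
final class ModifyTableSketch {
  enum State { PREPARE, UPDATE_DESCRIPTOR, REOPEN_ALL_REGIONS }

  private State state = State.PREPARE;

  State state() {
    return state;
  }

  // Keep the exclusive table lock across steps until the final state,
  // where the regions are reopened and the lock no longer needs to be held.
  boolean holdLock() {
    return state != State.REOPEN_ALL_REGIONS;
  }

  // Advance one state; returns false once the last state has been reached.
  boolean step() {
    State[] all = State.values();
    if (state.ordinal() == all.length - 1) {
      return false;
    }
    state = all[state.ordinal() + 1];
    return true;
  }
}
```

The point of the sketch is only that the answer is computed from the current state instead of being a constant.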

There will be no problem as long as the master does not crash, because in PE we 
have a fast path that executes the procedure without releasing the lock even if 
holdLock is false. So I haven't spent too much time on this topic yet.
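
One way to picture the fast path is the toy model below. It is not ProcedureExecutor code: FastPathSketch, its run method, and the log strings are all hypothetical, and the restart behaviour (a procedure whose holdLock is false has to compete for the lock again) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not HBase code: every name below is hypothetical.
final class FastPathSketch {
  // Runs `steps` steps of a single procedure and records lock events.
  // restartBefore < 0 means no restart; otherwise a restart is simulated
  // just before the step with that index.
  static List<String> run(int steps, boolean holdLock, int restartBefore) {
    List<String> log = new ArrayList<>();
    log.add("acquire");
    for (int i = 0; i < steps; i++) {
      if (i == restartBefore) {
        if (holdLock) {
          // Assumption: the lock is restored before anything else is scheduled.
          log.add("restart:lock-restored");
        } else {
          // Assumption: the lock is gone, so another procedure may run first.
          log.add("restart:lock-lost");
          log.add("acquire");
        }
      }
      // Fast path: the next step runs without releasing the lock in between,
      // whatever holdLock says.
      log.add("step" + i);
    }
    log.add("release");
    return log;
  }
}
```

Without a restart the two holdLock settings produce the same trace, which matches the observation that the problem only shows up after a crash.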

> Possible NPE if ModifyTable and region split happen at the same time
> --------------------------------------------------------------------
>
>                 Key: HBASE-21051
>                 URL: https://issues.apache.org/jira/browse/HBASE-21051
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2
>    Affects Versions: 2.1.0, 2.0.1
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>
> Similar to HBASE-20921, the ModifyTable procedure and reopenProcedure won't 
> hold the lock, so other procedures like split/merge can execute at the same 
> time.
> 1. A split happened during ModifyTable. As you can see from the log, the split 
> was nearly complete.
> {code}
> 2018-08-05 01:28:31,339 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1659): Finished subprocedure(s) of pid=772, state=RUNNABLE:SPLIT_TABLE_REGION_POST_OPERATION, hasLock=true; SplitTableRegionProcedure table=IntegrationTestBigLinkedList, parent=357a7a6a62c76bc2d7ab30a6cc812637, daughterA=b13e5d155b65a5f752f3adda78fcfb6a, daughterB=5be3aadcee68d91c3d1e464865550246; resume parent processing.
> 2018-08-05 01:28:31,345 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1296): Finished pid=795, ppid=772, state=SUCCESS, hasLock=false; AssignProcedure table=IntegrationTestBigLinkedList, region=b13e5d155b65a5f752f3adda78fcfb6a, target=e010125048016.bja,60020,1533402809226 in 5.0280sec
> {code}
> 2. reopenProcedure began to reopen the region by moving it.
> {code}
> 2018-08-05 01:28:31,389 INFO  [PEWorker-11] procedure.MasterProcedureScheduler(631): pid=781, ppid=774, state=RUNNABLE:MOVE_REGION_UNASSIGN, hasLock=false; MoveRegionProcedure hri=357a7a6a62c76bc2d7ab30a6cc812637, source=e010125048016.bja,60020,1533402809226, destination=e010125048016.bja,60020,1533402809226 checking lock on 357a7a6a62c76bc2d7ab30a6cc812637
> 2018-08-05 01:28:31,390 INFO  [PEWorker-3] procedure2.ProcedureExecutor(1296): Finished pid=772, state=SUCCESS, hasLock=false; SplitTableRegionProcedure table=IntegrationTestBigLinkedList, parent=357a7a6a62c76bc2d7ab30a6cc812637, daughterA=b13e5d155b65a5f752f3adda78fcfb6a, daughterB=5be3aadcee68d91c3d1e464865550246 in 21.9050sec
> 2018-08-05 01:28:31,518 INFO  [PEWorker-11] procedure2.ProcedureExecutor(1533): Initialized subprocedures=[{pid=797, ppid=781, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=false; UnassignProcedure table=IntegrationTestBigLinkedList, region=357a7a6a62c76bc2d7ab30a6cc812637, server=e010125048016.bja,60020,1533402809226}]
> 2018-08-05 01:28:31,530 INFO  [PEWorker-15] procedure.MasterProcedureScheduler(631): pid=797, ppid=781, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=false; UnassignProcedure table=IntegrationTestBigLinkedList, region=357a7a6a62c76bc2d7ab30a6cc812637, server=e010125048016.bja,60020,1533402809226 checking lock on 357a7a6a62c76bc2d7ab30a6cc812637
> {code}
> 3. MoveRegionProcedure fails since the region does not exist any more (due to 
> the split)
> {code}
> 2018-08-05 01:28:31,543 ERROR [PEWorker-15] procedure2.ProcedureExecutor(1517): CODE-BUG: Uncaught runtime exception: pid=797, ppid=781, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=357a7a6a62c76bc2d7ab30a6cc812637, server=e010125048016.bja,60020,1533402809226
> java.lang.NullPointerException
>         at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
>         at org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1097)
>         at org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1125)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1455)
>         at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:204)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:349)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:101)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:873)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1498)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1278)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1785)
> {code}
> We need to think about this case and find an ultimate solution for it; 
> otherwise, issues like this one and HBASE-20921 will keep coming.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
