[
https://issues.apache.org/jira/browse/HBASE-20978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579765#comment-16579765
]
Allan Yang commented on HBASE-20978:
------------------------------------
[~stack], I think I have found the cause of this issue. In my test env, I
reproduced this issue:
1. The child is finished for pid=1207
{code}
2018-08-05 02:20:14,516 DEBUG
[RpcServer.default.FPBQ.Fifo.handler=298,queue=28,port=60000]
assignment.RegionTransitionProcedure(275): Received report OPENED seqId=29089,
pid=1233, ppid=1207, state=RUN
NABLE:REGION_TRANSITION_DISPATCH, hasLock=true; AssignProcedure
table=IntegrationTestBigLinkedList, region=748c716b5490de5d10b83cc3174df533,
target=e010125048016.bja,60020,1533404434004; rit=OPENING, l
ocation=e010125048016.bja,60020,1533404434004
2018-08-05 02:20:14,516 DEBUG [PEWorker-3]
assignment.RegionTransitionProcedure(365): Finishing pid=1233, ppid=1207,
state=RUNNABLE:REGION_TRANSITION_FINISH, hasLock=true; AssignProcedure
table=Integra
tionTestBigLinkedList, region=748c716b5490de5d10b83cc3174df533,
target=e010125048016.bja,60020,1533404434004; rit=OPENING,
location=e010125048016.bja,60020,1533404434004
{code}
2. Before it can wake up its parent(pid=1207), the master get killed, when
restarting, the procedure with pid=1207 was recovered and submit for executing.
Notice that the procedure's state is WAITING
{code}
2018-08-05 02:20:30,805 DEBUG [Thread-15] procedure2.ProcedureExecutor(456):
Loading pid=1207, ppid=1206, state=WAITING, hasLock=false; MoveRegionProcedure
hri=748c716b5490de5d10b83cc3174df533, source=
e010125048016.bja,60020,1533404434004,
destination=e010125048016.bja,60020,1533404434004
2018-08-05 02:20:30,818 DEBUG [Thread-15] procedure2.Procedure(898): pid=1207,
ppid=1206, state=WAITING, hasLock=false; MoveRegionProcedure
hri=748c716b5490de5d10b83cc3174df533, source=e010125048016.bj
a,60020,1533404434004, destination=e010125048016.bja,60020,1533404434004 held
the lock before restarting, call acquireLock to restore it.
{code}
3. when executing, procedure executor detect the abnormal that we are executing
a procedure not with a RUNNABLE state
{code}
2018-08-05 02:20:36,220 WARN [PEWorker-9] procedure2.ProcedureExecutor(1801):
Worker terminating UNNATURALLY null
java.lang.IllegalArgumentException: pid=1207, ppid=1206, state=WAITING,
hasLock=true; MoveRegionProcedure hri=748c716b5490de5d10b83cc3174df533,
source=e010125048016.bja,60020,1533404434004, destination
=e010125048016.bja,60020,1533404434004
at
org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1482)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1278)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1785)
{code}
So, the problem for this issue is that, for normal cases, the WAITING procedure
is waken by its children and will change its state from WAITING to RUNALBE in
Procedure.tryRunnable(). But in this case, since all the children are already
finished before master restart, we don't have a child to transfer the state for
us, so we need to set the WAITING procedure's state to RUNNABLE before
executing.
Uploaded a patch to fix it.
> [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure
> ----------------------------------------------------------------
>
> Key: HBASE-20978
> URL: https://issues.apache.org/jira/browse/HBASE-20978
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 2.0.1
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.2
>
>
> Testing tip of branch-2.0, ran into this:
> {code}
> 2018-07-29 01:45:33,002 INFO [master/ve0524:16000] master.HMaster: Master
> has completed initialization 13.854sec
> 2018-07-29
> 01:45:33,003 INFO [PEWorker-4] procedure.MasterProcedureScheduler: pid=1820,
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure
> hri=533fb79ba23b27e9e0715b51daeb30c1,
> source=ve0538.halxg.cloudera.com,16020,1532847421672,
> destination=ve0540.halxg.cloudera.com,16020,1532853151031 checking lock on
> 533fb79ba23b27e9e0715b51daeb30c1
> 2018-07-29 01:45:33,003
> WARN [PEWorker-4] procedure2.ProcedureExecutor: Worker terminating
> UNNATURALLY null
> java.lang.IllegalArgumentException: pid=1820,
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure
> hri=533fb79ba23b27e9e0715b51daeb30c1,
> source=ve0538.halxg.cloudera.com,16020,1532847421672,
> destination=ve0540.halxg.cloudera.com,16020,1532853151031
> at
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
>
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1458)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249)
>
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763)
> {code}
> It then shows as the below in the UI:
> {code}
> Id Parent State Owner Type Start Time Last Update Errors
> Parameters
> 1820 WAITING stack MoveRegionProcedure
> hri=533fb79ba23b27e9e0715b51daeb30c1,
> source=ve0538.halxg.cloudera.com,16020,1532847421672,
> destination=ve0540.halxg.cloudera.com,16020,1532853151031 Sun Jul 29
> 01:33:37 PDT 2018 Sun Jul 29 01:33:38 PDT 2018 [ { state => [
> '1', '2' ] }, { regionId => '1532851768240', tableName => { namespace =>
> 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' },
> startKey => 'VttDLvXHdcmzwqNdrNoUFg==', endKey => 'WGFV8k+hFqhcIJGiKZ8L4Q==',
> offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => {
> hostName => 've0538.halxg.cloudera.com', port => '16020', startCode =>
> '1532847421672' }, destinationServer => { hostName =>
> 've0540.halxg.cloudera.com', port => '16020', startCode => '1532853151031' }
> } ]
> {code}
> This is what we'd just read from hbase:meta:
> {code}
> 2018-07-29 01:45:32,802 INFO [master/ve0524:16000]
> assignment.RegionStateStore: Load hbase:meta entry
> region=533fb79ba23b27e9e0715b51daeb30c1, regionState=CLOSED,
> lastHost=ve0538.halxg.cloudera.com,16020,1532847421672,
> regionLocation=ve0538.halxg.cloudera.com,16020,1532847421672,
> openSeqNum=1544600
> {code}
> Before this, we'd just logged this:
> 2018-07-29 01:33:39,786 INFO [PEWorker-14] assignment.RegionStateStore:
> pid=1823 updating hbase:meta row=533fb79ba23b27e9e0715b51daeb30c1,
> regionState=CLOSED
> Going back in history, we do the above each time the Master gets restarted so
> the region is offlined and never brought back online.
> It is failing here:
> {code}
> private void execProcedure(final RootProcedureState procStack,
> final Procedure<TEnvironment> procedure) {
> Preconditions.checkArgument(procedure.getState() ==
> ProcedureState.RUNNABLE,
> procedure.toString());
> {code}
> Its the parent move region that is trying to run and failing. It is not
> RUNNABLE? Because the subprocedure was 'done' but not fully?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)