[jira] [Commented] (HBASE-20978) [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure

Allan Yang (JIRA) Tue, 14 Aug 2018 05:56:16 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-20978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579765#comment-16579765
 ]


Allan Yang commented on HBASE-20978:
------------------------------------

[~stack], I think I have found the cause of this issue. In my test env, I 
reproduced this issue:

1. The child is finished for pid=1207
{code}
2018-08-05 02:20:14,516 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=298,queue=28,port=60000] 
assignment.RegionTransitionProcedure(275): Received report OPENED seqId=29089, 
pid=1233, ppid=1207, state=RUN
NABLE:REGION_TRANSITION_DISPATCH, hasLock=true; AssignProcedure 
table=IntegrationTestBigLinkedList, region=748c716b5490de5d10b83cc3174df533, 
target=e010125048016.bja,60020,1533404434004; rit=OPENING, l
ocation=e010125048016.bja,60020,1533404434004
2018-08-05 02:20:14,516 DEBUG [PEWorker-3] 
assignment.RegionTransitionProcedure(365): Finishing pid=1233, ppid=1207, 
state=RUNNABLE:REGION_TRANSITION_FINISH, hasLock=true; AssignProcedure 
table=Integra
tionTestBigLinkedList, region=748c716b5490de5d10b83cc3174df533, 
target=e010125048016.bja,60020,1533404434004; rit=OPENING, 
location=e010125048016.bja,60020,1533404434004
{code}

2. Before it can wake up its parent(pid=1207), the master get killed, when 
restarting, the procedure with pid=1207 was recovered and submit for executing. 
Notice that the procedure's state is WAITING
{code}
2018-08-05 02:20:30,805 DEBUG [Thread-15] procedure2.ProcedureExecutor(456): 
Loading pid=1207, ppid=1206, state=WAITING, hasLock=false; MoveRegionProcedure 
hri=748c716b5490de5d10b83cc3174df533, source=
e010125048016.bja,60020,1533404434004, 
destination=e010125048016.bja,60020,1533404434004

2018-08-05 02:20:30,818 DEBUG [Thread-15] procedure2.Procedure(898): pid=1207, 
ppid=1206, state=WAITING, hasLock=false; MoveRegionProcedure 
hri=748c716b5490de5d10b83cc3174df533, source=e010125048016.bj
a,60020,1533404434004, destination=e010125048016.bja,60020,1533404434004 held 
the lock before restarting, call acquireLock to restore it.
{code}

3. when executing, procedure executor detect the abnormal that we are executing 
a procedure not with a RUNNABLE state
{code}
2018-08-05 02:20:36,220 WARN  [PEWorker-9] procedure2.ProcedureExecutor(1801): 
Worker terminating UNNATURALLY null
java.lang.IllegalArgumentException: pid=1207, ppid=1206, state=WAITING, 
hasLock=true; MoveRegionProcedure hri=748c716b5490de5d10b83cc3174df533, 
source=e010125048016.bja,60020,1533404434004, destination
=e010125048016.bja,60020,1533404434004
        at 
org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1482)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1278)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1785)
{code}

So, the problem for this issue is that, for normal cases, the WAITING procedure 
is waken by its children and will change its state from WAITING to RUNALBE in 
Procedure.tryRunnable(). But in this case, since all the children are already 
finished before master restart, we don't have a child to transfer the state for 
us, so we need  to set the WAITING procedure's state to RUNNABLE before 
executing.
Uploaded a patch to fix it.

> [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure
> ----------------------------------------------------------------
>
>                 Key: HBASE-20978
>                 URL: https://issues.apache.org/jira/browse/HBASE-20978
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.0.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.2
>
>
> Testing tip of branch-2.0, ran into this:
> {code}
> 2018-07-29 01:45:33,002 INFO  [master/ve0524:16000] master.HMaster: Master 
> has completed initialization 13.854sec                                        
>                                                                    2018-07-29 
> 01:45:33,003 INFO  [PEWorker-4] procedure.MasterProcedureScheduler: pid=1820, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031 checking lock on 
> 533fb79ba23b27e9e0715b51daeb30c1                                              
>                                                     2018-07-29 01:45:33,003 
> WARN  [PEWorker-4] procedure2.ProcedureExecutor: Worker terminating 
> UNNATURALLY null
> java.lang.IllegalArgumentException: pid=1820, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031
>   at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
>                                                                               
>                                      at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1458)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249)
>                                                                               
>                                          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763)
> {code}
> It then shows as the below in the UI:
> {code}
> Id    Parent  State   Owner   Type    Start Time      Last Update     Errors  
> Parameters
> 1820          WAITING stack   MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031       Sun Jul 29 
> 01:33:37 PDT 2018    Sun Jul 29 01:33:38 PDT 2018            [ { state => [ 
> '1', '2' ] }, { regionId => '1532851768240', tableName => { namespace => 
> 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, 
> startKey => 'VttDLvXHdcmzwqNdrNoUFg==', endKey => 'WGFV8k+hFqhcIJGiKZ8L4Q==', 
> offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => { 
> hostName => 've0538.halxg.cloudera.com', port => '16020', startCode => 
> '1532847421672' }, destinationServer => { hostName => 
> 've0540.halxg.cloudera.com', port => '16020', startCode => '1532853151031' } 
> } ]
> {code}
> This is what we'd just read from hbase:meta:
> {code}
> 2018-07-29 01:45:32,802 INFO  [master/ve0524:16000] 
> assignment.RegionStateStore: Load hbase:meta entry 
> region=533fb79ba23b27e9e0715b51daeb30c1, regionState=CLOSED, 
> lastHost=ve0538.halxg.cloudera.com,16020,1532847421672, 
> regionLocation=ve0538.halxg.cloudera.com,16020,1532847421672, 
> openSeqNum=1544600
> {code}
> Before this, we'd just logged this:
> 2018-07-29 01:33:39,786 INFO  [PEWorker-14] assignment.RegionStateStore: 
> pid=1823 updating hbase:meta row=533fb79ba23b27e9e0715b51daeb30c1, 
> regionState=CLOSED
> Going back in history, we do the above each time the Master gets restarted so 
> the region is offlined and never brought back online.
> It is failing here:
> {code}
>   private void execProcedure(final RootProcedureState procStack,
>       final Procedure<TEnvironment> procedure) {
>     Preconditions.checkArgument(procedure.getState() == 
> ProcedureState.RUNNABLE,
>         procedure.toString());
> {code}
> Its the parent move region that is trying to run and failing. It is not 
> RUNNABLE? Because the subprocedure was 'done' but not fully?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-20978) [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure

Reply via email to