[
https://issues.apache.org/jira/browse/HBASE-20978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562183#comment-16562183
]
stack commented on HBASE-20978:
-------------------------------
Parent MRP pid=1820 is WAITING. The pid=1823 unassign sub-procedure doesn't
run to finish up. Looking in pv2 log, I see this recorded for pid=1823:
{code}
pid=1823, ppid=1820, state=SUCCESS; UnassignProcedure
table=IntegrationTestBigLinkedList, region=533fb79ba23b27e9e0715b51daeb30c1,
server=ve0538.halxg.cloudera.com,16020,1532847421672
submittedTime=1532853218235, lastUpdate=1532853219788
stackIndexes=[1, 2]
{code}
This is the last entry for pid=1820
{code}
pid=1820, state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure
hri=533fb79ba23b27e9e0715b51daeb30c1,
source=ve0538.halxg.cloudera.com,16020,1532847421672,
destination=ve0540.halxg.cloudera.com,16020,1532853151031
submittedTime=1532853217647, lastUpdate=1532853218235
stackIndexes=[0]
{code}
... which matches the log entry:
2018-07-29 01:33:38,235 INFO [PEWorker-11] procedure.MasterProcedureScheduler:
pid=1820, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
hri=533fb79ba23b27e9e0715b51daeb30c1,
source=ve0538.halxg.cloudera.com,16020,1532847421672,
destination=ve0540.halxg.cloudera.com,16020,1532853151031 checking lock on
533fb79ba23b27e9e0715b51daeb30c1
This log was cut off on crash.
2018-07-29 01:33:57,355 WARN [master/ve0524:16000] wal.WALProcedureStore:
Unable to read tracker for
hdfs://ve0524.halxg.cloudera.com:8020/hbase/MasterProcWALs/pv2-00000000000000000114.log
- Invalid Trailer version. got 116 expected 1
It does seem like we failed to run the change-over of control from pid=1823
subprocedure back to parent pid=1820.
Rerun this step as part of loading procedure wal on startup? Dig.
> [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure
> ----------------------------------------------------------------
>
> Key: HBASE-20978
> URL: https://issues.apache.org/jira/browse/HBASE-20978
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 2.0.1
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.2
>
>
> Testing tip of branch-2.0, ran into this:
> {code}
> 2018-07-29 01:45:33,002 INFO [master/ve0524:16000] master.HMaster: Master
> has completed initialization 13.854sec
> 2018-07-29
> 01:45:33,003 INFO [PEWorker-4] procedure.MasterProcedureScheduler: pid=1820,
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure
> hri=533fb79ba23b27e9e0715b51daeb30c1,
> source=ve0538.halxg.cloudera.com,16020,1532847421672,
> destination=ve0540.halxg.cloudera.com,16020,1532853151031 checking lock on
> 533fb79ba23b27e9e0715b51daeb30c1
> 2018-07-29 01:45:33,003
> WARN [PEWorker-4] procedure2.ProcedureExecutor: Worker terminating
> UNNATURALLY null
> java.lang.IllegalArgumentException: pid=1820,
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure
> hri=533fb79ba23b27e9e0715b51daeb30c1,
> source=ve0538.halxg.cloudera.com,16020,1532847421672,
> destination=ve0540.halxg.cloudera.com,16020,1532853151031
> at
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
>
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1458)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249)
>
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763)
> {code}
> It then shows as the below in the UI:
> {code}
> Id Parent State Owner Type Start Time Last Update Errors
> Parameters
> 1820 WAITING stack MoveRegionProcedure
> hri=533fb79ba23b27e9e0715b51daeb30c1,
> source=ve0538.halxg.cloudera.com,16020,1532847421672,
> destination=ve0540.halxg.cloudera.com,16020,1532853151031 Sun Jul 29
> 01:33:37 PDT 2018 Sun Jul 29 01:33:38 PDT 2018 [ { state => [
> '1', '2' ] }, { regionId => '1532851768240', tableName => { namespace =>
> 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' },
> startKey => 'VttDLvXHdcmzwqNdrNoUFg==', endKey => 'WGFV8k+hFqhcIJGiKZ8L4Q==',
> offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => {
> hostName => 've0538.halxg.cloudera.com', port => '16020', startCode =>
> '1532847421672' }, destinationServer => { hostName =>
> 've0540.halxg.cloudera.com', port => '16020', startCode => '1532853151031' }
> } ]
> {code}
> This is what we'd just read from hbase:meta:
> {code}
> 2018-07-29 01:45:32,802 INFO [master/ve0524:16000]
> assignment.RegionStateStore: Load hbase:meta entry
> region=533fb79ba23b27e9e0715b51daeb30c1, regionState=CLOSED,
> lastHost=ve0538.halxg.cloudera.com,16020,1532847421672,
> regionLocation=ve0538.halxg.cloudera.com,16020,1532847421672,
> openSeqNum=1544600
> {code}
> Before this, we'd just logged this:
> 2018-07-29 01:33:39,786 INFO [PEWorker-14] assignment.RegionStateStore:
> pid=1823 updating hbase:meta row=533fb79ba23b27e9e0715b51daeb30c1,
> regionState=CLOSED
> Going back in history, we do the above each time the Master gets restarted so
> the region is offlined and never brought back online.
> It is failing here:
> {code}
> private void execProcedure(final RootProcedureState procStack,
> final Procedure<TEnvironment> procedure) {
> Preconditions.checkArgument(procedure.getState() ==
> ProcedureState.RUNNABLE,
> procedure.toString());
> {code}
> Its the parent move region that is trying to run and failing. It is not
> RUNNABLE? Because the subprocedure was 'done' but not fully?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)