[
https://issues.apache.org/jira/browse/HBASE-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118186#comment-17118186
]
Michael Stack commented on HBASE-24438:
---------------------------------------
So, what is the issue? The staleness? Or difference between 'Procedures and
Locks' and Tasks list? Thanks [~timoha]
> Stale ServerCrashProcedure task in HBase Master UI
> --------------------------------------------------
>
> Key: HBASE-24438
> URL: https://issues.apache.org/jira/browse/HBASE-24438
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 2.2.4
> Environment: HBase 2.2.4
> HDFS 3.1.3 with erasure coding enabled
> Kubernetes
> Reporter: Andrey Elenskiy
> Priority: Major
>
> Tasks section (show non-RPC Tasks) in HBase Master UI has stale entries with
> ServerCrashProcedure after master failover. The procedures have finished with
> SUCCESS on a previously active HBase master and aren't showing in "Procedures
> & Locks".
> Based on the logs, both of those regionserver were carrying hbase:meta (logs
> are sorted newest first grepped for those specific servers that have stale
> ServerCrashProcedures):
> {noformat}
> 2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28]
> procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS;
> ServerCrashProcedure
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347,
> splitWal=true, meta=true in 2.5290sec
> 2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28]
> procedure.ServerCrashProcedure: removed crashed server
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after
> splitting done
> 2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager:
> dead splitlog workers
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> 2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager:
> Log dir for server
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not
> exist
> 2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28]
> procedure.ServerCrashProcedure:
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
> 2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28]
> procedure.ServerCrashProcedure: pid=38,
> state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347,
> splitWal=true, meta=true found RIT pid=20, ppid=18,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
> rit=ABNORMALLY_CLOSED,
> location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347,
> table=hbase:meta, region=1588230740
> 2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager:
> Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in
> [hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting]
> in 0ms
> 2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager:
> hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting
> dir is empty, no logs to split.
> 2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager:
> dead splitlog workers
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> 2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28]
> procedure.ServerCrashProcedure: Start pid=38,
> state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347,
> splitWal=true, meta=true
> 2020-05-21 19:04:06,544 INFO [RegionServerTracker-0]
> assignment.AssignmentManager: Scheduled SCP pid=38 for
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347
> (carryingMeta=true)
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write
> locks = 1, Read locks = 0], oldState=ONLINE.
> 2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager:
> Processing expiration of
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on
> hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
> 2020-05-21 19:04:06,434 INFO [RegionServerTracker-0]
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing
> expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> ...
> 2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27]
> procedure2.ProcedureExecutor: Finished pid=37, state=SUCCESS;
> ServerCrashProcedure
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
> splitWal=true, meta=true in 997msec
> 2020-05-21 19:04:04,497 INFO [KeepAlivePEWorker-27]
> procedure.ServerCrashProcedure: removed crashed server
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 after
> splitting done
> 2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.SplitLogManager:
> dead splitlog workers
> [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
> 2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.MasterWalManager:
> Log dir for server
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 does not
> exist
> 2020-05-21 19:04:04,175 INFO [KeepAlivePEWorker-27]
> procedure.ServerCrashProcedure:
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 had 1 regions
> 2020-05-21 19:04:03,958 INFO [KeepAlivePEWorker-27]
> procedure.ServerCrashProcedure: pid=37,
> state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
> splitWal=true, meta=true found RIT pid=20, ppid=18,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
> rit=OPENING,
> location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
> table=hbase:meta, region=1588230740
> 2020-05-21 19:04:03,852 INFO [KeepAlivePEWorker-27] master.SplitLogManager:
> Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in
> [hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting]
> in 0ms
> 2020-05-21 19:04:03,845 INFO [KeepAlivePEWorker-27] master.SplitLogManager:
> hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting
> dir is empty, no logs to split.
> 2020-05-21 19:04:03,843 INFO [KeepAlivePEWorker-27] master.SplitLogManager:
> dead splitlog workers
> [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
> 2020-05-21 19:04:03,725 INFO [KeepAlivePEWorker-27]
> procedure.ServerCrashProcedure: Start pid=37,
> state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
> splitWal=true, meta=true
> 2020-05-21 19:04:03,512 INFO [RegionServerTracker-0]
> assignment.AssignmentManager: Scheduled SCP pid=37 for
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
> (carryingMeta=true)
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5bcc7453[Write
> locks = 1, Read locks = 0], oldState=ONLINE.
> 2020-05-21 19:04:03,500 INFO [RegionServerTracker-0] master.ServerManager:
> Processing expiration of
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 on
> hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
> 2020-05-21 19:04:03,500 INFO [RegionServerTracker-0]
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing
> expiration [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
> {noformat}
>
> The currently active HBase Master (that has the stale tasks) logged only
> about one regionserver that is shown in stale tasks (logs sorted oldest
> first):
>
> {noformat}
> 2020-05-21 19:04:26,467 INFO [master/hbasemaster-1:16000:becomeActiveMaster]
> assignment.AssignmentManager: Attach pid=20, ppid=18,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN; TransitRegionStateProcedure
> table=hbase:meta, region=1588230740, ASSIGN to rit=OPENING,
> location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
> table=hbase:meta, region=1588230740 to restore RIT
> 2020-05-21 19:04:26,758 INFO [master/hbasemaster-1:16000:becomeActiveMaster]
> master.HMaster: hbase:meta {1588230740 state=OPENING, ts=1590087849283,
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010}
> 2020-05-21 19:04:26,784 INFO [PEWorker-1] zookeeper.MetaTableLocator: Setting
> hbase:meta (replicaId=0) location in ZooKeeper as
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
> 2020-05-21 19:04:27,022 WARN [PEWorker-1]
> assignment.RegionRemoteProcedureBase: Can not add remote operation pid=39,
> ppid=20, state=RUNNABLE, locked=true;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region
> {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY =>
> ''} to server
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, this
> usually because the server is alread dead, give up and mark the procedure as
> complete, the parent procedure will take care of this.
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException:
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010; pid=39,
> ppid=20, state=RUNNABLE, locked=true;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2020-05-21 19:04:27,048 INFO [PEWorker-2]
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=20,
> ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
> rit=OPENING,
> location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
> {noformat}
>
> It hasn't logged anything about procedures 37 and 38 (the successfully
> completed ServerCrashProcedures).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)