[
https://issues.apache.org/jira/browse/HBASE-24090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071913#comment-17071913
]
Duo Zhang commented on HBASE-24090:
-----------------------------------
Can you confirm whether the RS 'RS-IP,RS-Port,1585060304365' is still alive?
We will remove the region from the RIT map when finishing the TRSP, where we
will call RegionStateNode.unsetProcedure.
And IIRC, we do not use the OFFLINE state any more unless you call
offlineRegion explicitly, so why is the region in OFFLINE state when restarting?
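
To help answer the first question, here is a rough sketch (not taken from HBase itself) that decodes the start code at the end of a server name such as 'RS-IP,RS-Port,1585060304365'. It assumes the third field is the region server's start time in epoch milliseconds. The helper names parse_server_name and start_time are made up for illustration. Comparing the two start codes in the logs below suggests the stuck-RIT warning first pointed at an older RS instance, while the final assignment went to one started later.

```python
from datetime import datetime, timedelta, timezone


def parse_server_name(server_name):
    """Split 'host,port,startcode' into (host, port, start_code_ms).

    Hypothetical helper; assumes the last field is epoch milliseconds.
    """
    host, port, start_code = server_name.rsplit(",", 2)
    return host, port, int(start_code)


def start_time(server_name, tz=timezone.utc):
    """Return the start code of a server name as an aware datetime."""
    _, _, start_code = parse_server_name(server_name)
    seconds, millis = divmod(start_code, 1000)
    # Integer arithmetic avoids float rounding on the millisecond part.
    return datetime.fromtimestamp(seconds, tz=tz) + timedelta(milliseconds=millis)


if __name__ == "__main__":
    old_rs = "RS-IP,RS-Port,1585057875346"  # location in the first STUCK warning
    new_rs = "RS-IP,RS-Port,1585060304365"  # location after the final assignment
    for name in (old_rs, new_rs):
        print(name, "started at", start_time(name).isoformat())
    # A larger start code means a later start on the same host and port.
    print("new instance is newer:",
          parse_server_name(new_rs)[2] > parse_server_name(old_rs)[2])
```

The times are printed in UTC. The log timestamps in the description appear to be in UTC+8 (masterSystemTime=1585060947225 is logged at 22:42:27), so pass a different tz if you want to line them up directly.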
> Regions Stuck in RIT in OPEN state
> ----------------------------------
>
> Key: HBASE-24090
> URL: https://issues.apache.org/jira/browse/HBASE-24090
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Reporter: Pankaj Kumar
> Priority: Major
>
> Observed a few regions stuck in RIT in OPEN state in a cluster restart scenario.
> Analysis:
> 1. All RS were killed abruptly.
> 2. HMaster started SCP and initiated region assignments
> {noformat}
> 2020-03-24 22:27:08,821 | INFO | PEWorker-20 | Initialized subprocedures=[
> {pid=49703, ppid=46611,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=usertable18,
> region=75a79e978362d6f4ee1a3e27dfc5d4b6, ASSIGN},...] |
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1697)
> {noformat}
> 3. But an HMaster failover happened before the assignment completed.
> 4. The new active master loaded the previous procedures and restored them to RIT
> {noformat}
> 2020-03-24 22:30:04,815 | INFO | master/HM-IP:HM-PORT:becomeActiveMaster |
> Attach pid=49703, ppid=46611,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=usertable18,
> region=75a79e978362d6f4ee1a3e27dfc5d4b6, ASSIGN to rit=OFFLINE,
> location=null, table=usertable18, region=75a79e978362d6f4ee1a3e27dfc5d4b6 to
> restore RIT |
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.lambda$setupRIT$0(AssignmentManager.java:280)
> ---
> 2020-03-24 22:32:52,153 | WARN | ProcExecTimeout | STUCK
> Region-In-Transition rit=OPEN, location=RS-IP,RS-Port,1585057875346,
> table=usertable18, region=75a79e978362d6f4ee1a3e27dfc5d4b6 |
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.handleRegionOverStuckWarningThreshold(AssignmentManager.java:1340)
> ---
> 2020-03-24 22:41:51,837 | WARN | master/HM-IP:HM-PORT.Chore.1 |
> unknown_server=RS-IP,RS-Port,1585057875346/usertable01,user10268,1585053943990.871858cf2ef25a9e0e6b4f022a16ebc9.,....
> | org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:181)
> {noformat}
> Region assignment was slow because we are testing with a huge number of
> regions per RS, so the RIT WARN message was logged.
> 5. Finally, the region was assigned
> HM log:
> {noformat}
> 2020-03-24 22:42:26,386 | INFO | PEWorker-11 | Took xlock for pid=49703,
> ppid=46611, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=usertable18,
> region=75a79e978362d6f4ee1a3e27dfc5d4b6, ASSIGN |
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler.waitRegions(MasterProcedureScheduler.java:737)
> 2020-03-24 22:42:26,446 | INFO | PEWorker-11 | Starting pid=49703,
> ppid=46611, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE,
> locked=true; TransitRegionStateProcedure table=usertable18,
> region=75a79e978362d6f4ee1a3e27dfc5d4b6, ASSIGN; rit=OPEN, location=null;
> forceNewPlan=true, retain=false |
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.queueAssign(TransitRegionStateProcedure.java:189)
> 2020-03-24 22:42:26,717 | INFO | PEWorker-17 | pid=49703 updating hbase:meta
> row=75a79e978362d6f4ee1a3e27dfc5d4b6, regionState=OPENING,
> regionLocation=RS-IP,RS-Port,1585060304365 |
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:201)
> 2020-03-24 22:42:27,439 | INFO | PEWorker-19 | pid=49703 updating hbase:meta
> row=75a79e978362d6f4ee1a3e27dfc5d4b6, regionState=OPEN, openSeqNum=5,
> regionLocation=RS-IP,RS-Port,1585060304365 |
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:201)
> 2020-03-24 22:42:27,701 | INFO | PEWorker-19 | Finished subprocedure
> pid=73705, resume processing parent pid=49703, ppid=46611,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=usertable18,
> region=75a79e978362d6f4ee1a3e27dfc5d4b6, ASSIGN |
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.countDownChildren(ProcedureExecutor.java:1837)
> 2020-03-24 22:42:27,821 | INFO | PEWorker-15 | Finished pid=49703,
> ppid=46611, state=SUCCESS; TransitRegionStateProcedure table=usertable18,
> region=75a79e978362d6f4ee1a3e27dfc5d4b6, ASSIGN in 15mins, 18.888sec |
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1427)
> {noformat}
> RS Log:
> {noformat}
> 2020-03-24 22:42:27,230 | INFO |
> RS_OPEN_REGION-regionserver/RS-IP:RS-Port-34 | Open
> usertable18,user29616,1585055007688.75a79e978362d6f4ee1a3e27dfc5d4b6. |
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler.process(AssignRegionHandler.java:123)
> 2020-03-24 22:42:27,241 | INFO |
> StoreOpener-75a79e978362d6f4ee1a3e27dfc5d4b6-1 | Created cacheConfig:
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false,
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false,
> prefetchOnOpen=false for family {NAME => 'family', VERSIONS => '1',
> EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false',
> KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false',
> DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0',
> REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE =>
> 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false',
> PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE =>
> 'true', BLOCKSIZE => '65536'} with blockCache=LruBlockCache{blockCount=0,
> currentSize=5.74 MB, freeSize=7.64 GB, maxSize=7.65 GB, heapSize=5.74 MB,
> minSize=7.27 GB, minFactor=0.95, multiSize=3.63 GB, multiFactor=0.5,
> singleSize=1.82 GB, singleFactor=0.25} |
> org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:174)
> 2020-03-24 22:42:27,242 | INFO |
> StoreOpener-75a79e978362d6f4ee1a3e27dfc5d4b6-1 | size [128 MB, 8.00 EB, 8.00
> EB); files [6, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point
> 1610612736; major period 604800000, major jitter 0.500000, min locality to
> compact 0.000000; tiered compaction: max_age 9223372036854775807, incoming
> window min 6, compaction policy for tiered window
> org.apache.hadoop.hbase.regionserver.compactions.ExploringCompactionPolicy,
> single output for minor true, compaction window factory
> org.apache.hadoop.hbase.regionserver.compactions.ExponentialCompactionWindowFactory
> |
> org.apache.hadoop.hbase.regionserver.compactions.CompactionConfiguration.<init>(CompactionConfiguration.java:147)
> 2020-03-24 22:42:27,243 | INFO |
> StoreOpener-75a79e978362d6f4ee1a3e27dfc5d4b6-1 | Store=family, memstore
> type=DefaultMemStore, storagePolicy=HOT, verifyBulkLoads=false,
> parallelPutCountPrintThreshold=50, encoding=NONE, compression=NONE |
> org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:335)
> 2020-03-24 22:42:27,252 | INFO |
> RS_OPEN_REGION-regionserver/RS-IP:RS-Port-34 | Opened
> 75a79e978362d6f4ee1a3e27dfc5d4b6; next sequenceid=5 |
> org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:1067)
> 2020-03-24 22:42:27,254 | INFO |
> RS_OPEN_REGION-regionserver/RS-IP:RS-Port-34 | Post open deploy tasks for
> usertable18,user29616,1585055007688.75a79e978362d6f4ee1a3e27dfc5d4b6.,
> openProcId=73705, masterSystemTime=1585060947225 |
> org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:2379)
> 2020-03-24 22:42:27,320 | INFO |
> RS_OPEN_REGION-regionserver/RS-IP:RS-Port-34 | Opened
> usertable18,user29616,1585055007688.75a79e978362d6f4ee1a3e27dfc5d4b6. |
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler.process(AssignRegionHandler.java:141)
> {noformat}
> 6. Even though the region was opened successfully, it is still in RIT, in
> OPEN state
> {noformat}
> 2020-03-24 22:49:05,432 | WARN | ProcExecTimeout | STUCK
> Region-In-Transition rit=OPEN, location=RS-IP,RS-Port,1585060304365,
> table=usertable18, region=75a79e978362d6f4ee1a3e27dfc5d4b6 |
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.handleRegionOverStuckWarningThreshold(AssignmentManager.java:1340)
> {noformat}
> This WARN message keeps occurring in the HM log.
> HBase version: 2.2.3