[
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HBASE-23594 started by Duo Zhang.
-----------------------------------------
> Procedure stuck due to region happen to recorded on two servers.
> ----------------------------------------------------------------
>
> Key: HBASE-23594
> URL: https://issues.apache.org/jira/browse/HBASE-23594
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.2.2
> Reporter: Lijin Bin
> Assignee: Duo Zhang
> Priority: Major
> Attachments: HBASE-23594-UT.patch
>
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd"
> hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=34 size=1662) to run queue because: the exclusive lock is not held
> by anyone when adding pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO [KeepAlivePEWorker-17]
> procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO [KeepAlivePEWorker-17]
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null;
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17]
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
> 2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:60000]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,912 INFO [PEWorker-9]
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN,
> location=100.107.176.215,60020,1576552834619;
> loc=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 INFO [PEWorker-9] assignment.RegionStateStore:
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
> regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
> procedure pid=193706, ppid=187614,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
> 2019-12-17 11:24:22,918 WARN [PEWorker-8]
> assignment.RegionRemoteProcedureBase: Can not add remote operation
> pid=243482, ppid=193706, state=RUNNABLE, locked=true;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region
> {ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME =>
> 'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.',
> STARTKEY => 'user68694', ENDKEY => 'user68703'} to server
> 100.107.176.215,60020,1576552834619, this usually because the server is
> alread dead, give up and mark the procedure as complete, the parent procedure
> will take care of this.
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-8]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=331 size=1664) to run queue because: pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,921 INFO [PEWorker-8] procedure2.ProcedureExecutor:
> Finished subprocedure pid=243482, resume processing parent pid=193706,
> ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED,
> locked=true; TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,921 INFO [PEWorker-9]
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647;
> pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING,
> location=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
> procedure pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52157th rollback step
> 2019-12-17 11:24:22,922 INFO [PEWorker-9]
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, location=null;
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,922 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
> procedure pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52164th rollback step
> 2019-12-17 11:24:22,930 DEBUG [master/100.107.176.225:60000]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=331 size=1661) to run queue because: pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,943 INFO [PEWorker-14]
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPENING,
> location=100.107.164.90,60020,1576553001648;
> loc=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:22,943 INFO [PEWorker-14] assignment.RegionStateStore:
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
> regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:22,944 DEBUG [PEWorker-14] procedure2.RootProcedureState:
> Add procedure pid=193706, ppid=187614,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52284th rollback step
> 2019-12-17 11:24:23,000 INFO [PEWorker-11] assignment.RegionStateStore:
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
> regionState=OPEN, openSeqNum=17,
> regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:23,003 DEBUG [PEWorker-11]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=336 size=1635) to run queue because: pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:23,003 INFO [PEWorker-11] procedure2.ProcedureExecutor:
> Finished subprocedure pid=243523, resume processing parent pid=193706,
> ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED,
> locked=true; TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:23,003 DEBUG [KeepAlivePEWorker-17]
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614,
> state=SUCCESS, locked=true; TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52624th rollback step
> 2019-12-17 11:24:23,005 INFO [KeepAlivePEWorker-17]
> procedure2.ProcedureExecutor: Finished pid=193706, ppid=187614,
> state=SUCCESS; TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN in 19.7230sec
> 2019-12-17 11:24:28,902 DEBUG [PEWorker-7]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=27 size=1856) to run queue because: the exclusive lock is not held
> by anyone when adding pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_GE
> 2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0]
> procedure2.ProcedureExecutor: Stored pid=243483,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false
> 2019-12-17 11:24:39,485 INFO [PEWorker-10]
> procedure.MasterProcedureScheduler: Took xlock for pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:39,486 INFO [PEWorker-10]
> assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null;
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:39,486 DEBUG [PEWorker-10] procedure2.RootProcedureState:
> Add procedure pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68029th rollback step
> 2019-12-17 11:24:39,571 DEBUG [master/100.107.176.225:60000]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=47 size=1847) to run queue because: pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:39,625 INFO [PEWorker-16]
> assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN,
> location=100.107.164.90,60020,1576553001648;
> loc=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:39,625 INFO [PEWorker-16] assignment.RegionStateStore:
> pid=258071 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
> regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:39,626 DEBUG [PEWorker-16] procedure2.RootProcedureState:
> Add procedure pid=258071, ppid=243483,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68773th rollback step
> 2019-12-17 11:26:23,217 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:27:23,217 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:28:06,532 INFO [master/100.107.176.225:60000.Chore.1]
> master.HMaster: Not running balancer because 1 region(s) in transition:
> [rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd]
> 2019-12-17 11:28:23,218 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:29:23,218 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:30:23,219 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> {code}
> RegionServer log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd"
> hbase-hbaseadmin-regionserver-100.107.164.90.log
> 2019-12-17 11:24:22,972 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
> handler.AssignRegionHandler: Open
> table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
> 2019-12-17 11:24:22,972 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion:
> filter rows disabled with empty filtering pattern for
> region=table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd..
> 2019-12-17 11:24:22,983 INFO
> [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] hfile.CacheConfig: Created
> cacheConfig: cacheDataOnRead=true, cacheDataOnWrite=false,
> cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=true,
> cacheDataCompressed=false, prefetchOnOpen=false for family {NAME => 'cf',
> VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR =>
> 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false',
> DATA_BLOCK_ENCODING => 'DIFF', TTL => 'FOREVER', MIN_VERSIONS => '0',
> REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE =>
> 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false',
> PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'SNAPPY', BLOCKCACHE =>
> 'true', BLOCKSIZE => '65536'} with
> blockCache=org.apache.hadoop.hbase.io.hfile.CombinedBlockCache@1376a714
> 2019-12-17 11:24:22,983 INFO
> [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1]
> compactions.CompactionConfiguration: size [128 MB, 2 GB, 2 GB); files [7,
> 25); ratio 1.200000; off-peak ratio 5.000000; throttle point 4294967296;
> major period 0, major jitter 0.500000, min locality to compact 0.000000;
> tiered compaction: max_age 9223372036854775807, incoming window min 6,
> compaction policy for tiered window
> org.apache.hadoop.hbase.regionserver.compactions.ExploringCompactionPolicy,
> single output for minor true, compaction window factory
> org.apache.hadoop.hbase.regionserver.compactions.ExponentialCompactionWindowFactory
> 2019-12-17 11:24:22,984 INFO
> [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] regionserver.HStore:
> Store=cf, memstore type=DefaultMemStore, storagePolicy=HOT,
> verifyBulkLoads=false, parallelPutCountPrintThreshold=50, encoding=DIFF,
> compression=SNAPPY
> 2019-12-17 11:24:22,987 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion:
> Opened cf9a4ec6cd890aa6806fb313d71e2ebd; next sequenceid=17
> 2019-12-17 11:24:22,987 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
> regionserver.HRegionServer: Post open deploy tasks for
> table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.,
> openProcId=243523, masterSystemTime=1576553062965
> 2019-12-17 11:24:22,990 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
> handler.AssignRegionHandler: Opened
> table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
> 2019-12-17 11:24:39,657 WARN
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
> handler.AssignRegionHandler: Received OPEN for the
> region:table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.,
> which is already online
> {code}
> 2019-12-17 11:23:49,598 DEBUG [RegionServerTracker-0]
> procedure2.ProcedureExecutor: Stored pid=187614,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=100.107.165.41,60020,1576552792328, splitWal=true, meta=true
> Process ServerCrashProcedure(100.107.165.41,60020,1576552792328) assign
> region cf9a4ec6cd890aa6806fb313d71e2ebd,
> region assign to 100.107.176.215,60020,1576552834619, but failed, so retry
> and assign to 100.107.164.90,60020,1576553001648 and open on
> 100.107.164.90,60020,1576553001648 success.
> Later process ServerCrashProcedure(100.107.176.215,60020,1576552834619) and
> find region cf9a4ec6cd890aa6806fb313d71e2ebd on this server also, than try
> assign this region, and assign the region to
> 100.107.164.90,60020,1576553001648 also, but
> 100.107.164.90,60020,1576553001648 find it has open this region already, than
> the assign stuck.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)