[
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999765#comment-16999765
]
Duo Zhang commented on HBASE-23594:
-----------------------------------
{quote}
Process ServerCrashProcedure(100.107.165.41,60020,1576552792328) assign region
cf9a4ec6cd890aa6806fb313d71e2ebd,
region assign to 100.107.176.215,60020,1576552834619, but failed, so retry and
assign to 100.107.164.90,60020,1576553001648 and open on
100.107.164.90,60020,1576553001648 success.
{quote}
This is a bit strange, we will only try to reassign when the regionserver is
dead, and if the target server is dead, we will skip reassign and let the SCP
interrupt us.
So the problem here is that, we quit due to other conditions, and then in ORP
we find out the RS is not dead, so we finish the ORP and let the parent TRSP to
reassign, and at the same time, the RS is dead and SCP is coming, and finds out
the region is in opening state so includes it in the region list for
reassigning, and then the existing TRSP finished, so when assigning regions in
SCP, we will assign it again. This could cause a double assign or a dead lock
if it schedules the region to the same server again.
Seems possible, [~binlijin] could you please what is the reason we give up
assign the region to 100.107.176.215,60020,1576552834619? You can see the code
in RSProcedureDispatcher.ExecuteProceduresRemoteCall.scheduleForRetry, I think
we have a warn log for every quit reason.
Thanks.
> Procedure stuck due to region happen to recorded on two servers.
> ----------------------------------------------------------------
>
> Key: HBASE-23594
> URL: https://issues.apache.org/jira/browse/HBASE-23594
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.2.2
> Reporter: Lijin Bin
> Priority: Major
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd"
> hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=34 size=1662) to run queue because: the exclusive lock is not held
> by anyone when adding pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO [KeepAlivePEWorker-17]
> procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO [KeepAlivePEWorker-17]
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null;
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17]
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
> 2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:60000]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,912 INFO [PEWorker-9]
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN,
> location=100.107.176.215,60020,1576552834619;
> loc=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 INFO [PEWorker-9] assignment.RegionStateStore:
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
> regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
> procedure pid=193706, ppid=187614,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
> 2019-12-17 11:24:22,918 WARN [PEWorker-8]
> assignment.RegionRemoteProcedureBase: Can not add remote operation
> pid=243482, ppid=193706, state=RUNNABLE, locked=true;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region
> {ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME =>
> 'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.',
> STARTKEY => 'user68694', ENDKEY => 'user68703'} to server
> 100.107.176.215,60020,1576552834619, this usually because the server is
> alread dead, give up and mark the procedure as complete, the parent procedure
> will take care of this.
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-8]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=331 size=1664) to run queue because: pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,921 INFO [PEWorker-8] procedure2.ProcedureExecutor:
> Finished subprocedure pid=243482, resume processing parent pid=193706,
> ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED,
> locked=true; TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,921 INFO [PEWorker-9]
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647;
> pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING,
> location=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
> procedure pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52157th rollback step
> 2019-12-17 11:24:22,922 INFO [PEWorker-9]
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, location=null;
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,922 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
> procedure pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52164th rollback step
> 2019-12-17 11:24:22,930 DEBUG [master/100.107.176.225:60000]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=331 size=1661) to run queue because: pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,943 INFO [PEWorker-14]
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPENING,
> location=100.107.164.90,60020,1576553001648;
> loc=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:22,943 INFO [PEWorker-14] assignment.RegionStateStore:
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
> regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:22,944 DEBUG [PEWorker-14] procedure2.RootProcedureState:
> Add procedure pid=193706, ppid=187614,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52284th rollback step
> 2019-12-17 11:24:23,000 INFO [PEWorker-11] assignment.RegionStateStore:
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
> regionState=OPEN, openSeqNum=17,
> regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:23,003 DEBUG [PEWorker-11]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=336 size=1635) to run queue because: pid=193706, ppid=187614,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:23,003 INFO [PEWorker-11] procedure2.ProcedureExecutor:
> Finished subprocedure pid=243523, resume processing parent pid=193706,
> ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED,
> locked=true; TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:23,003 DEBUG [KeepAlivePEWorker-17]
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614,
> state=SUCCESS, locked=true; TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52624th rollback step
> 2019-12-17 11:24:23,005 INFO [KeepAlivePEWorker-17]
> procedure2.ProcedureExecutor: Finished pid=193706, ppid=187614,
> state=SUCCESS; TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN in 19.7230sec
> 2019-12-17 11:24:28,902 DEBUG [PEWorker-7]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=27 size=1856) to run queue because: the exclusive lock is not held
> by anyone when adding pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_GE
> 2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0]
> procedure2.ProcedureExecutor: Stored pid=243483,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false
> 2019-12-17 11:24:39,485 INFO [PEWorker-10]
> procedure.MasterProcedureScheduler: Took xlock for pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:39,486 INFO [PEWorker-10]
> assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null;
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:39,486 DEBUG [PEWorker-10] procedure2.RootProcedureState:
> Add procedure pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68029th rollback step
> 2019-12-17 11:24:39,571 DEBUG [master/100.107.176.225:60000]
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
> sharedLock=47 size=1847) to run queue because: pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:39,625 INFO [PEWorker-16]
> assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483,
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN,
> location=100.107.164.90,60020,1576553001648;
> loc=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:39,625 INFO [PEWorker-16] assignment.RegionStateStore:
> pid=258071 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
> regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:39,626 DEBUG [PEWorker-16] procedure2.RootProcedureState:
> Add procedure pid=258071, ppid=243483,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
> TransitRegionStateProcedure table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68773th rollback step
> 2019-12-17 11:26:23,217 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:27:23,217 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:28:06,532 INFO [master/100.107.176.225:60000.Chore.1]
> master.HMaster: Not running balancer because 1 region(s) in transition:
> [rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd]
> 2019-12-17 11:28:23,218 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:29:23,218 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:30:23,219 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition rit=OPENING,
> location=100.107.164.90,60020,1576553001648, table=table1w_7,
> region=cf9a4ec6cd890aa6806fb313d71e2ebd
> {code}
> RegionServer log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd"
> hbase-hbaseadmin-regionserver-100.107.164.90.log
> 2019-12-17 11:24:22,972 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
> handler.AssignRegionHandler: Open
> table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
> 2019-12-17 11:24:22,972 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion:
> filter rows disabled with empty filtering pattern for
> region=table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd..
> 2019-12-17 11:24:22,983 INFO
> [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] hfile.CacheConfig: Created
> cacheConfig: cacheDataOnRead=true, cacheDataOnWrite=false,
> cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=true,
> cacheDataCompressed=false, prefetchOnOpen=false for family {NAME => 'cf',
> VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR =>
> 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false',
> DATA_BLOCK_ENCODING => 'DIFF', TTL => 'FOREVER', MIN_VERSIONS => '0',
> REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE =>
> 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false',
> PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'SNAPPY', BLOCKCACHE =>
> 'true', BLOCKSIZE => '65536'} with
> blockCache=org.apache.hadoop.hbase.io.hfile.CombinedBlockCache@1376a714
> 2019-12-17 11:24:22,983 INFO
> [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1]
> compactions.CompactionConfiguration: size [128 MB, 2 GB, 2 GB); files [7,
> 25); ratio 1.200000; off-peak ratio 5.000000; throttle point 4294967296;
> major period 0, major jitter 0.500000, min locality to compact 0.000000;
> tiered compaction: max_age 9223372036854775807, incoming window min 6,
> compaction policy for tiered window
> org.apache.hadoop.hbase.regionserver.compactions.ExploringCompactionPolicy,
> single output for minor true, compaction window factory
> org.apache.hadoop.hbase.regionserver.compactions.ExponentialCompactionWindowFactory
> 2019-12-17 11:24:22,984 INFO
> [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] regionserver.HStore:
> Store=cf, memstore type=DefaultMemStore, storagePolicy=HOT,
> verifyBulkLoads=false, parallelPutCountPrintThreshold=50, encoding=DIFF,
> compression=SNAPPY
> 2019-12-17 11:24:22,987 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion:
> Opened cf9a4ec6cd890aa6806fb313d71e2ebd; next sequenceid=17
> 2019-12-17 11:24:22,987 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
> regionserver.HRegionServer: Post open deploy tasks for
> table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.,
> openProcId=243523, masterSystemTime=1576553062965
> 2019-12-17 11:24:22,990 INFO
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
> handler.AssignRegionHandler: Opened
> table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
> 2019-12-17 11:24:39,657 WARN
> [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
> handler.AssignRegionHandler: Received OPEN for the
> region:table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.,
> which is already online
> {code}
> 2019-12-17 11:23:49,598 DEBUG [RegionServerTracker-0]
> procedure2.ProcedureExecutor: Stored pid=187614,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=100.107.165.41,60020,1576552792328, splitWal=true, meta=true
> Process ServerCrashProcedure(100.107.165.41,60020,1576552792328) assign
> region cf9a4ec6cd890aa6806fb313d71e2ebd,
> region assign to 100.107.176.215,60020,1576552834619, but failed, so retry
> and assign to 100.107.164.90,60020,1576553001648 and open on
> 100.107.164.90,60020,1576553001648 success.
> Later process ServerCrashProcedure(100.107.176.215,60020,1576552834619) and
> find region cf9a4ec6cd890aa6806fb313d71e2ebd on this server also, than try
> assign this region, and assign the region to
> 100.107.164.90,60020,1576553001648 also, but
> 100.107.164.90,60020,1576553001648 find it has open this region already, than
> the assign stuck.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)