Lijin Bin created HBASE-23594:
---------------------------------
Summary: Procedure stuck due to region happen to recorded on two
servers.
Key: HBASE-23594
URL: https://issues.apache.org/jira/browse/HBASE-23594
Project: HBase
Issue Type: Bug
Reporter: Lijin Bin
Master log:
{code}
$ grep "cf9a4ec6cd890aa6806fb313d71e2ebd"
hbase-hbaseadmin-master-100.107.176.225.log.1
2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20]
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
sharedLock=34 size=1662) to run queue because: the exclusive lock is not held
by anyone when adding pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,851 INFO [KeepAlivePEWorker-17]
procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,852 INFO [KeepAlivePEWorker-17]
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null;
forceNewPlan=true, retain=false
2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17]
procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:60000]
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:22,912 INFO [PEWorker-9]
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN,
location=100.107.176.215,60020,1576552834619;
loc=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 INFO [PEWorker-9] assignment.RegionStateStore:
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
procedure pid=193706, ppid=187614,
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
2019-12-17 11:24:22,918 WARN [PEWorker-8]
assignment.RegionRemoteProcedureBase: Can not add remote operation pid=243482,
ppid=193706, state=RUNNABLE, locked=true;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region
{ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME =>
'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', STARTKEY
=> 'user68694', ENDKEY => 'user68703'} to server
100.107.176.215,60020,1576552834619, this usually because the server is alread
dead, give up and mark the procedure as complete, the parent procedure will
take care of this.
2019-12-17 11:24:22,921 DEBUG [PEWorker-8] procedure.MasterProcedureScheduler:
Add TableQueue(table1w_7, xlock=false sharedLock=331 size=1664) to run queue
because: pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:22,921 INFO [PEWorker-8] procedure2.ProcedureExecutor:
Finished subprocedure pid=243482, resume processing parent pid=193706,
ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED,
locked=true; TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,921 INFO [PEWorker-9]
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=193706,
ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED,
locked=true; TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING,
location=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,921 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
procedure pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52157th rollback step
2019-12-17 11:24:22,922 INFO [PEWorker-9]
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, location=null;
forceNewPlan=true, retain=false
2019-12-17 11:24:22,922 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add
procedure pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN,
locked=true; TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52164th rollback step
2019-12-17 11:24:22,930 DEBUG [master/100.107.176.225:60000]
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
sharedLock=331 size=1661) to run queue because: pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:22,943 INFO [PEWorker-14]
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPENING,
location=100.107.164.90,60020,1576553001648;
loc=100.107.164.90,60020,1576553001648
2019-12-17 11:24:22,943 INFO [PEWorker-14] assignment.RegionStateStore:
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
2019-12-17 11:24:22,944 DEBUG [PEWorker-14] procedure2.RootProcedureState: Add
procedure pid=193706, ppid=187614,
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52284th rollback step
2019-12-17 11:24:23,000 INFO [PEWorker-11] assignment.RegionStateStore:
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
regionState=OPEN, openSeqNum=17,
regionLocation=100.107.164.90,60020,1576553001648
2019-12-17 11:24:23,003 DEBUG [PEWorker-11] procedure.MasterProcedureScheduler:
Add TableQueue(table1w_7, xlock=false sharedLock=336 size=1635) to run queue
because: pid=193706, ppid=187614,
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:23,003 INFO [PEWorker-11] procedure2.ProcedureExecutor:
Finished subprocedure pid=243523, resume processing parent pid=193706,
ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED,
locked=true; TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:23,003 DEBUG [KeepAlivePEWorker-17]
procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614,
state=SUCCESS, locked=true; TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52624th rollback step
2019-12-17 11:24:23,005 INFO [KeepAlivePEWorker-17]
procedure2.ProcedureExecutor: Finished pid=193706, ppid=187614, state=SUCCESS;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN in 19.7230sec
2019-12-17 11:24:28,902 DEBUG [PEWorker-7] procedure.MasterProcedureScheduler:
Add TableQueue(table1w_7, xlock=false sharedLock=27 size=1856) to run queue
because: the exclusive lock is not held by anyone when adding pid=258071,
ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_GE
2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0]
procedure2.ProcedureExecutor: Stored pid=243483,
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false
2019-12-17 11:24:39,485 INFO [PEWorker-10] procedure.MasterProcedureScheduler:
Took xlock for pid=258071, ppid=243483,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:39,486 INFO [PEWorker-10]
assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null;
forceNewPlan=true, retain=false
2019-12-17 11:24:39,486 DEBUG [PEWorker-10] procedure2.RootProcedureState: Add
procedure pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN,
locked=true; TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68029th rollback step
2019-12-17 11:24:39,571 DEBUG [master/100.107.176.225:60000]
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false
sharedLock=47 size=1847) to run queue because: pid=258071, ppid=243483,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:39,625 INFO [PEWorker-16]
assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN,
location=100.107.164.90,60020,1576553001648;
loc=100.107.164.90,60020,1576553001648
2019-12-17 11:24:39,625 INFO [PEWorker-16] assignment.RegionStateStore:
pid=258071 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd,
regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
2019-12-17 11:24:39,626 DEBUG [PEWorker-16] procedure2.RootProcedureState: Add
procedure pid=258071, ppid=243483,
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
TransitRegionStateProcedure table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68773th rollback step
2019-12-17 11:26:23,217 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition rit=OPENING,
location=100.107.164.90,60020,1576553001648, table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd
2019-12-17 11:27:23,217 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition rit=OPENING,
location=100.107.164.90,60020,1576553001648, table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd
2019-12-17 11:28:06,532 INFO [master/100.107.176.225:60000.Chore.1]
master.HMaster: Not running balancer because 1 region(s) in transition:
[rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd]
2019-12-17 11:28:23,218 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition rit=OPENING,
location=100.107.164.90,60020,1576553001648, table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd
2019-12-17 11:29:23,218 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition rit=OPENING,
location=100.107.164.90,60020,1576553001648, table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd
2019-12-17 11:30:23,219 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition rit=OPENING,
location=100.107.164.90,60020,1576553001648, table=table1w_7,
region=cf9a4ec6cd890aa6806fb313d71e2ebd
{code}
RegionServer log:
{code}
$ grep "cf9a4ec6cd890aa6806fb313d71e2ebd"
hbase-hbaseadmin-regionserver-100.107.164.90.log
2019-12-17 11:24:22,972 INFO
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
handler.AssignRegionHandler: Open
table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
2019-12-17 11:24:22,972 INFO
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion:
filter rows disabled with empty filtering pattern for
region=table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd..
2019-12-17 11:24:22,983 INFO [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1]
hfile.CacheConfig: Created cacheConfig: cacheDataOnRead=true,
cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false,
cacheEvictOnClose=true, cacheDataCompressed=false, prefetchOnOpen=false for
family {NAME => 'cf', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false',
NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE',
CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'DIFF', TTL =>
'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW',
CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE =>
'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'SNAPPY',
BLOCKCACHE => 'true', BLOCKSIZE => '65536'} with
blockCache=org.apache.hadoop.hbase.io.hfile.CombinedBlockCache@1376a714
2019-12-17 11:24:22,983 INFO [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1]
compactions.CompactionConfiguration: size [128 MB, 2 GB, 2 GB); files [7, 25);
ratio 1.200000; off-peak ratio 5.000000; throttle point 4294967296; major
period 0, major jitter 0.500000, min locality to compact 0.000000; tiered
compaction: max_age 9223372036854775807, incoming window min 6, compaction
policy for tiered window
org.apache.hadoop.hbase.regionserver.compactions.ExploringCompactionPolicy,
single output for minor true, compaction window factory
org.apache.hadoop.hbase.regionserver.compactions.ExponentialCompactionWindowFactory
2019-12-17 11:24:22,984 INFO [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1]
regionserver.HStore: Store=cf, memstore type=DefaultMemStore,
storagePolicy=HOT, verifyBulkLoads=false, parallelPutCountPrintThreshold=50,
encoding=DIFF, compression=SNAPPY
2019-12-17 11:24:22,987 INFO
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion:
Opened cf9a4ec6cd890aa6806fb313d71e2ebd; next sequenceid=17
2019-12-17 11:24:22,987 INFO
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
regionserver.HRegionServer: Post open deploy tasks for
table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.,
openProcId=243523, masterSystemTime=1576553062965
2019-12-17 11:24:22,990 INFO
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
handler.AssignRegionHandler: Opened
table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
2019-12-17 11:24:39,657 WARN
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1]
handler.AssignRegionHandler: Received OPEN for the
region:table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.,
which is already online
{code}
2019-12-17 11:23:49,598 DEBUG [RegionServerTracker-0]
procedure2.ProcedureExecutor: Stored pid=187614,
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=100.107.165.41,60020,1576552792328, splitWal=true, meta=true
Process ServerCrashProcedure(100.107.165.41,60020,1576552792328) assign region
cf9a4ec6cd890aa6806fb313d71e2ebd,
region assign to 100.107.176.215,60020,1576552834619, but failed, so retry and
assign to 100.107.164.90,60020,1576553001648 and open on
100.107.164.90,60020,1576553001648 success.
Later process ServerCrashProcedure(100.107.176.215,60020,1576552834619) and
find region cf9a4ec6cd890aa6806fb313d71e2ebd on this server also, than try
assign this region, and assign the region to 100.107.164.90,60020,1576553001648
also, but 100.107.164.90,60020,1576553001648 find it has open this region
already, than the assign stuck.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)