[ 
https://issues.apache.org/jira/browse/HBASE-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704339#comment-16704339
 ] 

Duo Zhang commented on HBASE-21534:
-----------------------------------

Look at the log, it seems that we assign meta to the server which is already 
marked as dead. This should be a test issue, let me provide a fix.

{noformat}
2018-11-29 20:47:49,751 INFO  [PEWorker-63] 
procedure.ServerCrashProcedure(131): Start pid=14, 
state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure 
server=localhost,102,1, splitWal=false, meta=true
2018-11-29 20:47:49,751 DEBUG [PEWorker-63] procedure2.RootProcedureState(153): 
Add procedure pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_META_LOGS, 
hasLock=true; ServerCrashProcedure server=localhost,102,1, splitWal=false, 
meta=true as the 0th rollback step
2018-11-29 20:47:49,751 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(351): Add ServerQueue(localhost,102,1, 
xlock=false sharedLock=0 size=0) to run queue because: pid=14, 
state=RUNNABLE:SERVER_CRASH_SPLIT_META_LOGS, hasLock=false; 
ServerCrashProcedure server=localhost,102,1, splitWal=false, meta=true released 
exclusive lock
2018-11-29 20:47:49,751 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(351): Add ServerQueue(localhost,102,1, 
xlock=false sharedLock=0 size=1) to run queue because: the exclusive lock is 
not held by anyone when adding pid=14, 
state=RUNNABLE:SERVER_CRASH_SPLIT_META_LOGS, hasLock=false; 
ServerCrashProcedure server=localhost,102,1, splitWal=false, meta=true
2018-11-29 20:47:49,751 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(361): Remove ServerQueue(localhost,102,1, 
xlock=false sharedLock=0 size=0) from run queue because: queue is empty after 
polling out pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_META_LOGS, hasLock=false; 
ServerCrashProcedure server=localhost,102,1, splitWal=false, meta=true
2018-11-29 20:47:49,751 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(361): Remove ServerQueue(localhost,102,1, 
xlock=true (14) sharedLock=0 size=0) from run queue because: pid=14, 
state=RUNNABLE:SERVER_CRASH_SPLIT_META_LOGS, hasLock=false; 
ServerCrashProcedure server=localhost,102,1, splitWal=false, meta=true held 
exclusive lock
2018-11-29 20:47:49,751 DEBUG [PEWorker-63] 
procedure.ServerCrashProcedure(207): Splitting meta WALs pid=14, 
state=RUNNABLE:SERVER_CRASH_SPLIT_META_LOGS, hasLock=true; ServerCrashProcedure 
server=localhost,102,1, splitWal=false, meta=true
2018-11-29 20:47:49,755 INFO  [PEWorker-63] master.MasterWalManager(314): Log 
dir for server localhost,102,1 does not exist
2018-11-29 20:47:49,755 INFO  [PEWorker-63] master.SplitLogManager(465): dead 
splitlog workers [localhost,102,1]
2018-11-29 20:47:49,755 INFO  [PEWorker-63] master.SplitLogManager(297): 
Finished splitting (more than or equal to) 0 bytes in 0 log files in [] in 0ms
2018-11-29 20:47:49,756 DEBUG [PEWorker-63] 
procedure.ServerCrashProcedure(213): Done splitting meta WALs pid=14, 
state=RUNNABLE:SERVER_CRASH_SPLIT_META_LOGS, hasLock=true; ServerCrashProcedure 
server=localhost,102,1, splitWal=false, meta=true
2018-11-29 20:47:49,756 DEBUG [PEWorker-63] procedure2.RootProcedureState(153): 
Add procedure pid=14, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, hasLock=true; 
ServerCrashProcedure server=localhost,102,1, splitWal=false, meta=true as the 
1th rollback step
2018-11-29 20:47:49,756 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(351): Add ServerQueue(localhost,102,1, 
xlock=false sharedLock=0 size=0) to run queue because: pid=14, 
state=RUNNABLE:SERVER_CRASH_ASSIGN_META, hasLock=false; ServerCrashProcedure 
server=localhost,102,1, splitWal=false, meta=true released exclusive lock
2018-11-29 20:47:49,756 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(351): Add ServerQueue(localhost,102,1, 
xlock=false sharedLock=0 size=1) to run queue because: the exclusive lock is 
not held by anyone when adding pid=14, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, 
hasLock=false; ServerCrashProcedure server=localhost,102,1, splitWal=false, 
meta=true
2018-11-29 20:47:49,756 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(361): Remove ServerQueue(localhost,102,1, 
xlock=false sharedLock=0 size=0) from run queue because: queue is empty after 
polling out pid=14, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, hasLock=false; 
ServerCrashProcedure server=localhost,102,1, splitWal=false, meta=true
2018-11-29 20:47:49,756 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(361): Remove ServerQueue(localhost,102,1, 
xlock=true (14) sharedLock=0 size=0) from run queue because: pid=14, 
state=RUNNABLE:SERVER_CRASH_ASSIGN_META, hasLock=false; ServerCrashProcedure 
server=localhost,102,1, splitWal=false, meta=true held exclusive lock
2018-11-29 20:47:49,758 INFO  [PEWorker-63] procedure2.ProcedureExecutor(1680): 
Initialized subprocedures=[{pid=15, ppid=14, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN}]
2018-11-29 20:47:49,758 DEBUG [PEWorker-63] procedure2.RootProcedureState(153): 
Add procedure pid=14, state=WAITING:SERVER_CRASH_GET_REGIONS, hasLock=true; 
ServerCrashProcedure server=localhost,102,1, splitWal=false, meta=true as the 
2th rollback step
2018-11-29 20:47:49,758 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(351): Add TableQueue(hbase:meta, xlock=false 
sharedLock=0 size=1) to run queue because: the exclusive lock is not held by 
anyone when adding pid=15, ppid=14, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN
2018-11-29 20:47:49,758 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(351): Add ServerQueue(localhost,102,1, 
xlock=false sharedLock=0 size=0) to run queue because: pid=14, 
state=WAITING:SERVER_CRASH_GET_REGIONS, hasLock=false; ServerCrashProcedure 
server=localhost,102,1, splitWal=false, meta=true released exclusive lock
2018-11-29 20:47:49,758 DEBUG [PEWorker-63] 
procedure.MasterProcedureScheduler(361): Remove TableQueue(hbase:meta, 
xlock=false sharedLock=0 size=0) from run queue because: queue is empty after 
polling out pid=15, ppid=14, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN
2018-11-29 20:47:49,758 INFO  [PEWorker-63] 
procedure.MasterProcedureScheduler(736): Took xlock for pid=15, ppid=14, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN
2018-11-29 20:47:49,758 INFO  [PEWorker-63] 
assignment.TransitRegionStateProcedure(166): Starting pid=15, ppid=14, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
rit=OPEN, location=localhost,102,1; forceNewPlan=false, retain=false
2018-11-29 20:47:49,759 DEBUG [PEWorker-63] procedure2.RootProcedureState(153): 
Add procedure pid=15, ppid=14, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, 
hasLock=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
ASSIGN as the 3th rollback step
2018-11-29 20:47:49,909 INFO  [mockmaster:1234] 
balancer.BaseLoadBalancer(1532): Reassigned 1 regions. 1 retained the 
pre-restart assignment. 
2018-11-29 20:47:49,909 DEBUG [mockmaster:1234] 
procedure.MasterProcedureScheduler(351): Add TableQueue(hbase:meta, xlock=false 
sharedLock=1 size=1) to run queue because: pid=15, ppid=14, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, hasLock=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN has lock
2018-11-29 20:47:49,909 DEBUG [PEWorker-36] 
procedure.MasterProcedureScheduler(361): Remove TableQueue(hbase:meta, 
xlock=false sharedLock=1 size=0) from run queue because: queue is empty after 
polling out pid=15, ppid=14, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, 
hasLock=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
ASSIGN
2018-11-29 20:47:49,910 INFO  [PEWorker-36] procedure2.ProcedureExecutor(1680): 
Initialized subprocedures=[{pid=16, ppid=15, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
2018-11-29 20:47:49,910 DEBUG [PEWorker-36] procedure2.RootProcedureState(153): 
Add procedure pid=15, ppid=14, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN as the 
4th rollback step
2018-11-29 20:47:49,910 DEBUG [PEWorker-36] 
procedure.MasterProcedureScheduler(351): Add TableQueue(hbase:meta, xlock=false 
sharedLock=1 size=1) to run queue because: the exclusive lock is not held by 
anyone when adding pid=16, ppid=15, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
2018-11-29 20:47:49,910 DEBUG [PEWorker-36] 
procedure.MasterProcedureScheduler(361): Remove TableQueue(hbase:meta, 
xlock=false sharedLock=1 size=0) from run queue because: queue is empty after 
polling out pid=16, ppid=15, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
2018-11-29 20:47:49,910 DEBUG [PEWorker-36] procedure2.RootProcedureState(153): 
Add procedure pid=16, ppid=15, state=RUNNABLE, hasLock=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 5th 
rollback step
2018-11-29 20:47:50,061 WARN  [MockRSProcedureDispatcher-pool5-t10] 
assignment.AssignmentManager(894): The region server localhost,102,1 is already 
dead, skip reportRegionStateTransition call
{noformat}

> TestAssignmentManager is flakey
> -------------------------------
>
>                 Key: HBASE-21534
>                 URL: https://issues.apache.org/jira/browse/HBASE-21534
>             Project: HBase
>          Issue Type: Task
>            Reporter: Duo Zhang
>            Priority: Major
>
> See this in the outout and then the test hang
> {noformat}
> 2018-11-29 20:47:50,061 WARN  [MockRSProcedureDispatcher-pool5-t10] 
> assignment.AssignmentManager(894): The region server localhost,102,1 is 
> already dead, skip reportRegionStateTransition call
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to