[
https://issues.apache.org/jira/browse/HBASE-23282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972926#comment-16972926
]
Michael Stack commented on HBASE-23282:
---------------------------------------
This is hard to read but it illustrates the above. There ARE regions in
hbase:meta that reference server.example.com, but the SCP run below does not
find them (note the "had 0 regions" line):
{code}
2019-11-11 17:54:03,136 DEBUG
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Stored pid=442039,
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=server.example.com,16020,1573370369484, splitWal=true, meta=false
2019-11-11 17:54:03,136 DEBUG
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add
ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0
size=1) to run queue because: the exclusive lock is not held by anyone when
adding pid=442039, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=server.example.com,16020,1573370369484, splitWal=true, meta=false
2019-11-11 17:54:03,138 DEBUG
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove
ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0
size=0) from run queue because: queue is empty after polling out pid=442039,
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=server.example.com,16020,1573370369484, splitWal=true, meta=false
2019-11-11 17:54:03,138 DEBUG
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove
ServerQueue(server.example.com,16020,1573370369484, xlock=true (442039)
sharedLock=0 size=0) from run queue because: pid=442039,
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=server.example.com,16020,1573370369484, splitWal=true,
meta=false held exclusive lock
2019-11-11 17:54:03,140 DEBUG org.apache.hadoop.hbase.master.DeadServer:
Started processing server.example.com,16020,1573370369484; numProcessing=1
2019-11-11 17:54:03,140 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start
pid=442039, state=RUNNABLE:SERVER_CRASH_START, locked=true;
ServerCrashProcedure server=server.example.com,16020,1573370369484,
splitWal=true, meta=false
2019-11-11 17:54:03,140 DEBUG
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure
pid=442039, state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true;
ServerCrashProcedure server=server.example.com,16020,1573370369484,
splitWal=true, meta=false as the 0th rollback step
2019-11-11 17:54:03,142 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure:
server.example.com,16020,1573370369484 had 0 regions
2019-11-11 17:54:03,142 DEBUG
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure
pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true;
ServerCrashProcedure server=server.example.com,16020,1573370369484,
splitWal=true, meta=false as the 1th rollback step
2019-11-11 17:54:03,143 DEBUG
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Splitting WALs
pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true;
ServerCrashProcedure server=server.example.com,16020,1573370369484,
splitWal=true, meta=false
2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.MasterWalManager:
Log dir for server server.example.com,16020,1573370369484 does not exist
2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.SplitLogManager:
dead splitlog workers [server.example.com,16020,1573370369484]
2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.SplitLogManager:
Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [] in
0ms
2019-11-11 17:54:03,145 DEBUG
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Done splitting
WALs pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true;
ServerCrashProcedure server=server.example.com,16020,1573370369484,
splitWal=true, meta=false
2019-11-11 17:54:03,146 DEBUG
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure
pid=442039, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true;
ServerCrashProcedure server=server.example.com,16020,1573370369484,
splitWal=true, meta=false as the 2th rollback step
2019-11-11 17:54:03,147 DEBUG
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure
pid=442039, state=RUNNABLE:SERVER_CRASH_FINISH, locked=true;
ServerCrashProcedure server=server.example.com,16020,1573370369484,
splitWal=true, meta=false as the 3th rollback step
2019-11-11 17:54:03,148 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: removed crashed
server server.example.com,16020,1573370369484 after splitting done
2019-11-11 17:54:03,149 DEBUG org.apache.hadoop.hbase.master.DeadServer:
Finished processing server.example.com,16020,1573370369484; numProcessing=0
2019-11-11 17:54:03,149 DEBUG org.apache.hadoop.hbase.master.DeadServer:
Removed server.example.com,16020,1573370369484 ; numProcessing=0
2019-11-11 17:54:03,149 DEBUG
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure
pid=442039, state=SUCCESS, locked=true; ServerCrashProcedure
server=server.example.com,16020,1573370369484, splitWal=true, meta=false as the
4th rollback step
2019-11-11 17:54:03,151 DEBUG
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add
ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0
size=0) to run queue because: pid=442039, state=SUCCESS; ServerCrashProcedure
server=server.example.com,16020,1573370369484, splitWal=true, meta=false
released exclusive lock
2019-11-11 17:54:03,151 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=442039,
state=SUCCESS; ServerCrashProcedure
server=server.example.com,16020,1573370369484, splitWal=true, meta=false in
115msec
2019-11-11 17:54:03,151 DEBUG
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove
ServerQueue(server.example.com,16020,1573370369484, xlock=true (442039)
sharedLock=0 size=0) from run queue because: clean up server queue after
pid=442039, state=SUCCESS; ServerCrashProcedure
server=server.example.com,16020,1573370369484, splitWal=true, meta=false
completed
2019-11-11 17:54:05,560 DEBUG org.apache.hadoop.hbase.master.ServerManager:
REPORT: Server server.example.com,16020,1573492804150 came back up, removed it
from the dead servers list
{code}
> HBCKServerCrashProcedure for 'Unknown Servers'
> ----------------------------------------------
>
> Key: HBASE-23282
> URL: https://issues.apache.org/jira/browse/HBASE-23282
> Project: HBase
> Issue Type: Bug
> Components: hbck2, proc-v2
> Affects Versions: 2.2.2
> Reporter: Michael Stack
> Priority: Major
>
> With an overdriving, sustained load, I can fairly easily manufacture an
> hbase:meta table that references servers that are no longer in the live list
> nor are members of deadservers; i.e. 'Unknown Servers'. The new 'HBCK
> Report' UI in Master has a section where it lists 'Unknown Servers' if any in
> hbase:meta.
> Once in this state, the repair is awkward. Our assign/unassign Procedure is
> particularly dogged about insisting that we confirm close/open of Regions
> when it is going about its business, which is well and good if the server is
> in the live/dead sets, but for an 'Unknown Server' we invariably end up
> trying to confirm against a no-longer-present server (more on this in
> follow-on issues).
> What is wanted is queuing of a ServerCrashProcedure for each 'Unknown
> Server'. It would split any WALs (there shouldn't be any if server was
> restarted) and ideally it would cancel out any assigns and reassign regions
> off the 'Unknown Server'. But the 'normal' SCP consults the in-memory
> cluster state to figure out which Regions were on the crashed server, which
> works fine for the usual case. 'Unknown Servers', however, have no state in
> the Master's in-memory Maps of Servers to Regions, nor in the DeadServers
> list.
> The suggestion here is that hbck2 be able to drive in a special SCP, one
> which would get the list of Regions by scanning hbase:meta rather than
> asking Master memory; an HBCKSCP.
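
The difference between the two lookups can be sketched roughly as below. This is only an illustration under stated assumptions, not the HBase implementation: the class, the method names, the region names and the plain Java maps standing in for Master memory and for hbase:meta rows are all hypothetical. It shows why a lookup against in-memory state reports no regions for an 'Unknown Server' while a scan over the meta rows still finds the references.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UnknownServerSketch {

    // Hypothetical stand-in for the Master's in-memory Map of Servers to Regions.
    // A server the Master does not know about simply has no entry.
    static List<String> regionsFromMasterMemory(
            Map<String, List<String>> serverToRegions, String serverName) {
        return serverToRegions.getOrDefault(serverName, Collections.emptyList());
    }

    // Hypothetical stand-in for scanning hbase:meta. Each entry models one row:
    // region name -> name of the server the row says hosts the region.
    static List<String> regionsFromMetaScan(
            Map<String, String> metaRows, String serverName) {
        List<String> regions = new ArrayList<>();
        for (Map.Entry<String, String> row : metaRows.entrySet()) {
            if (serverName.equals(row.getValue())) {
                regions.add(row.getKey());
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        String unknown = "server.example.com,16020,1573370369484";
        String live = "server.example.com,16020,1573492804150";

        // Master memory only knows the restarted (live) server instance.
        Map<String, List<String>> serverToRegions = new LinkedHashMap<>();
        serverToRegions.put(live, new ArrayList<>());

        // The meta rows still reference the old server instance (made-up regions).
        Map<String, String> metaRows = new LinkedHashMap<>();
        metaRows.put("region-a", unknown);
        metaRows.put("region-b", unknown);
        metaRows.put("region-c", live);

        System.out.println("from memory: "
                + regionsFromMasterMemory(serverToRegions, unknown).size());
        System.out.println("from meta scan: "
                + regionsFromMetaScan(metaRows, unknown));
    }
}
```

In this toy setup the memory-based lookup has nothing to reassign, which mirrors the "had 0 regions" line in the log above, whereas the scan-based lookup returns the regions whose rows still point at the old server instance.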