[jira] [Commented] (HBASE-23282) HBCKServerCrashProcedure for 'Unknown Servers'

Michael Stack (Jira) Tue, 12 Nov 2019 16:57:16 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-23282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972926#comment-16972926
 ]


Michael Stack commented on HBASE-23282:
---------------------------------------

This is hard to read but it illustrates the above. There ARE regions in 
hbase:meta that reference server.example.com, but the SCP run below doesn't 
find them (it reports 'had 0 regions'):
{code}
 2019-11-11 17:54:03,136 DEBUG 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Stored pid=442039, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,136 DEBUG 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0 
size=1) to run queue because: the exclusive lock is not held by anyone when 
adding pid=442039, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,138 DEBUG 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove 
ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0 
size=0) from run queue because: queue is empty after polling out pid=442039, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,138 DEBUG 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove 
ServerQueue(server.example.com,16020,1573370369484, xlock=true (442039) 
sharedLock=0 size=0) from run queue because: pid=442039, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=server.example.com,16020,1573370369484, splitWal=true, 
meta=false held exclusive lock
 2019-11-11 17:54:03,140 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
Started processing server.example.com,16020,1573370369484; numProcessing=1
 2019-11-11 17:54:03,140 INFO 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start 
pid=442039, state=RUNNABLE:SERVER_CRASH_START, locked=true; 
ServerCrashProcedure server=server.example.com,16020,1573370369484, 
splitWal=true, meta=false
 2019-11-11 17:54:03,140 DEBUG 
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
pid=442039, state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; 
ServerCrashProcedure server=server.example.com,16020,1573370369484, 
splitWal=true, meta=false as the 0th rollback step
 2019-11-11 17:54:03,142 INFO 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: 
server.example.com,16020,1573370369484 had 0 regions
 2019-11-11 17:54:03,142 DEBUG 
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; 
ServerCrashProcedure server=server.example.com,16020,1573370369484, 
splitWal=true, meta=false as the 1th rollback step
 2019-11-11 17:54:03,143 DEBUG 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Splitting WALs 
pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; 
ServerCrashProcedure server=server.example.com,16020,1573370369484, 
splitWal=true, meta=false
 2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.MasterWalManager: 
Log dir for server server.example.com,16020,1573370369484 does not exist
 2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
dead splitlog workers [server.example.com,16020,1573370369484]
 2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [] in 
0ms
 2019-11-11 17:54:03,145 DEBUG 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Done splitting 
WALs pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; 
ServerCrashProcedure server=server.example.com,16020,1573370369484, 
splitWal=true, meta=false
 2019-11-11 17:54:03,146 DEBUG 
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
pid=442039, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
ServerCrashProcedure server=server.example.com,16020,1573370369484, 
splitWal=true, meta=false as the 2th rollback step
 2019-11-11 17:54:03,147 DEBUG 
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
pid=442039, state=RUNNABLE:SERVER_CRASH_FINISH, locked=true; 
ServerCrashProcedure server=server.example.com,16020,1573370369484, 
splitWal=true, meta=false as the 3th rollback step
 2019-11-11 17:54:03,148 INFO 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: removed crashed 
server server.example.com,16020,1573370369484 after splitting done
 2019-11-11 17:54:03,149 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
Finished processing server.example.com,16020,1573370369484; numProcessing=0
 2019-11-11 17:54:03,149 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
Removed server.example.com,16020,1573370369484 ; numProcessing=0
 2019-11-11 17:54:03,149 DEBUG 
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
pid=442039, state=SUCCESS, locked=true; ServerCrashProcedure 
server=server.example.com,16020,1573370369484, splitWal=true, meta=false as the 
4th rollback step
 2019-11-11 17:54:03,151 DEBUG 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0 
size=0) to run queue because: pid=442039, state=SUCCESS; ServerCrashProcedure 
server=server.example.com,16020,1573370369484, splitWal=true, meta=false 
released exclusive lock
 2019-11-11 17:54:03,151 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=442039, 
state=SUCCESS; ServerCrashProcedure 
server=server.example.com,16020,1573370369484, splitWal=true, meta=false in 
115msec
 2019-11-11 17:54:03,151 DEBUG 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove 
ServerQueue(server.example.com,16020,1573370369484, xlock=true (442039) 
sharedLock=0 size=0) from run queue because: clean up server queue after 
pid=442039, state=SUCCESS; ServerCrashProcedure 
server=server.example.com,16020,1573370369484, splitWal=true, meta=false 
completed
 2019-11-11 17:54:05,560 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
REPORT: Server server.example.com,16020,1573492804150 came back up, removed it 
from the dead servers list
{code}

> HBCKServerCrashProcedure for 'Unknown Servers'
> ----------------------------------------------
>
>                 Key: HBASE-23282
>                 URL: https://issues.apache.org/jira/browse/HBASE-23282
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2, proc-v2
>    Affects Versions: 2.2.2
>            Reporter: Michael Stack
>            Priority: Major
>
> With an overdriving, sustained load, I can fairly easily manufacture an 
> hbase:meta table that references servers that are neither in the live list 
> nor members of deadservers; i.e. 'Unknown Servers'. The new 'HBCK Report' 
> UI in Master has a section that lists any 'Unknown Servers' found in 
> hbase:meta.
> Once in this state, the repair is awkward. Our assign/unassign Procedure is 
> particularly dogged about insisting that we confirm close/open of Regions 
> when it is going about its business, which is well and good if the server is 
> in the live/dead sets, but for an 'Unknown Server' we invariably end up 
> trying to confirm against a no-longer-present server (more on this in 
> follow-on issues).
> What is wanted is queuing of a ServerCrashProcedure for each 'Unknown 
> Server'. It would split any WALs (there shouldn't be any if the server was 
> restarted) and ideally it would cancel out any assigns and reassign regions 
> off the 'Unknown Server'. But the 'normal' SCP consults the in-memory 
> cluster state to figure out what Regions were on the crashed server, which 
> works fine for the usual case. 'Unknown Servers' have no state in the 
> Master's in-memory Maps of Servers to Regions or in the DeadServers list.
> The suggestion here is that hbck2 be able to drive in a special SCP, an 
> HBCKSCP, which would get its list of Regions by scanning hbase:meta rather 
> than asking Master memory.
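
To make the difference between the two lookups concrete, here is a minimal, 
self-contained sketch. It does not use the actual HBase classes: MetaRow, 
regionsFromMemory and regionsFromMetaScan are hypothetical names, the region 
names are made up, and hbase:meta is modelled as a plain list of 
(region, server) pairs. It only illustrates why an in-memory lookup can come 
back empty for an 'Unknown Server' while a scan of the meta rows still finds 
the regions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnknownServerSketch {

    /** Hypothetical stand-in for one hbase:meta row: a region and the server it references. */
    public static final class MetaRow {
        final String regionName;
        final String serverName;

        public MetaRow(String regionName, String serverName) {
            this.regionName = regionName;
            this.serverName = serverName;
        }
    }

    /** Roughly what the 'normal' SCP relies on: the Master's in-memory server-to-regions map. */
    public static List<String> regionsFromMemory(Map<String, List<String>> serverToRegions,
                                                 String server) {
        List<String> regions = serverToRegions.get(server);
        return regions == null ? new ArrayList<>() : regions;
    }

    /** What the proposed HBCKSCP would do instead: walk every meta row and match on server. */
    public static List<String> regionsFromMetaScan(List<MetaRow> metaRows, String server) {
        List<String> regions = new ArrayList<>();
        for (MetaRow row : metaRows) {
            if (row.serverName.equals(server)) {
                regions.add(row.regionName);
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        String unknown = "server.example.com,16020,1573370369484";

        // The Master has no in-memory state for an 'Unknown Server'.
        Map<String, List<String>> serverToRegions = new HashMap<>();

        // ...but rows in (our toy) hbase:meta still reference it.
        List<MetaRow> meta = new ArrayList<>();
        meta.add(new MetaRow("region-a", unknown));
        meta.add(new MetaRow("region-b", "other.example.com,16020,1573492804150"));
        meta.add(new MetaRow("region-c", unknown));

        System.out.println("in-memory: " + regionsFromMemory(serverToRegions, unknown));
        System.out.println("meta scan: " + regionsFromMetaScan(meta, unknown));
    }
}
```

In this toy setup the in-memory lookup returns an empty list, mirroring the 
'had 0 regions' line in the log above, while the meta scan returns the 
regions that still reference the server.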



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
