[jira] [Commented] (HBASE-23282) HBCKServerCrashProcedure for 'Unknown Servers'

Fabrice Rabaute (Jira) Wed, 12 Feb 2020 19:23:11 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-23282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035882#comment-17035882
 ]


Fabrice Rabaute commented on HBASE-23282:
-----------------------------------------

Yes, COLUMN CELL info is for region 353ab75c788cd0f77027706900453c49.

 

So this region was listed as one of the regions under "Unknown Servers".

While the cluster was running HBase 2.2.1, I ran scheduleProcedureRecoveries 
with hbck2 and it failed with a Java exception, the same one that was expected 
before the fix in 2.2.3.

So I upgraded the cluster to 2.2.3. Then I could run 
scheduleProcedureRecoveries with hbck2. This time it did not fail (no Java 
exception, and the exit status was 0), but nothing happened. No change.

I tried several times. I also restarted the master several times (to be sure 
a cache was not part of the problem), but the SCP still ran without doing 
anything for this server.

Then I tried to fix it a different way, by doing a "put" on the region's row 
in hbase:meta to override the entry. I restarted the master, which then 
started to crash :(

So I did a "deleteall" for this region in hbase:meta. After that the master 
could start again. The region is now gone from hbase:meta but still in HDFS, 
so I suppose I will recover it with the bulk load process.

The Unknown Server for this region disappeared from the hbck page, which is 
good, but only because I deleted the region info in hbase:meta. I don't think 
that's the right way to do it; I probably did it wrong.

As I don't have "Unknown Servers" anymore, I cannot reproduce this particular 
case.

 

> HBCKServerCrashProcedure for 'Unknown Servers'
> ----------------------------------------------
>
>                 Key: HBASE-23282
>                 URL: https://issues.apache.org/jira/browse/HBASE-23282
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2, proc-v2
>    Affects Versions: 2.2.2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.2.3
>
>
> With an overdriving, sustained load, I can fairly easily manufacture an 
> hbase:meta table that references servers that are no longer in the live list 
> nor are members of deadservers; i.e. 'Unknown Servers'.  The new 'HBCK 
> Report' UI in Master has a section where it lists 'Unknown Servers' if any in 
> hbase:meta.
> Once in this state, the repair is awkward. Our assign/unassign Procedure is 
> particularly dogged about insisting that we confirm close/open of Regions 
> when it is going about its business which is well and good if server is in 
> live/dead sets but when an 'Unknown Server', we invariably end up trying to 
> confirm against a no-longer present server (More on this in follow-on 
> issues).
> What is wanted is queuing of a ServerCrashProcedure for each 'Unknown 
> Server'. It would split any WALs (there shouldn't be any if server was 
> restarted) and ideally it would cancel out any assigns and reassign regions 
> off the 'Unknown Server'.  But the 'normal' SCP consults the in-memory 
> cluster state figuring what Regions were on the crashed server... And 
> 'Unknown Servers' don't have state in in-master memory Maps of Servers to 
> Regions or  in DeadServers list which works fine for the usual case.
> Suggestion here is that hbck2 be able to drive in a special SCP, one which 
> would get list of Regions by scanning hbase:meta rather than asking Master 
> memory; an HBCKSCP.
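
To make the suggestion in the description above concrete, here is a minimal 
sketch in plain Python of the idea behind it: take the region-to-server 
references found in hbase:meta, and group regions by any server that is in 
neither the live set nor the dead set. This is only a model of the logic, not 
HBase code. The function name, the dict layout, and the sample server and 
region names are all hypothetical.

```python
from collections import defaultdict


def find_unknown_servers(meta_rows, live_servers, dead_servers):
    """Group regions by 'Unknown Server'.

    meta_rows maps a region name to the server name recorded for it
    (a stand-in for what a scan of hbase:meta would return; hypothetical
    layout). A server is 'unknown' if it is neither live nor dead.
    """
    known = set(live_servers) | set(dead_servers)
    regions_by_unknown = defaultdict(list)
    for region, server in meta_rows.items():
        # Skip regions with no server recorded at all.
        if server is not None and server not in known:
            regions_by_unknown[server].append(region)
    return dict(regions_by_unknown)


# Hypothetical sample data, for illustration only.
meta_rows = {
    "region-a": "rs1.example.org,16020,1581500000000",
    "region-b": "rs2.example.org,16020,1581400000000",
    "region-c": "rs3.example.org,16020,1581000000000",
}
live = {"rs1.example.org,16020,1581500000000"}
dead = {"rs2.example.org,16020,1581400000000"}

unknown = find_unknown_servers(meta_rows, live, dead)
for server, regions in unknown.items():
    print(server, regions)
```

In this model, the regions listed for each unknown server are the ones a 
meta-driven SCP would have to reassign, since the Master's in-memory maps hold 
nothing for that server.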



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
