[jira] [Commented] (HBASE-21843) AM misses region assignment in catastrophic scenarios where RS assigned to the region in Meta does not have a WAL dir.

Wellington Chevreuil (JIRA) Tue, 05 Feb 2019 03:29:34 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760706#comment-16760706
 ]


Wellington Chevreuil commented on HBASE-21843:
----------------------------------------------

Hey [~Apache9],
{quote}Do you see the update meta log in master log that we update meta for the 
given regions?
{quote}
I guess so, yes. Sorry for the delay in responding and the verbosity on this 
comment. I guess it's worth having a second pair of eyes to review the logs on 
this:

1) Cluster is restarted. Master log:
{noformat}
2019-01-31 16:30:56,685 INFO org.apache.hadoop.hbase.master.HMaster: STARTING 
service HMaster
{noformat}
2) SCP is started for RS "wellington-2.example.com,22101,1548980456156":
{noformat}
2019-01-31 16:32:56,931 INFO 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=3, 
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
server=wellington-2.example.com,22101,1548980456156, splitWal=true, meta=false
{noformat}
3) This RS had some regions assigned to it. Assign sub procs are created to 
re-assign its regions:
{noformat}
2019-01-31 16:33:08,884 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
subprocedures=[{pid=5, ppid=3, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
AssignProcedure table=hbase:namespace, 
region=06f8b38a9d971443ea8afcc87ec249ed}, {pid=6, ppid=3, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=training:test1, 
region=7af296556834a9be9ad12f71e92437eb}, {pid=7, ppid=3, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=training:test1, 
region=9e44b81a652ed371f6734ad9b7684018}, {pid=8, ppid=3, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=cluster_test, 
region=2e6ba53254da5d950018d9c02e20aa0f}, {pid=9, ppid=3, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=cluster_test, 
region=ef0a7c686ae29be3ffe54fdfb9c079b1}]
{noformat}
4) And here are logs from AM loadMeta() method, showing some regions for 
"wellington-2.example.com,22101,1548980456156".  Focusing on region 
"06f8b38a9d971443ea8afcc87ec249ed", from namespace table, which will be handled 
by Assign sub proc id=5:
{noformat}
2019-01-31 16:33:08,651 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta 
entry region=06f8b38a9d971443ea8afcc87ec249ed, regionState=OPEN, 
lastHost=wellington-2.example.com,22101,1548980456156, 
regionLocation=wellington-2.example.com,22101,1548980456156, openSeqNum=23
{noformat}
5) Here is the splitlog for that RS, where its wal dir is gone after this point:
{noformat}
2019-01-31 16:33:08,763 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-2.example.com,22101,1548980456156-splitting
 dir is empty, no logs to split.
2019-01-31 16:33:08,769 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
Finished splitting (more than or equal to) 0 bytes in 0 log files in 
[hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-2.example.com,22101,1548980456156-splitting]
 in 0ms
{noformat}
6) The Assign sub proc id 5 starting for region 
06f8b38a9d971443ea8afcc87ec249ed:
{noformat}
2019-01-31 16:33:08,922 INFO 
org.apache.hadoop.hbase.master.assignment.AssignProcedure: Starting pid=5, 
ppid=3, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure 
table=hbase:namespace, region=06f8b38a9d971443ea8afcc87ec249ed; rit=OFFLINE, 
location=wellington-2.example.com,22101,1548980456156; forceNewPlan=false, 
retain=true
{noformat}
7) We can see proc id 5 opening the region at new server start code 
1548981048660:
{noformat}
2019-01-31 16:33:09,076 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=5 updating 
hbase:meta row=06f8b38a9d971443ea8afcc87ec249ed, regionState=OPENING, 
regionLocation=wellington-2.example.com,22101,1548981048660
{noformat}
8) Skipping some intermediate status for proc id 5 and jumping to its final 
logs, we see it updating meta with region 06f8b38a9d971443ea8afcc87ec249ed as 
OPEN at RS with startcode 1548981048660:
{noformat}
2019-01-31 16:33:11,316 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=5 updating 
hbase:meta row=06f8b38a9d971443ea8afcc87ec249ed, regionState=OPEN, 
openSeqNum=26, regionLocation=wellington-2.example.com,22101,1548981048660
{noformat}
9) And here the final message for proc id 5 saying it has succeeded:
{noformat}
2019-01-31 16:33:11,388 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=5, ppid=3, 
state=SUCCESS; AssignProcedure table=hbase:namespace, 
region=06f8b38a9d971443ea8afcc87ec249ed in 2.4550sec, unfinishedSiblingCount=3
{noformat}
10) The SCP proc id 3 also finishes as success. I guess this answers [~stack] 
question on why the SCP wasn't re-ran upon restart:
{noformat}
2019-01-31 16:33:12,191 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3, 
state=SUCCESS; ServerCrashProcedure 
server=wellington-2.example.com,22101,1548980456156, splitWal=true, meta=false 
in 16.3190sec
{noformat}
11) Few seconds later HDFS enters safemode, then master aborts around 10 
minutes later:
{noformat}
2019-01-31 16:33:23,283 WARN org.apache.hadoop.hdfs.client.impl.LeaseRenewer: 
Failed to renew lease for [DFSClient_NONMAPREDUCE_-1362593737_1] for 30 
seconds.  Will retry shortly ...
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
 Cannot renew lease for DFSClient_NONMAPREDUCE_-1362593737_1. Name node is in 
safe mode.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe 
mode off. NamenodeHostName:wellington-1.example.com
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1448)
...
2019-01-31 16:42:44,878 ERROR org.apache.hadoop.hbase.master.HMaster: ***** 
ABORTING master wellington-1.example.com,22001,1548981057654: Shutting down 
HBase cluster: file system not available *****
{noformat}
12) Master is restarted on the next day:
{noformat}
2019-02-01 13:59:22,479 INFO org.apache.hadoop.hbase.master.HMaster: STARTING 
service HMaster
{noformat}
13) An SCP is started to RS startcode 1548981048660:
{noformat}
2019-02-01 13:59:38,437 INFO 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=24, 
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
server=wellington-2.example.com,22101,1548981048660, splitWal=true, meta=false
{noformat}
14) Split logs shows some problems:
{noformat}
2019-02-01 13:59:38,594 WARN org.apache.hadoop.hbase.master.SplitLogManager: 
Returning success without actually splitting and deleting all the log files in 
path 
hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting:
 
[HdfsLocatedFileStatus{path=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234;
 isDirectory=false; length=24420; replication=3; blocksize=268435456; 
modification_time=1549058224769; access_time=1548981183243; owner=hbase; 
group=hbase; permission=rw-r--r--; isSymlink=false; hasAcl=false; 
isEncrypted=true; isErasureCoded=false}, 
HdfsLocatedFileStatus{path=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.regiongroup-0.1548981180937;
 isDirectory=false; length=2628; replication=3; blocksize=268435456; 
modification_time=1549058224751; access_time=1548981180956; owner=hbase; 
group=hbase; permission=rw-r--r--; isSymlink=false; hasAcl=false; 
isEncrypted=true; isErasureCoded=false}]
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException):
 `/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting is non 
empty': Directory is not empty
…
2019-02-01 13:59:38,619 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
Finished splitting (more than or equal to) 0 bytes in 0 log files in 
[hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting]
 in 0ms
{noformat}
15) Then loadMeta() logs for region 06f8b38a9d971443ea8afcc87ec249ed show it's 
still assigned to old start code1548980456156, when it actually should be 
1548981048660, per the logs seen in bullet #8:
{noformat}
2019-02-01 13:59:50,249 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta 
entry region=06f8b38a9d971443ea8afcc87ec249ed, regionState=OPEN, 
lastHost=wellington-2.example.com,22101,1548980456156, 
regionLocation=wellington-2.example.com,22101,1548980456156, openSeqNum=23
{noformat}
16) Also noticed the SCP id 24 for startcode 1548981048660 does not find any 
regions to assign, only other log message mentioning it after it got created is 
the one saying it’s finished:
{noformat}
2019-02-01 13:59:52,071 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=24, 
state=SUCCESS; ServerCrashProcedure 
server=wellington-2.example.com,22101,1548981048660, splitWal=true, meta=false 
in 14.5080sec
{noformat}
17) At this point, region will never come online, and since it’s a namespace 
table region, whole master will not be operational, with such warns being 
logged periodically (note the old startcode 1548980456156 being referred):
{noformat}
2019-02-01 13:59:53,688 WARN org.apache.hadoop.hbase.master.HMaster: 
hbase:namespace,,1548263896028.06f8b38a9d971443ea8afcc87ec249ed. is NOT online; 
state={06f8b38a9d971443ea8afcc87ec249ed state=OPEN, ts=1549058390249, 
server=wellington-2.example.com,22101,1548980456156}; 
ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern 
until region onlined.
2019-02-01 13:59:57,688 WARN org.apache.hadoop.hbase.master.HMaster: 
hbase:namespace,,1548263896028.06f8b38a9d971443ea8afcc87ec249ed. is NOT online; 
state={06f8b38a9d971443ea8afcc87ec249ed state=OPEN, ts=1549058390249, 
server=wellington-2.example.com,22101,1548980456156}; 
ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern 
until region onlined.
2019-02-01 14:00:05,689 WARN org.apache.hadoop.hbase.master.HMaster: 
hbase:namespace,,1548263896028.06f8b38a9d971443ea8afcc87ec249ed. is NOT online; 
state={06f8b38a9d971443ea8afcc87ec249ed state=OPEN, ts=1549058390249, 
server=wellington-2.example.com,22101,1548980456156}; 
ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern 
until region onlined.
{noformat}

> AM misses region assignment in catastrophic scenarios where RS assigned to 
> the region in Meta does not have a WAL dir.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21843
>                 URL: https://issues.apache.org/jira/browse/HBASE-21843
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0, 2.1.0, 2.2.0
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>         Attachments: HBASE-21843.master.001.patch
>
>
> A bit unusual, but managed to face this twice lately on both distributed and 
> local standalone mode, on VMs. Somehow, after some VM pause/resume, got into 
> a situation where regions on meta were assigned to a give RS startcode that 
> had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS 
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no 
> SCP is created to this RS startcode, leaving the region "open" on a dead 
> server forever, in META.
> Could get this sorted by adding extra check on loadMeta, checking if the RS 
> assigned to the region in meta is not online and doesn't have a WAL dir, then 
> mark this region as offline. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21843) AM misses region assignment in catastrophic scenarios where RS assigned to the region in Meta does not have a WAL dir.

Reply via email to