[
https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760706#comment-16760706
]
Wellington Chevreuil commented on HBASE-21843:
----------------------------------------------
Hey [~Apache9],
{quote}Do you see the update meta log in master log that we update meta for the
given regions?
{quote}
I guess so, yes. Sorry for the delay in responding and the verbosity on this
comment. I guess it's worth having a second pair of eyes to review the logs on
this:
1) Cluster is restarted. Master log:
{noformat}
2019-01-31 16:30:56,685 INFO org.apache.hadoop.hbase.master.HMaster: STARTING
service HMaster
{noformat}
2) SCP is started for RS "wellington-2.example.com,22101,1548980456156":
{noformat}
2019-01-31 16:32:56,931 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=3,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=wellington-2.example.com,22101,1548980456156, splitWal=true, meta=false
{noformat}
3) This RS had some regions assigned to it. Assign sub procs are created to
re-assign its regions:
{noformat}
2019-01-31 16:33:08,884 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized
subprocedures=[{pid=5, ppid=3, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:namespace,
region=06f8b38a9d971443ea8afcc87ec249ed}, {pid=6, ppid=3,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=training:test1,
region=7af296556834a9be9ad12f71e92437eb}, {pid=7, ppid=3,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=training:test1,
region=9e44b81a652ed371f6734ad9b7684018}, {pid=8, ppid=3,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=cluster_test,
region=2e6ba53254da5d950018d9c02e20aa0f}, {pid=9, ppid=3,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=cluster_test,
region=ef0a7c686ae29be3ffe54fdfb9c079b1}]
{noformat}
4) And here are logs from AM loadMeta() method, showing some regions for
"wellington-2.example.com,22101,1548980456156". Focusing on region
"06f8b38a9d971443ea8afcc87ec249ed", from namespace table, which will be handled
by Assign sub proc id=5:
{noformat}
2019-01-31 16:33:08,651 INFO
org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta
entry region=06f8b38a9d971443ea8afcc87ec249ed, regionState=OPEN,
lastHost=wellington-2.example.com,22101,1548980456156,
regionLocation=wellington-2.example.com,22101,1548980456156, openSeqNum=23
{noformat}
5) Here is the splitlog for that RS, where its wal dir is gone after this point:
{noformat}
2019-01-31 16:33:08,763 INFO org.apache.hadoop.hbase.master.SplitLogManager:
hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-2.example.com,22101,1548980456156-splitting
dir is empty, no logs to split.
2019-01-31 16:33:08,769 INFO org.apache.hadoop.hbase.master.SplitLogManager:
Finished splitting (more than or equal to) 0 bytes in 0 log files in
[hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-2.example.com,22101,1548980456156-splitting]
in 0ms
{noformat}
6) The Assign sub proc id 5 starting for region
06f8b38a9d971443ea8afcc87ec249ed:
{noformat}
2019-01-31 16:33:08,922 INFO
org.apache.hadoop.hbase.master.assignment.AssignProcedure: Starting pid=5,
ppid=3, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure
table=hbase:namespace, region=06f8b38a9d971443ea8afcc87ec249ed; rit=OFFLINE,
location=wellington-2.example.com,22101,1548980456156; forceNewPlan=false,
retain=true
{noformat}
7) We can see proc id 5 opening the region at new server start code
1548981048660:
{noformat}
2019-01-31 16:33:09,076 INFO
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=5 updating
hbase:meta row=06f8b38a9d971443ea8afcc87ec249ed, regionState=OPENING,
regionLocation=wellington-2.example.com,22101,1548981048660
{noformat}
8) Skipping some intermediate status for proc id 5 and jumping to its final
logs, we see it updating meta with region 06f8b38a9d971443ea8afcc87ec249ed as
OPEN at RS with startcode 1548981048660:
{noformat}
2019-01-31 16:33:11,316 INFO
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=5 updating
hbase:meta row=06f8b38a9d971443ea8afcc87ec249ed, regionState=OPEN,
openSeqNum=26, regionLocation=wellington-2.example.com,22101,1548981048660
{noformat}
9) And here the final message for proc id 5 saying it has succeeded:
{noformat}
2019-01-31 16:33:11,388 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=5, ppid=3,
state=SUCCESS; AssignProcedure table=hbase:namespace,
region=06f8b38a9d971443ea8afcc87ec249ed in 2.4550sec, unfinishedSiblingCount=3
{noformat}
10) The SCP proc id 3 also finishes as success. I guess this answers [~stack]
question on why the SCP wasn't re-ran upon restart:
{noformat}
2019-01-31 16:33:12,191 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3,
state=SUCCESS; ServerCrashProcedure
server=wellington-2.example.com,22101,1548980456156, splitWal=true, meta=false
in 16.3190sec
{noformat}
11) Few seconds later HDFS enters safemode, then master aborts around 10
minutes later:
{noformat}
2019-01-31 16:33:23,283 WARN org.apache.hadoop.hdfs.client.impl.LeaseRenewer:
Failed to renew lease for [DFSClient_NONMAPREDUCE_-1362593737_1] for 30
seconds. Will retry shortly ...
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
Cannot renew lease for DFSClient_NONMAPREDUCE_-1362593737_1. Name node is in
safe mode.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe
mode off. NamenodeHostName:wellington-1.example.com
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1448)
...
2019-01-31 16:42:44,878 ERROR org.apache.hadoop.hbase.master.HMaster: *****
ABORTING master wellington-1.example.com,22001,1548981057654: Shutting down
HBase cluster: file system not available *****
{noformat}
12) Master is restarted on the next day:
{noformat}
2019-02-01 13:59:22,479 INFO org.apache.hadoop.hbase.master.HMaster: STARTING
service HMaster
{noformat}
13) An SCP is started to RS startcode 1548981048660:
{noformat}
2019-02-01 13:59:38,437 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=24,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=wellington-2.example.com,22101,1548981048660, splitWal=true, meta=false
{noformat}
14) Split logs shows some problems:
{noformat}
2019-02-01 13:59:38,594 WARN org.apache.hadoop.hbase.master.SplitLogManager:
Returning success without actually splitting and deleting all the log files in
path
hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting:
[HdfsLocatedFileStatus{path=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234;
isDirectory=false; length=24420; replication=3; blocksize=268435456;
modification_time=1549058224769; access_time=1548981183243; owner=hbase;
group=hbase; permission=rw-r--r--; isSymlink=false; hasAcl=false;
isEncrypted=true; isErasureCoded=false},
HdfsLocatedFileStatus{path=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.regiongroup-0.1548981180937;
isDirectory=false; length=2628; replication=3; blocksize=268435456;
modification_time=1549058224751; access_time=1548981180956; owner=hbase;
group=hbase; permission=rw-r--r--; isSymlink=false; hasAcl=false;
isEncrypted=true; isErasureCoded=false}]
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException):
`/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting is non
empty': Directory is not empty
…
2019-02-01 13:59:38,619 INFO org.apache.hadoop.hbase.master.SplitLogManager:
Finished splitting (more than or equal to) 0 bytes in 0 log files in
[hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting]
in 0ms
{noformat}
15) Then loadMeta() logs for region 06f8b38a9d971443ea8afcc87ec249ed show it's
still assigned to old start code1548980456156, when it actually should be
1548981048660, per the logs seen in bullet #8:
{noformat}
2019-02-01 13:59:50,249 INFO
org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta
entry region=06f8b38a9d971443ea8afcc87ec249ed, regionState=OPEN,
lastHost=wellington-2.example.com,22101,1548980456156,
regionLocation=wellington-2.example.com,22101,1548980456156, openSeqNum=23
{noformat}
16) Also noticed the SCP id 24 for startcode 1548981048660 does not find any
regions to assign, only other log message mentioning it after it got created is
the one saying it’s finished:
{noformat}
2019-02-01 13:59:52,071 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=24,
state=SUCCESS; ServerCrashProcedure
server=wellington-2.example.com,22101,1548981048660, splitWal=true, meta=false
in 14.5080sec
{noformat}
17) At this point, region will never come online, and since it’s a namespace
table region, whole master will not be operational, with such warns being
logged periodically (note the old startcode 1548980456156 being referred):
{noformat}
2019-02-01 13:59:53,688 WARN org.apache.hadoop.hbase.master.HMaster:
hbase:namespace,,1548263896028.06f8b38a9d971443ea8afcc87ec249ed. is NOT online;
state={06f8b38a9d971443ea8afcc87ec249ed state=OPEN, ts=1549058390249,
server=wellington-2.example.com,22101,1548980456156};
ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern
until region onlined.
2019-02-01 13:59:57,688 WARN org.apache.hadoop.hbase.master.HMaster:
hbase:namespace,,1548263896028.06f8b38a9d971443ea8afcc87ec249ed. is NOT online;
state={06f8b38a9d971443ea8afcc87ec249ed state=OPEN, ts=1549058390249,
server=wellington-2.example.com,22101,1548980456156};
ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern
until region onlined.
2019-02-01 14:00:05,689 WARN org.apache.hadoop.hbase.master.HMaster:
hbase:namespace,,1548263896028.06f8b38a9d971443ea8afcc87ec249ed. is NOT online;
state={06f8b38a9d971443ea8afcc87ec249ed state=OPEN, ts=1549058390249,
server=wellington-2.example.com,22101,1548980456156};
ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern
until region onlined.
{noformat}
> AM misses region assignment in catastrophic scenarios where RS assigned to
> the region in Meta does not have a WAL dir.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21843
> URL: https://issues.apache.org/jira/browse/HBASE-21843
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 3.0.0, 2.1.0, 2.2.0
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Major
> Attachments: HBASE-21843.master.001.patch
>
>
> A bit unusual, but managed to face this twice lately on both distributed and
> local standalone mode, on VMs. Somehow, after some VM pause/resume, got into
> a situation where regions on meta were assigned to a give RS startcode that
> had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no
> SCP is created to this RS startcode, leaving the region "open" on a dead
> server forever, in META.
> Could get this sorted by adding extra check on loadMeta, checking if the RS
> assigned to the region in meta is not online and doesn't have a WAL dir, then
> mark this region as offline.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)