[
https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760853#comment-16760853
]
Wellington Chevreuil commented on HBASE-21843:
----------------------------------------------
Hey [~Apache9],
Yeah, it's really odd that the meta table gets recovered to a state prior to that
SCP's updates. To answer your previous questions, I had to go for another round of
log reading. The suspicious meta WAL split happened on a third server, RS
wellington-3, and some edits were skipped, despite it reporting the split as
successful. Please find all my answers below (spoiler alert: #3 and #5 may be
the most revealing):
1)
{quote}So why your HDFS cluster entered the safe mode?
{quote}
I manually put HDFS into safemode to test another behaviour.
2)
{quote}Did all the region servers also crash and restart?
{quote}
Yes. Since HDFS was left in safemode for a while, eventually all RSes crashed
when some internal feature that relies on HDFS writes tried to write a file
(log rolling, for instance).
3)
{quote}Which region server hosted the meta region when you updating it?
{quote}
Meta was on wellington-4.example.com, and had been reopened there. On the restart
at "2019-01-31 16:30:56,685":
{noformat}
2019-01-31 16:32:56,931 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=1,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=wellington-4.example.com,22101,1548980460763, splitWal=true, meta=true
...
2019-01-31 16:32:57,521 INFO
org.apache.hadoop.hbase.master.assignment.AssignProcedure: Starting pid=4,
ppid=1, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure
table=hbase:meta, region=1588230740; rit=OFFLINE,
location=wellington-4.example.com,22101,1548980460763; forceNewPlan=false,
retain=true
...
2019-01-31 16:33:04,814 INFO
org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta
(replicaId=0) location in ZooKeeper as
wellington-4.example.com,22101,1548981048686
2019-01-31 16:33:04,871 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure
pid=4, resume processing parent pid=1, state=RUNNABLE:SERVER_CRASH_GET_REGIONS;
ServerCrashProcedure server=wellington-4.example.com,22101,1548980460763,
splitWal=true, meta=true
2019-01-31 16:33:04,873 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=4, ppid=1,
state=SUCCESS; AssignProcedure table=hbase:meta, region=1588230740 in
7.5720sec, unfinishedSiblingCount=0
2019-01-31 16:33:04,963 INFO
org.apache.hadoop.hbase.master.assignment.AssignmentManager: META REPORTED:
rit=OPEN, location=wellington-4.example.com,22101,1548981048686,
table=hbase:meta, region=1588230740
...
{noformat}
Now this is interesting: wellington-4 crashes, and an SCP is triggered for
wellington-4 on the master, just prior to the master abort:
{noformat}
2019-01-31 16:42:44,842 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=23,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=wellington-4.example.com,22101,1548981048686, splitWal=true, meta=true
2019-01-31 16:42:44,878 ERROR org.apache.hadoop.hbase.master.HMaster: *****
ABORTING master wellington-1.example.com,22001,1548981057654: Shutting down
HBase cluster: file system not available *****
java.io.IOException: File system is in safemode, it can't be written now
{noformat}
Since HDFS was already in safe mode here, I guess any attempt to persist this
SCP's state on the masterprocwal would fail.
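Just to spell out the assumption there (a minimal sketch, not taken from these
logs; the path below is purely illustrative): while the NameNode is in safe mode,
any HDFS write attempt fails with an IOException, which is the same kind of error
the master aborted on above, so nothing new (procedure state included) can be
persisted.
{code:java}
// Minimal sketch, assuming only that the NameNode is in safe mode.
// The path is illustrative and not taken from this cluster.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SafeModeWriteProbe {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    try {
      // Any create/append while the NameNode is in safe mode is rejected,
      // so nothing new (procedure WAL entries included) can be written.
      fs.create(new Path("/tmp/safemode-probe")).close();
      System.out.println("write accepted, NameNode not in safe mode");
    } catch (IOException e) {
      // e.g. "Cannot create file ... Name node is in safe mode."
      System.out.println("write rejected: " + e.getMessage());
    }
  }
}
{code}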
Jumping to the master restart, the master tried to split the logs, but had some issues:
{noformat}
2019-02-01 13:59:38,594 WARN org.apache.hadoop.hbase.master.SplitLogManager:
Returning success without actually splitting and deleting all the log files in
path
hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting:
[HdfsLocatedFileStatus{path=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234;
isDirectory=false; length=24420; replication=3; blocksize=268435456;
modification_time=1549058224769; access_time=1548981183243; owner=hbase;
group=hbase; permission=rw-r--r--; isSymlink=false; hasAcl=false;
isEncrypted=true; isErasureCoded=false},
HdfsLocatedFileStatus{path=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.regiongroup-0.1548981180937;
isDirectory=false; length=2628; replication=3; blocksize=268435456;
modification_time=1549058224751; access_time=1548981180956; owner=hbase;
group=hbase; permission=rw-r--r--; isSymlink=false; hasAcl=false;
isEncrypted=true; isErasureCoded=false}]
…
2019-02-01 13:59:38,619 INFO org.apache.hadoop.hbase.master.SplitLogManager:
Finished splitting (more than or equal to) 0 bytes in 0 log files in
[hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting]
in 0ms
{noformat}
Nevertheless, the SCP proceeds with the meta AssignProcedure subprocedure:
{noformat}
2019-02-01 13:59:39,057 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized
subprocedures=[{pid=26, ppid=23, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:meta, region=1588230740}]
2019-02-01 13:59:39,261 INFO
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
for pid=26, ppid=23, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:meta, region=1588230740
2019-02-01 13:59:39,310 INFO
org.apache.hadoop.hbase.master.assignment.AssignProcedure: Starting pid=26,
ppid=23, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure
table=hbase:meta, region=1588230740; rit=OFFLINE,
location=wellington-4.example.com,22101,1548981048686; forceNewPlan=false,
retain=true
{noformat}
And it opens meta on wellington-4 again:
{noformat}
2019-02-01 13:59:48,542 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=26, ppid=23,
state=SUCCESS; AssignProcedure table=hbase:meta, region=1588230740 in
9.4610sec, unfinishedSiblingCount=0
2019-02-01 13:59:48,679 INFO
org.apache.hadoop.hbase.master.assignment.AssignmentManager: META REPORTED:
rit=OPEN, location=wellington-4.example.com,22101,1549058360899,
table=hbase:meta, region=1588230740
{noformat}
On wellington-4:
{noformat}
2019-02-01 13:59:46,807 INFO
org.apache.hadoop.hbase.regionserver.RSRpcServices: Open
hbase:meta,,1.1588230740
…
2019-02-01 13:59:48,429 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Opened 1588230740; next sequenceid=443
2019-02-01 13:59:48,465 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks for
hbase:meta,,1.1588230740
{noformat}
4)
{quote}Is it still alive after HDFS came back?
{quote}
No, as mentioned above, HDFS was left in safemode for a while.
5)
{quote}If it was also dead, then did the log splitting work finish normally?
{quote}
According to the master logs, yes, it did complete. I can see the log splitting
for wellington-4 startcode=1548981048686 being performed on wellington-3 and
finishing:
{noformat}
2019-02-01 13:59:51,649 INFO
org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Task
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
acquired by wellington-3.example.com,22101,1549058360167
…
2019-02-01 13:59:52,309 INFO
org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Task
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
entered state=DONE wellington-3.example.com,22101,1549058360167
…
2019-02-01 13:59:52,335 INFO
org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Done
splitting
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
…
2019-02-01 13:59:52,401 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=23,
state=SUCCESS; ServerCrashProcedure
server=wellington-4.example.com,22101,1548981048686, splitWal=true, meta=true
in 21hrs, 17mins, 7.661sec
…
{noformat}
In wellington-3's logs:
{noformat}
2019-02-01 13:59:51,650 INFO
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination: worker
wellington-3.example.com,22101,1549058360167 acquired task
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
2019-02-01 13:59:51,736 INFO org.apache.hadoop.hbase.wal.WALSplitter: Splitting
WAL=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234,
length=24420
2019-02-01 13:59:51,746 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover
lease on dfs file
hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234
2019-02-01 13:59:51,757 INFO org.apache.hadoop.hbase.util.FSHDFSUtils:
Recovered lease, attempt=0 on
file=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234
after 11ms
2019-02-01 13:59:52,203 INFO org.apache.hadoop.hbase.wal.WALSplitter: 3 split
writers finished; closing.
2019-02-01 13:59:52,299 INFO org.apache.hadoop.hbase.wal.WALSplitter: Rename
hdfs://wellington-1.example.com:8020/hbase/hbase/meta/1588230740/recovered.edits/0000000000000000445-wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234.temp
to
hdfs://wellington-1.example.com:8020/hbase/hbase/meta/1588230740/recovered.edits/0000000000000000481
2019-02-01 13:59:52,304 INFO org.apache.hadoop.hbase.wal.WALSplitter: Processed
37 edits across 1 regions; edits skipped=3; log
file=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234,
length=24420, corrupted=false, progress failed=false
2019-02-01 13:59:52,309 INFO
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination: successfully
transitioned task
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
to final state DONE wellington-3.example.com,22101,1549058360167
{noformat}
Although it says the split was a success, I noticed this "edits skipped=3;" in
the logs. Could these 3 skipped edits be the missing piece?
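For context on why a "successful" split can still skip edits, here is a
simplified, self-contained model (this is not the actual WALSplitter code, and
the sequence ids below are hypothetical): edits at or below the region's last
flushed sequence id are treated as already persisted in HFiles and are dropped
rather than written to recovered.edits.
{code:java}
// Simplified model only, not the actual WALSplitter code; sequence ids are
// hypothetical. One common reason for "edits skipped=N": edits at or below
// the region's last flushed sequence id are considered already persisted.
import java.util.List;

public class SkippedEditsModel {
  static int countSkipped(List<Long> walEditSeqIds, long lastFlushedSeqId) {
    int skipped = 0;
    for (long seqId : walEditSeqIds) {
      if (seqId <= lastFlushedSeqId) {
        skipped++; // assumed flushed already, dropped from recovered.edits
      }
    }
    return skipped;
  }

  public static void main(String[] args) {
    // Hypothetical numbers: 3 edits at or below the flush point get skipped.
    List<Long> seqIds = List.of(442L, 443L, 444L, 445L, 446L, 481L);
    System.out.println(countSkipped(seqIds, 444L)); // prints 3
  }
}
{code}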
6)
{quote}Did we have data lost when HDFS back from safe mode?
{quote}
No data loss, but no write activity was being performed when HDFS was put into
safemode. Once those regions were brought online (by restarting the master with
the lines from the patch), I could verify that all the tables had their previous
data.
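For completeness, here is a rough sketch of the kind of check the attached patch
is meant to add, as described in the issue summary below (all names here are
hypothetical, not the actual diff): while loading meta, a region assigned to a
server that is neither online nor has a WAL dir should be marked offline so it
can be reassigned.
{code:java}
// Rough sketch only, with hypothetical names; see the attached patch for the
// real change. The idea: if the startcode recorded in meta has no live server
// and no WAL dir, the "open" state in meta is stale and the region should be
// marked offline so the AssignmentManager can reassign it.
public class StaleMetaAssignmentCheck {
  static boolean shouldMarkRegionOffline(boolean serverIsOnline, boolean serverHasWalDir) {
    // In the real code these two signals would come from ServerManager and a
    // listing of the WAL directories on the filesystem.
    return !serverIsOnline && !serverHasWalDir;
  }

  public static void main(String[] args) {
    // The scenario from this issue: no live server, no WAL dir => mark offline.
    System.out.println(shouldMarkRegionOffline(false, false)); // true
  }
}
{code}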
> AM misses region assignment in catastrophic scenarios where RS assigned to
> the region in Meta does not have a WAL dir.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21843
> URL: https://issues.apache.org/jira/browse/HBASE-21843
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 3.0.0, 2.1.0, 2.2.0
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Major
> Attachments: HBASE-21843.master.001.patch
>
>
> A bit unusual, but I managed to face this twice lately, on both distributed and
> local standalone mode, on VMs. Somehow, after some VM pause/resume, I got into
> a situation where regions in meta were assigned to a given RS startcode that
> had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no
> SCP is created for this RS startcode, leaving the region "open" on a dead
> server forever, in META.
> This could be sorted by adding an extra check on loadMeta: if the RS
> assigned to the region in meta is not online and doesn't have a WAL dir, then
> mark this region as offline.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)