[
https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760853#comment-16760853
]
Wellington Chevreuil commented on HBASE-21843:
----------------------------------------------
Hey [~Apache9],
Yeah, it's really odd that the meta table gets recovered to a state prior to that
SCP's updates. To answer your previous questions, I had to go for another round of
log reading. The suspicious meta WAL split happened on a third server, RS
wellington-3, and some edits were skipped, despite it reporting the split as
successful. Please find all my answers below (spoiler alert: #3 and #5 may be
the most revealing):
1)
{quote}So why your HDFS cluster entered the safe mode?
{quote}
I manually put HDFS into safemode to test another behaviour.
2)
{quote}Did all the region servers also crash and restart?
{quote}
Yes. Since HDFS was left in safemode for a while, eventually all RSes crashed
when some internal feature that relies on HDFS writes tried to write a file
(log rolling, for instance).
3)
{quote}Which region server hosted the meta region when you updating it?
{quote}
Meta was on wellington-4.example.com, and had been reopened there. On the restart
at "2019-01-31 16:30:56,685":
{noformat}
2019-01-31 16:32:56,931 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=1,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=wellington-4.example.com,22101,1548980460763, splitWal=true, meta=true
...
2019-01-31 16:32:57,521 INFO
org.apache.hadoop.hbase.master.assignment.AssignProcedure: Starting pid=4,
ppid=1, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure
table=hbase:meta, region=1588230740; rit=OFFLINE,
location=wellington-4.example.com,22101,1548980460763; forceNewPlan=false,
retain=true
...
2019-01-31 16:33:04,814 INFO
org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta
(replicaId=0) location in ZooKeeper as
wellington-4.example.com,22101,1548981048686
2019-01-31 16:33:04,871 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure
pid=4, resume processing parent pid=1, state=RUNNABLE:SERVER_CRASH_GET_REGIONS;
ServerCrashProcedure server=wellington-4.example.com,22101,1548980460763,
splitWal=true, meta=true
2019-01-31 16:33:04,873 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=4, ppid=1,
state=SUCCESS; AssignProcedure table=hbase:meta, region=1588230740 in
7.5720sec, unfinishedSiblingCount=0
2019-01-31 16:33:04,963 INFO
org.apache.hadoop.hbase.master.assignment.AssignmentManager: META REPORTED:
rit=OPEN, location=wellington-4.example.com,22101,1548981048686,
table=hbase:meta, region=1588230740
...
{noformat}
Now this is interesting: wellington-4 crashes, and an SCP is triggered for
wellington-4 on the master, just prior to the master abort:
{noformat}
2019-01-31 16:42:44,842 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=23,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=wellington-4.example.com,22101,1548981048686, splitWal=true, meta=true
2019-01-31 16:42:44,878 ERROR org.apache.hadoop.hbase.master.HMaster: *****
ABORTING master wellington-1.example.com,22001,1548981057654: Shutting down
HBase cluster: file system not available *****
java.io.IOException: File system is in safemode, it can't be written now
{noformat}
Since HDFS was already in safe mode here, I guess any attempt to persist this
SCP's state on the masterprocwal would fail.
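Just to spell out the assumption there (a minimal sketch, not taken from these
logs; the path below is purely illustrative): while the NameNode is in safe mode,
any HDFS write attempt fails with an IOException, which is the same kind of error
the master aborted on above, so nothing new (procedure state included) can be
persisted.
{code:java}
// Minimal sketch, assuming only that the NameNode is in safe mode.
// The path is illustrative and not taken from this cluster.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SafeModeWriteProbe {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    try {
      // Any create/append while the NameNode is in safe mode is rejected,
      // so nothing new (procedure WAL entries included) can be written.
      fs.create(new Path("/tmp/safemode-probe")).close();
      System.out.println("write accepted, NameNode not in safe mode");
    } catch (IOException e) {
      // e.g. "Cannot create file ... Name node is in safe mode."
      System.out.println("write rejected: " + e.getMessage());
    }
  }
}
{code}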
Jumping to the master restart, the master tried to split the logs, but had some issues:
{noformat}
2019-02-01 13:59:38,594 WARN org.apache.hadoop.hbase.master.SplitLogManager:
Returning success without actually splitting and deleting all the log files in
path
hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting:
[HdfsLocatedFileStatus{path=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234;
isDirectory=false; length=24420; replication=3; blocksize=268435456;
modification_time=1549058224769; access_time=1548981183243; owner=hbase;
group=hbase; permission=rw-r--r--; isSymlink=false; hasAcl=false;
isEncrypted=true; isErasureCoded=false},
HdfsLocatedFileStatus{path=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.regiongroup-0.1548981180937;
isDirectory=false; length=2628; replication=3; blocksize=268435456;
modification_time=1549058224751; access_time=1548981180956; owner=hbase;
group=hbase; permission=rw-r--r--; isSymlink=false; hasAcl=false;
isEncrypted=true; isErasureCoded=false}]
…
2019-02-01 13:59:38,619 INFO org.apache.hadoop.hbase.master.SplitLogManager:
Finished splitting (more than or equal to) 0 bytes in 0 log files in
[hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting]
in 0ms
{noformat}
Nevertheless, the SCP proceeds with the meta AssignProcedure subprocedure:
{noformat}
2019-02-01 13:59:39,057 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized
subprocedures=[{pid=26, ppid=23, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:meta, region=1588230740}]
2019-02-01 13:59:39,261 INFO
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
for pid=26, ppid=23, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:meta, region=1588230740
2019-02-01 13:59:39,310 INFO
org.apache.hadoop.hbase.master.assignment.AssignProcedure: Starting pid=26,
ppid=23, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure
table=hbase:meta, region=1588230740; rit=OFFLINE,
location=wellington-4.example.com,22101,1548981048686; forceNewPlan=false,
retain=true
{noformat}
And it opens meta on wellington-4 again:
{noformat}
2019-02-01 13:59:48,542 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=26, ppid=23,
state=SUCCESS; AssignProcedure table=hbase:meta, region=1588230740 in
9.4610sec, unfinishedSiblingCount=0
2019-02-01 13:59:48,679 INFO
org.apache.hadoop.hbase.master.assignment.AssignmentManager: META REPORTED:
rit=OPEN, location=wellington-4.example.com,22101,1549058360899,
table=hbase:meta, region=1588230740
{noformat}
On wellington-4:
{noformat}
2019-02-01 13:59:46,807 INFO
org.apache.hadoop.hbase.regionserver.RSRpcServices: Open
hbase:meta,,1.1588230740
…
2019-02-01 13:59:48,429 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Opened 1588230740; next sequenceid=443
2019-02-01 13:59:48,465 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks for
hbase:meta,,1.1588230740
{noformat}
4)
{quote}Is it still alive after HDFS came back?
{quote}
No, as mentioned above, HDFS was left in safemode for a while.
5)
{quote}If it was also dead, then did the log splitting work finish normally?
{quote}
According to the master logs, yes, it did complete. I can see the log splitting
for wellington-4 startcode=1548981048686 being performed on wellington-3 and
finishing:
{noformat}
2019-02-01 13:59:51,649 INFO
org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Task
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
acquired by wellington-3.example.com,22101,1549058360167
…
2019-02-01 13:59:52,309 INFO
org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Task
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
entered state=DONE wellington-3.example.com,22101,1549058360167
…
2019-02-01 13:59:52,335 INFO
org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Done
splitting
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
…
2019-02-01 13:59:52,401 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=23,
state=SUCCESS; ServerCrashProcedure
server=wellington-4.example.com,22101,1548981048686, splitWal=true, meta=true
in 21hrs, 17mins, 7.661sec
…
{noformat}
In wellington-3's logs:
{noformat}
2019-02-01 13:59:51,650 INFO
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination: worker
wellington-3.example.com,22101,1549058360167 acquired task
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
2019-02-01 13:59:51,736 INFO org.apache.hadoop.hbase.wal.WALSplitter: Splitting
WAL=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234,
length=24420
2019-02-01 13:59:51,746 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover
lease on dfs file
hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234
2019-02-01 13:59:51,757 INFO org.apache.hadoop.hbase.util.FSHDFSUtils:
Recovered lease, attempt=0 on
file=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234
after 11ms
2019-02-01 13:59:52,203 INFO org.apache.hadoop.hbase.wal.WALSplitter: 3 split
writers finished; closing.
2019-02-01 13:59:52,299 INFO org.apache.hadoop.hbase.wal.WALSplitter: Rename
hdfs://wellington-1.example.com:8020/hbase/hbase/meta/1588230740/recovered.edits/0000000000000000445-wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234.temp
to
hdfs://wellington-1.example.com:8020/hbase/hbase/meta/1588230740/recovered.edits/0000000000000000481
2019-02-01 13:59:52,304 INFO org.apache.hadoop.hbase.wal.WALSplitter: Processed
37 edits across 1 regions; edits skipped=3; log
file=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234,
length=24420, corrupted=false, progress failed=false
2019-02-01 13:59:52,309 INFO
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination: successfully
transitioned task
/hbase/splitWAL/WALs%2Fwellington-4.example.com%2C22101%2C1548981048686-splitting%2Fwellington-4.example.com%252C22101%252C1548981048686.wellington-4.example.com%252C22101%252C1548981048686.meta.regiongroup-0.1548981183234
to final state DONE wellington-3.example.com,22101,1549058360167
{noformat}
Although it says the split was a success, I noticed this "edits skipped=3;" in
the logs. Could these 3 skipped edits be the missing piece?
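For context on why a "successful" split can still skip edits, here is a
simplified, self-contained model (this is not the actual WALSplitter code, and
the sequence ids below are hypothetical): edits at or below the region's last
flushed sequence id are treated as already persisted in HFiles and are dropped
rather than written to recovered.edits.
{code:java}
// Simplified model only, not the actual WALSplitter code; sequence ids are
// hypothetical. One common reason for "edits skipped=N": edits at or below
// the region's last flushed sequence id are considered already persisted.
import java.util.List;

public class SkippedEditsModel {
  static int countSkipped(List<Long> walEditSeqIds, long lastFlushedSeqId) {
    int skipped = 0;
    for (long seqId : walEditSeqIds) {
      if (seqId <= lastFlushedSeqId) {
        skipped++; // assumed flushed already, dropped from recovered.edits
      }
    }
    return skipped;
  }

  public static void main(String[] args) {
    // Hypothetical numbers: 3 edits at or below the flush point get skipped.
    List<Long> seqIds = List.of(442L, 443L, 444L, 445L, 446L, 481L);
    System.out.println(countSkipped(seqIds, 444L)); // prints 3
  }
}
{code}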
6)
{quote}Did we have data lost when HDFS back from safe mode?
{quote}
No data loss, but no write activity was being performed when HDFS was put into
safemode. Once those regions were brought online (by restarting the master with
the lines from the patch), I could verify that all the tables had their previous
data.
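For completeness, here is a rough sketch of the kind of check the attached patch
is meant to add, as described in the issue summary below (all names here are
hypothetical, not the actual diff): while loading meta, a region assigned to a
server that is neither online nor has a WAL dir should be marked offline so it
can be reassigned.
{code:java}
// Rough sketch only, with hypothetical names; see the attached patch for the
// real change. The idea: if the startcode recorded in meta has no live server
// and no WAL dir, the "open" state in meta is stale and the region should be
// marked offline so the AssignmentManager can reassign it.
public class StaleMetaAssignmentCheck {
  static boolean shouldMarkRegionOffline(boolean serverIsOnline, boolean serverHasWalDir) {
    // In the real code these two signals would come from ServerManager and a
    // listing of the WAL directories on the filesystem.
    return !serverIsOnline && !serverHasWalDir;
  }

  public static void main(String[] args) {
    // The scenario from this issue: no live server, no WAL dir => mark offline.
    System.out.println(shouldMarkRegionOffline(false, false)); // true
  }
}
{code}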
> AM misses region assignment in catastrophic scenarios where RS assigned to
> the region in Meta does not have a WAL dir.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21843
> URL: https://issues.apache.org/jira/browse/HBASE-21843
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 3.0.0, 2.1.0, 2.2.0
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Major
> Attachments: HBASE-21843.master.001.patch
>
>
> A bit unusual, but I managed to face this twice lately, on both distributed and
> local standalone mode, on VMs. Somehow, after some VM pause/resume, I got into
> a situation where regions in meta were assigned to a given RS startcode that
> had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no
> SCP is created for this RS startcode, leaving the region "open" on a dead
> server forever, in META.
> This could be sorted by adding an extra check on loadMeta: if the RS
> assigned to the region in meta is not online and doesn't have a WAL dir, then
> mark this region as offline.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)