[ https://issues.apache.org/jira/browse/HBASE-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429184#comment-16429184 ]

Andrew Purtell edited comment on HBASE-14729 at 4/7/18 1:26 AM:
----------------------------------------------------------------

I'm seeing this in 1.3 and 1.4 test clusters, even with HBASE-14223 applied, so 
there is more going on and this isn't strictly a duplicate of that JIRA.

I've only started looking at this, but when it occurs the failure looks 
like this:

ITBLL is running with the serverKilling policy.

A log split starts. It runs with the NON_META_FILTER PathFilter because the RS 
was not the last host of META (see the filter sketch after the log excerpts below):

{noformat}
2018-04-07 00:20:32,806 INFO  [ProcedureExecutor-2] master.SplitLogManager: 
Started splitting 5 logs in 
[hdfs://node-1.cluster/hbase/WALs/node-3.cluster,16020,1523058606434-splitting] 
for [node-3.cluster,16020,1523058606434]

...

2018-04-07 00:20:38,566 INFO  [ProcedureExecutor-2] master.SplitLogManager: 
finished splitting (more than or equal to) 513278945 bytes in 5 log files in 
[hdfs://node-1.cluster/hbase/WALs/node-3.cluster,16020,1523058606434-splitting] 
in 5760ms
{noformat}

However, within the ellipsis we have this:

{noformat}
2018-04-07 00:20:38,564 WARN  [ProcedureExecutor-2] master.SplitLogManager: 
Returning success without actually splitting and deleting all the log files in 
path 
hdfs://node-1.cluster/hbase/WALs/node-3.cluster,16020,1523058606434-splitting: 
[FileStatus{path=hdfs://node-1.cluster/hbase/WALs/node-3.cluster,16020,1523058606434-splitting/node-3.cluster%2C16020%2C1523058606434.meta.1523058616638.meta;
 isDirectory=false; length=83; replication=3; blocksize=134217728; 
modification_time=1523058616640; access_time=1523058616640; owner=root; 
group=supergroup; permission=rw-r--r--; isSymlink=false}]
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException):
 `/hbase/WALs/node-3.cluster,16020,1523058606434-splitting is non empty': 
Directory is  not empty
{noformat}
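
For context, the NON_META_FILTER mentioned above skips exactly this kind of file: any WAL 
whose name ends in the .meta suffix. A minimal sketch of the filter's effect, assuming only 
that naming convention (this is not the exact MasterFileSystem code):

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Sketch: the non-meta split filter accepts ordinary WALs and skips meta WALs,
// which by convention carry the ".meta" suffix seen in the file name above.
public class NonMetaWalFilter implements PathFilter {
  private static final String META_WAL_SUFFIX = ".meta";

  @Override
  public boolean accept(Path p) {
    return !p.getName().endsWith(META_WAL_SUFFIX);
  }
}
{code}

So a split run under that filter can report success for the directory while leaving a meta 
WAL behind, which is what the WARN above shows.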

Something bad has happened here. There is a lingering meta WAL with a 
suspicious name:

{{hdfs://node-1.cluster/hbase/WALs/node-3.cluster,16020,1523058606434-splitting/node-3.cluster%2C16020%2C1523058606434.meta.1523058616638.meta}}

I think earlier we killed a regionserver that was in the middle of recovering 
the meta WAL. I'm not yet sure what went wrong, but the file was renamed, not 
archived, and isn't getting processed now. 
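
In case it helps anyone else chasing this, a quick way to spot leftovers of this kind is to 
scan the -splitting directories for .meta files. This is only a hypothetical diagnostic; the 
/hbase/WALs root and the .meta suffix are taken from the logs above:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical diagnostic: list meta WALs left behind in *-splitting directories.
public class FindLeftoverMetaWals {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path walRoot = new Path("/hbase/WALs");
    for (FileStatus dir : fs.listStatus(walRoot)) {
      // Only look at directories of crashed servers still pending split.
      if (!dir.isDirectory() || !dir.getPath().getName().endsWith("-splitting")) {
        continue;
      }
      for (FileStatus f : fs.listStatus(dir.getPath())) {
        if (f.getPath().getName().endsWith(".meta")) {
          System.out.println("Lingering meta WAL: " + f.getPath());
        }
      }
    }
  }
}
{code}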



> SplitLogManager does not clean files from WALs folder in case of master 
> failover
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-14729
>                 URL: https://issues.apache.org/jira/browse/HBASE-14729
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.3.2, 1.4.3
>            Reporter: Samir Ahmic
>            Assignee: Samir Ahmic
>            Priority: Major
>         Attachments: HBASE-14729.patch
>
>
> While I was testing the master failover process on the master branch 
> (distributed cluster setup) I noticed the following:
> 1. The list of dead regionservers was increasing every time the active 
> master was restarted.
> 2. The number of folders in the /hbase/WALs folder was increasing every 
> time the active master was restarted.
> Here is the exception from the master logs showing why this is happening:
> {code}
> 2015-10-30 09:41:49,238 INFO  [ProcedureExecutor-3] master.SplitLogManager: 
> finished splitting (more than or equal to) 0 bytes in 0 log files in 
> [hdfs://P3cluster/hbase/WALs/hnode1,16000,1446043659224-splitting] in 21ms
> 2015-10-30 09:41:49,235 WARN  [ProcedureExecutor-2] master.SplitLogManager: 
> Returning success without actually splitting and deleting all the log files 
> in path hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting: 
> [FileStatus{path=hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting/hnode1%2C16000%2C1446046595488.meta.1446046691314.meta;
>  isDirectory=false; length=39944; replication=3; blocksize=268435456; 
> modification_time=1446050348104; access_time=1446046691317; owner=hbase; 
> group=supergroup; permission=rw-r--r--; isSymlink=false}]
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException):
>  `/hbase/WALs/hnode1,16000,1446046595488-splitting is non empty': Directory 
> is not empty
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:3524)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:3479)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3463)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:751)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:562)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1411)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>       at com.sun.proxy.$Proxy15.delete(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:490)
>       at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>       at com.sun.proxy.$Proxy16.delete(Unknown Source)
>       at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
>       at com.sun.proxy.$Proxy17.delete(Unknown Source)
>       at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
>       at com.sun.proxy.$Proxy17.delete(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1726)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:588)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:584)
>       at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:584)
>       at 
> org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:297)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:400)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:373)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:295)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.splitLogs(ServerCrashProcedure.java:388)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:228)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:72)
>       at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:119)
>       at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:452)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1050)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:841)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:794)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:75)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.run(ProcedureExecutor.java:479)
> {code}
> I have tracked the exception to this line in SplitLogManager#splitLogDistributed:
> {code}
> 297        if (fs.exists(logDir) && !fs.delete(logDir, false))
> {code}
> Since we are removing a folder we need to delete it recursively, so this 
> line should be:
> {code}
>  297        if (fs.exists(logDir) && !fs.delete(logDir, true))
> {code} 
> This solved the issue. I will attach a patch after some additional testing.


