[ https://issues.apache.org/jira/browse/HBASE-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429189#comment-16429189 ]

Andrew Purtell edited comment on HBASE-14729 at 4/7/18 1:30 AM:
----------------------------------------------------------------

I wish I could say I know this code, but I don't. If you have a sec [~stack] I 
wonder if you could give me a pointer or two on what to look at next. 

In branch-1, if the server hosting meta crashes, and the new RS assigned to 
meta crashes again before meta is recovered and open again, do we keep tracking 
the need for a meta WAL recovery through this chain, or do we lose the plot 
and orphan a meta WAL? It kind of looks like the latter.
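
A toy sketch of the hazard I mean (hypothetical helper, not the actual 
branch-1 code; {{fs}} and {{logDir}} are assumed to be the filesystem and 
-splitting directory as in SplitLogManager): if splitting reports success 
while a meta WAL still sits in the -splitting directory, a blind recursive 
delete would silently discard its edits.
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical guard, for illustration only: refuse to recursively delete a
// -splitting dir that still holds an unsplit meta WAL.
final class SplittingDirCleanup {
  static void cleanup(FileSystem fs, Path logDir) throws IOException {
    for (FileStatus f : fs.listStatus(logDir)) {
      if (f.getPath().getName().contains(".meta.")) {
        // An unsplit meta WAL remains; deleting now would orphan meta edits.
        // The split task should be resubmitted instead.
        return;
      }
    }
    fs.delete(logDir, true);
  }
}
{code}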


> SplitLogManager does not clean files from WALs folder in case of master 
> failover
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-14729
>                 URL: https://issues.apache.org/jira/browse/HBASE-14729
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.3.2, 1.4.3
>            Reporter: Samir Ahmic
>            Assignee: Samir Ahmic
>            Priority: Major
>         Attachments: HBASE-14729.patch
>
>
> While I was testing the master failover process on the master branch 
> (distributed cluster setup) I noticed the following:
> 1. The list of dead regionservers grew every time the active master was 
> restarted.
> 2. The number of folders in /hbase/WALs grew every time the active master 
> was restarted.
> Here is the exception from the master logs showing why this happens:
> {code}
> 2015-10-30 09:41:49,238 INFO  [ProcedureExecutor-3] master.SplitLogManager: 
> finished splitting (more than or equal to) 0 bytes in 0 log files in 
> [hdfs://P3cluster/hbase/WALs/hnode1,16000,1446043659224-splitting] in 21ms
> 2015-10-30 09:41:49,235 WARN  [ProcedureExecutor-2] master.SplitLogManager: 
> Returning success without actually splitting and deleting all the log files 
> in path hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting: 
> [FileStatus{path=hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting/hnode1%2C16000%2C1446046595488.meta.1446046691314.meta;
>  isDirectory=false; length=39944; replication=3; blocksize=268435456; 
> modification_time=1446050348104; access_time=1446046691317; owner=hbase; 
> group=supergroup; permission=rw-r--r--; isSymlink=false}]
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException):
>  `/hbase/WALs/hnode1,16000,1446046595488-splitting is non empty': Directory 
> is not empty
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:3524)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:3479)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3463)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:751)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:562)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1411)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>       at com.sun.proxy.$Proxy15.delete(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:490)
>       at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>       at com.sun.proxy.$Proxy16.delete(Unknown Source)
>       at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
>       at com.sun.proxy.$Proxy17.delete(Unknown Source)
>       at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
>       at com.sun.proxy.$Proxy17.delete(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1726)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:588)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:584)
>       at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:584)
>       at 
> org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:297)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:400)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:373)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:295)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.splitLogs(ServerCrashProcedure.java:388)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:228)
>       at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:72)
>       at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:119)
>       at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:452)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1050)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:841)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:794)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:75)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.run(ProcedureExecutor.java:479)
> {code}
> I have tracked the exception to this line in 
> SplitLogManager#splitLogDistributed:
> {code}
> 297        if (fs.exists(logDir) && !fs.delete(logDir, false))
> {code}
> Since we are removing a folder we need to delete recursively, so this line 
> should be:
> {code}
> 297        if (fs.exists(logDir) && !fs.delete(logDir, true))
> {code}
> This solved the issue. I will attach a patch after some additional testing.
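> For reference, a minimal standalone sketch of the delete semantics at play 
> (hypothetical paths; the local filesystem stands in for HDFS, which 
> surfaces PathIsNotEmptyDirectoryException as in the trace above):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class DeleteSemanticsDemo {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.getLocal(new Configuration());
>     Path dir = new Path("/tmp/wals-splitting-demo");
>     fs.mkdirs(dir);
>     fs.create(new Path(dir, "leftover.meta.wal")).close();
>     try {
>       // recursive=false fails on a non-empty directory, as in the log above
>       fs.delete(dir, false);
>     } catch (java.io.IOException e) {
>       System.out.println("non-recursive delete failed: " + e.getMessage());
>     }
>     // recursive=true removes the directory and its contents
>     System.out.println("recursive delete: " + fs.delete(dir, true));
>   }
> }
> {code}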


