[ https://issues.apache.org/jira/browse/HBASE-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429189#comment-16429189 ]
Andrew Purtell edited comment on HBASE-14729 at 4/7/18 1:30 AM: ---------------------------------------------------------------- I wish I could say I know this code, but I don't. If you have a sec [~stack] I wonder if you could give me a pointer or two on what to look at next. In branch-1, if the server hosting meta crashes, and the new RS assigned to meta crashes again before meta is recovered and open again, do we keep tracking the need for a meta WAL recovery through this chain, or do we manage to lose the plot and orphan a meta wal, because it kind of looks like that. was (Author: apurtell): I wish I could say I know this code, but I don't. If you have a sec [~stack] I wonder if you could give me a pointer or two on what to look at next. In branch-1, if the server hosting meta crashes, and the new RS for meta crashes again before meta is recovered, do we keep tracking the split through this chain, or do we manage to lose the plot and orphan a meta wal, because it kind of looks like that. > SplitLogManager does not clean files from WALs folder in case of master > failover > -------------------------------------------------------------------------------- > > Key: HBASE-14729 > URL: https://issues.apache.org/jira/browse/HBASE-14729 > Project: HBase > Issue Type: Bug > Components: wal > Affects Versions: 1.3.2, 1.4.3 > Reporter: Samir Ahmic > Assignee: Samir Ahmic > Priority: Major > Attachments: HBASE-14729.patch > > > While i was testing master failover process on master branch (distributed > cluster setup) i notice following: > 1. List of dead regionservers was increasing every time active master was > restarted. > 2. Number of folders in /hbase/WALs folder was increasing every time active > master was restarted > Here is exception from master logs showing why this is happening: > {code} > 2015-10-30 09:41:49,238 INFO [ProcedureExecutor-3] master.SplitLogManager: > finished splitting (more than or equal to) 0 bytes in 0 log files in > [hdfs://P3cluster/hbase/WALs/hnode1,16000,1446043659224-splitting] in 21ms > 2015-10-30 09:41:49,235 WARN [ProcedureExecutor-2] master.SplitLogManager: > Returning success without actually splitting and deleting all the log files > in path hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting: > [FileStatus{path=hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting/hnode1%2C16000%2C1446046595488.meta.1446046691314.meta; > isDirectory=false; length=39944; replication=3; blocksize=268435456; > modification_time=1446050348104; access_time=1446046691317; owner=hbase; > group=supergroup; permission=rw-r--r--; isSymlink=false}] > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException): > `/hbase/WALs/hnode1,16000,1446046595488-splitting is non empty': Directory > is not empty > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:3524) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:3479) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3463) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:751) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:562) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > at org.apache.hadoop.ipc.Client.call(Client.java:1411) > at org.apache.hadoop.ipc.Client.call(Client.java:1364) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy15.delete(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:490) > at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy16.delete(Unknown Source) > at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279) > at com.sun.proxy.$Proxy17.delete(Unknown Source) > at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279) > at com.sun.proxy.$Proxy17.delete(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1726) > at > org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:588) > at > org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:584) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:584) > at > org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:297) > at > org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:400) > at > org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:373) > at > org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:295) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.splitLogs(ServerCrashProcedure.java:388) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:228) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:72) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:119) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:452) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1050) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:841) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:794) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:75) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.run(ProcedureExecutor.java:479) > {code} > I have tracked exception to this line in SplitLogManager#splitLogDistributed > {code} > 297 if (fs.exists(logDir) && !fs.delete(logDir, false)) > {code} > Since we are removing folder we need to delete recursively so this line > shoud be: > {code} > 297 if (fs.exists(logDir) && !fs.delete(logDir, true)) > {code} > This solved issue. I will attach patch after some additional testing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)