[
https://issues.apache.org/jira/browse/HBASE-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987312#comment-14987312
]
Samir Ahmic commented on HBASE-14223:
-------------------------------------
As per [~enis]'s suggestion from
[HBASE-14729|https://issues.apache.org/jira/browse/HBASE-14729] I picked up
his patch for testing, and with a few additions I was able to build HBase on the
master branch. After applying the patch the issue was still present (I was testing on a
distributed cluster: hadoop-2.5.2, zookeeper-3.4.6, hbase-2.0.0-SNAPSHOT),
but I found some interesting things in the logs which may shed some more light
on this issue.
I was tracking a single case where the meta WAL file is not cleared; here is the sequence:
HDFS layout:
{code}
[hadoop@hnode2 ~]$ hadoop fs -ls -R /hbase/WALs
drwxr-xr-x - hbase supergroup 0 2015-11-03 10:14
/hbase/WALs/hnode2,16000,1446541683349-splitting
-rw-r--r-- 3 hbase supergroup 83 2015-11-03 10:13
/hbase/WALs/hnode2,16000,1446541683349-splitting/hnode2%2C16000%2C1446541683349.meta.1446541987763.meta
{code}
A few times in the logs I noticed that we report 0 files in the {{-splitting}}
folder even though the files were on HDFS, so I added some additional logging to
DefaultWALProvider#isMetaFile(); we use this method to filter files from
the WALs folder:
{code}
  public static boolean isMetaFile(String p) {
    if (p != null && p.endsWith(META_WAL_PROVIDER_ID)) {
      LOG.info("***Returning TRUE for: " + p);
      return true;
    }
    LOG.info("***Returning FALSE for: " + p);
    return false;
  }
{code}
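For context, the filter that the master passes down when listing the {{-splitting}} directory is essentially a {{PathFilter}} wrapper around this method. A rough sketch of what the meta/non-meta filters look like (illustrative, not the exact HBase source; it assumes the {{Path}} overload of {{isMetaFile}}):
{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accepts only meta WALs, i.e. file names ending in META_WAL_PROVIDER_ID (".meta").
final PathFilter metaFilter = new PathFilter() {
  @Override
  public boolean accept(Path p) {
    return DefaultWALProvider.isMetaFile(p);
  }
};

// Accepts everything except meta WALs; a filter of this shape is what ends up
// in SplitLogManager#getFileList() for the non-meta split pass.
final PathFilter nonMetaFilter = new PathFilter() {
  @Override
  public boolean accept(Path p) {
    return !DefaultWALProvider.isMetaFile(p);
  }
};
{code}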
And here are the logs from the active master that was processing the WAL files:
{code}
2015-11-03 10:14:32,059 INFO [hnode2:16000.activeMasterManager]
master.MasterFileSystem: Log folder
hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349 doesn't belong to a
known region server, splitting
2015-11-03 10:14:33,686 DEBUG [hnode2:16000.activeMasterManager]
procedure2.ProcedureExecutor: Procedure ServerCrashProcedure
serverName=hnode2,16000,1446541683349, shouldSplitWal=true, carryingMeta=false
id=852 state=RUNNABLE:SERVER_CRASH_START added to the store.
2015-11-03 10:14:33,921 DEBUG [ProcedureExecutor-1]
procedure.ServerCrashProcedure: Splitting logs from hnode2,16000,1446541683349;
region count=0
2015-11-03 10:14:33,944 DEBUG [ProcedureExecutor-1] master.MasterFileSystem:
Renamed region directory:
hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting
2015-11-03 10:14:33,944 INFO [ProcedureExecutor-1] master.SplitLogManager:
dead splitlog workers [hnode2,16000,1446541683349]
2015-11-03 10:14:33,968 INFO [ProcedureExecutor-1] wal.DefaultWALProvider:
***Returning FALSE for: hnode2%2C16000%2C1446541683349.1446541986629
2015-11-03 10:14:33,969 INFO [ProcedureExecutor-1] wal.DefaultWALProvider:
***Returning TRUE for: hnode2%2C16000%2C1446541683349.meta.1446541987763.meta
2015-11-03 10:14:33,969 INFO [ProcedureExecutor-1] master.SplitLogManager:
Started splitting 1 logs in
[hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting] for
[hnode2,16000,1446541683349]
2015-11-03 10:14:34,065 INFO [SplitLogWorker-hnode2:16000]
wal.DefaultWALProvider: ***Returning FALSE for:
WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:34,080 DEBUG [main-EventThread]
coordination.SplitLogManagerCoordination: put up splitlog task at znode
/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:34,123 INFO [main-EventThread]
coordination.SplitLogManagerCoordination: task
/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
acquired by hnode5,16020,1446455348779
2015-11-03 10:14:34,135 INFO [main-EventThread]
coordination.ZkSplitLogWorkerCoordination: task
/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
preempted from hnode2,16000,1446542058824, current task state and owner=OWNED
hnode5,16020,1446455348779
2015-11-03 10:14:34,963 INFO
[hnode2,16000,1446542058824_splitLogManager__ChoreService_1]
master.SplitLogManager: total tasks = 1 unassigned = 0
tasks={/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629=last_update
= 1446542074134 last_version = 2 cur_worker_name = hnode5,16020,1446455348779
status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done =
0 error = 0}
2015-11-03 10:14:38,183 INFO [main-EventThread]
coordination.SplitLogManagerCoordination: task
/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
entered state: DONE hnode5,16020,1446455348779
2015-11-03 10:14:38,199 INFO [main-EventThread] wal.WALSplitter: Archived
processed log
hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting/hnode2%2C16000%2C1446541683349.1446541986629
to hdfs://P3cluster/hbase/oldWALs/hnode2%2C16000%2C1446541683349.1446541986629
2015-11-03 10:14:38,202 INFO [main-EventThread]
coordination.SplitLogManagerCoordination: Done splitting
/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:38,209 WARN [ProcedureExecutor-1] master.SplitLogManager:
Returning success without actually splitting and deleting all the log files in
path hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting:
[FileStatus{path=hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting/hnode2%2C16000%2C1446541683349.meta.1446541987763.meta;
isDirectory=false; length=83; replication=3; blocksize=268435456;
modification_time=1446541987767; access_time=1446541987767; owner=hbase;
group=supergroup; permission=rw-r--r--; isSymlink=false}]
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException):
`/hbase/WALs/hnode2,16000,1446541683349-splitting is non empty': Directory is
not empty
2015-11-03 10:14:38,210 INFO [ProcedureExecutor-1] master.SplitLogManager:
finished splitting (more than or equal to) 83 bytes in 1 log files in
[hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting] in 4241ms
2015-11-03 10:14:38,422 INFO [ProcedureExecutor-0]
procedure.ServerCrashProcedure: Finished processing of crashed
hnode2,16000,1446541683349
2015-11-03 10:14:38,526 DEBUG [ProcedureExecutor-0]
procedure2.ProcedureExecutor: Procedure completed in 4.8450sec:
ServerCrashProcedure serverName=hnode2,16000,1446541683349,
shouldSplitWal=true, carryingMeta=false id=852 state=FINISHED
{code}
What looks wrong here are these lines:
{code}
2015-11-03 10:14:34,065 INFO [SplitLogWorker-hnode2:16000]
wal.DefaultWALProvider: ***Returning FALSE for:
WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:34,080 DEBUG [main-EventThread]
coordination.SplitLogManagerCoordination: put up splitlog task at znode
/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
2015-11-03 10:14:34,123 INFO [main-EventThread]
coordination.SplitLogManagerCoordination: task
/hbase/splitWAL/WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
acquired by hnode5,16020,1446455348779
{code}
We are clearly sending a non-existing WAL file to be processed. I'm not sure why
the file name was deformed (I suspect the PathFilter; see the encoding sketch after
the log below), and the RS which acquires this task throws:
{code}
2015-11-03 10:14:38,118 INFO [RS_LOG_REPLAY_OPS-hnode5:16020-0]
util.FSHDFSUtils: Recovered lease, attempt=1 on
file=hdfs://P3cluster/hbase/WALs/hnode2,16000,1446541683349-splitting/hnode2%2C16000%2C1446541683349.1446541986629
after 4005ms
2015-11-03 10:14:38,120 WARN [RS_LOG_REPLAY_OPS-hnode5:16020-0]
shortcircuit.DomainSocketFactory: error creating DomainSocket
java.net.ConnectException: connect(2) error: No such file or directory when
trying to connect to 'none'
{code}
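The {{%252C}} sequences in those task names look like the full WAL path (which already contains {{%2C}}) being URL-encoded once more when it is turned into a splitWAL znode name, which would explain the "deformed" name above. A small, self-contained illustration of that double encoding (assuming the znode name is produced with {{URLEncoder}}, as {{ZKSplitLog#getEncodedNodeName}} appears to do):
{code}
import java.net.URLEncoder;

public class ZnodeNameDemo {
  public static void main(String[] args) throws Exception {
    // The path relative to the HBase root dir; note the %2C already in the file name.
    String walPath = "WALs/hnode2,16000,1446541683349-splitting/"
        + "hnode2%2C16000%2C1446541683349.1446541986629";
    // Prints the exact string seen in the splitlog task / isMetaFile logs:
    // WALs%2Fhnode2%2C16000%2C1446541683349-splitting%2Fhnode2%252C16000%252C1446541683349.1446541986629
    System.out.println(URLEncoder.encode(walPath, "UTF-8"));
  }
}
{code}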
So I removed the filter parameter from SplitLogManager#getFileList():
{code}
- FileStatus[] logfiles = FSUtils.listStatus(fs, logDir, filter);
+ FileStatus[] logfiles = FSUtils.listStatus(fs, logDir);
{code}
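For context, here is roughly where that line sits in {{SplitLogManager#getFileList()}} (a sketch of the surrounding loop, not the exact source):
{code}
// Sketch only: getFileList() walks each log dir and collects the WAL FileStatus
// entries to split; the one-line change above stops passing the PathFilter to
// FSUtils.listStatus(), so every file in the -splitting directory is picked up.
private FileStatus[] getFileList(final Configuration conf, final List<Path> logDirs,
    final PathFilter filter) throws IOException {
  List<FileStatus> fileStatus = new ArrayList<FileStatus>();
  for (Path logDir : logDirs) {
    final FileSystem fs = logDir.getFileSystem(conf);
    if (!fs.exists(logDir)) {
      LOG.warn(logDir + " doesn't exist. Nothing to do!");
      continue;
    }
    FileStatus[] logfiles = FSUtils.listStatus(fs, logDir); // filter dropped here
    if (logfiles == null || logfiles.length == 0) {
      LOG.info(logDir + " is empty dir, no logs to split");
    } else {
      Collections.addAll(fileStatus, logfiles);
    }
  }
  return fileStatus.toArray(new FileStatus[fileStatus.size()]);
}
{code}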
With this change the issue was gone in my case. I will attach the master logs for both
cases, with and without the filter.
> Meta WALs are not cleared if meta region was closed and RS aborts
> -----------------------------------------------------------------
>
> Key: HBASE-14223
> URL: https://issues.apache.org/jira/browse/HBASE-14223
> Project: HBase
> Issue Type: Bug
> Reporter: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.4
>
> Attachments: hbase-14223_v0.patch
>
>
> When an RS opens meta, and later closes it, the WAL(FSHlog) is not closed.
> The last WAL file just sits there in the RS WAL directory. If RS stops
> gracefully, the WAL file for meta is deleted. Otherwise if RS aborts, WAL for
> meta is not cleaned. It is also not split (which is correct) since master
> determines that the RS no longer hosts meta at the time of RS abort.
> From a cluster after running ITBLL with CM, I see a lot of {{-splitting}}
> directories left uncleaned:
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls
> /apps/hbase/data/WALs
> Found 31 items
> drwxr-xr-x - hbase hadoop 0 2015-06-05 01:14
> /apps/hbase/data/WALs/hregion-58203265
> drwxr-xr-x - hbase hadoop 0 2015-06-05 07:54
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433489308745-splitting
> drwxr-xr-x - hbase hadoop 0 2015-06-05 09:28
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433494382959-splitting
> drwxr-xr-x - hbase hadoop 0 2015-06-05 10:01
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433498252205-splitting
> ...
> {code}
> The directories contain WALs from meta:
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting
> Found 2 items
> -rw-r--r-- 3 hbase hadoop 201608 2015-06-05 03:15
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta
> -rw-r--r-- 3 hbase hadoop 44420 2015-06-05 04:36
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> The RS hosted the meta region for some time:
> {code}
> 2015-06-05 03:14:28,692 INFO [PostOpenDeployTasks:1588230740]
> zookeeper.MetaTableLocator: Setting hbase:meta region location in ZooKeeper
> as os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285
> ...
> 2015-06-05 03:15:17,302 INFO
> [RS_CLOSE_META-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion: Closed
> hbase:meta,,1.1588230740
> {code}
> In between, a WAL is created:
> {code}
> 2015-06-05 03:15:11,707 INFO
> [RS_OPEN_META-os-enis-dal-test-jun-4-5:16020-0-MetaLogRoller] wal.FSHLog:
> Rolled WAL
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta
> with entries=385, filesize=196.88 KB; new WAL
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> When CM killed the region server later master did not see these WAL files:
> {code}
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:46,075
> INFO [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0]
> master.SplitLogManager: started splitting 2 logs in
> [hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting]
> for [os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285]
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:47,300
> INFO [main-EventThread] wal.WALSplitter: Archived processed log
> hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475074436
> to
> hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475074436
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,497
> INFO [main-EventThread] wal.WALSplitter: Archived processed log
> hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475175329
> to
> hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475175329
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,507
> WARN [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0]
> master.SplitLogManager: returning success without actually splitting and
> deleting all the log files in path
> hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,508
> INFO [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0]
> master.SplitLogManager: finished splitting (more than or equal to) 129135000
> bytes in 2 log files in
> [hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting]
> in 4433ms
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)