[ https://issues.apache.org/jira/browse/HBASE-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038623#comment-15038623 ]

Enis Soztutar commented on HBASE-14223:
---------------------------------------

I have figured out the root cause of the test failure. The regionserver aborts 
when meta is opened again on the same regionserver at a later time, since we 
are closing the WALProvider and FSHLog. FSHLog assumes it is created only once 
in a regionserver's lifecycle, so it throws this exception: 
{code}
ABORTING region server localhost,65058,1449097601151: Unhandled: Target WAL already exists within directory hdfs://localhost:65046/user/enis/test-data/634ae5f2-2249-42a3-aec8-8a0bc47a8a51/WALs/localhost,6
Cause:
java.io.IOException: Target WAL already exists within directory hdfs://localhost:65046/user/enis/test-data/634ae5f2-2249-42a3-aec8-8a0bc47a8a51/WALs/localhost,65058,1449097601151
  at org.apache.hadoop.hbase.regionserver.wal.FSHLog.<init>(FSHLog.java:491)
  at org.apache.hadoop.hbase.wal.DefaultWALProvider.init(DefaultWALProvider.java:97)
  at org.apache.hadoop.hbase.wal.WALFactory.getProvider(WALFactory.java:147)
  at org.apache.hadoop.hbase.wal.WALFactory.getMetaWAL(WALFactory.java:239)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:1850)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad(HRegionServer.java:1203)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1145)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:962)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:156)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$000(MiniHBaseCluster.java:108)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:140)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:356)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
  at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:334)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:138)
  at java.lang.Thread.run(Thread.java:745)
{code}
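For context, a minimal sketch of the kind of single-creation check being 
tripped here. This is not the actual FSHLog constructor code at 
FSHLog.java:491, just an illustration under the assumption that it scans the 
server's WAL directory for an existing file with its prefix (class and method 
names below are hypothetical): 
{code}
// Minimal sketch (hypothetical, not the real FSHLog code) of the
// single-creation assumption: refuse to start if a WAL with the same
// prefix already exists in this regionserver's WAL directory.
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWalCheck {
  static void checkNoExistingWal(FileSystem fs, Path walDir, String walPrefix)
      throws IOException {
    if (!fs.exists(walDir)) {
      return; // first WAL creation in this regionserver's lifecycle
    }
    for (FileStatus stat : fs.listStatus(walDir)) {
      if (stat.getPath().getName().startsWith(walPrefix)) {
        // mirrors the abort above when the meta WAL is created a second time
        throw new IOException(
            "Target WAL already exists within directory " + walDir);
      }
    }
  }
}
{code}
If the check is keyed off the directory contents rather than the provider 
instance, as the message suggests, then re-creating the provider on the same 
live server fails even though the first instance was deliberately closed. 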

Interestingly, the test in the patch already checks whether meta can be opened 
on the same regionserver that previously closed it, but the regionserver was 
apparently aborting without causing the test to fail. 

There is also a race condition between the closing of the region (and hence 
its WAL) and regionServerReport(), which calls HRegionServer.getWAL() and can 
re-open the closed meta WAL. 
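To make the race concrete, here is a minimal sketch with stand-in types and 
names (the real paths are the meta close and HRegionServer.getWAL() called 
from regionServerReport(); none of the names below are the actual code): 
{code}
// Minimal sketch (stand-in names, not the real HRegionServer code) of the
// check-then-act race: the report thread can lazily re-create the meta WAL
// while, or just after, the close path tears it down.
import java.io.Closeable;
import java.io.IOException;

public class MetaWalRace {
  private volatile Closeable metaWalProvider; // stand-in for the meta WALProvider

  // Region-close path: shut down and discard the meta WAL provider.
  void closeMetaWal() throws IOException {
    Closeable p = metaWalProvider;
    metaWalProvider = null;
    if (p != null) {
      p.close();
    }
  }

  // regionServerReport() -> getWAL() path: lazily (re-)create on demand.
  Closeable getWAL() {
    if (metaWalProvider == null) {        // (1) observes the provider gone
      metaWalProvider = createProvider(); // (2) re-opens the closed meta WAL
    }
    return metaWalProvider;
  }

  private Closeable createProvider() {
    return () -> { }; // placeholder; the real code builds an FSHLog
  }
}
{code}
Without mutual exclusion between the two paths, the reporter thread can 
resurrect a WAL that the close path has just shut down. 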

I'll revert the patch from all branches for now until we can find a better 
plan. 

> Meta WALs are not cleared if meta region was closed and RS aborts
> -----------------------------------------------------------------
>
>                 Key: HBASE-14223
>                 URL: https://issues.apache.org/jira/browse/HBASE-14223
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4, 1.0.4
>
>         Attachments: HBASE-14223logs, hbase-14223_v0.patch, 
> hbase-14223_v1-branch-1.patch, hbase-14223_v2-branch-1.patch, 
> hbase-14223_v3-branch-1.patch, hbase-14223_v3-branch-1.patch, 
> hbase-14223_v3-master.patch
>
>
> When an RS opens meta and later closes it, the WAL (FSHLog) is not closed. 
> The last WAL file just sits there in the RS WAL directory. If the RS stops 
> gracefully, the WAL file for meta is deleted. Otherwise, if the RS aborts, 
> the WAL for meta is not cleaned up. It is also not split (which is correct), 
> since the master determines that the RS no longer hosts meta at the time of 
> the RS abort. 
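> A minimal sketch of that lifecycle gap (stand-in names, not the actual 
> shutdown code): only the graceful-stop path closes the WAL factory, so an 
> abort leaves the last meta WAL file in place: 
> {code}
> // Minimal sketch (stand-in names) of the asymmetry: graceful stop closes
> // the WAL factory and the meta WAL with it; abort skips that cleanup, so
> // the last meta WAL file stays under /WALs/<server>.
> public class RsShutdownSketch {
>   private final AutoCloseable walFactory; // stand-in for WALFactory
>
>   RsShutdownSketch(AutoCloseable walFactory) {
>     this.walFactory = walFactory;
>   }
>
>   void stopGracefully() throws Exception {
>     walFactory.close(); // meta WAL closed; its last file is deleted
>   }
>
>   void abort() {
>     // no walFactory.close() here: the meta WAL file is left behind, and
>     // the master will not split it since the RS no longer hosts meta.
>   }
> }
> {code}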
> From a cluster after running ITBLL with CM, I see a lot of {{-splitting}} 
> directories left uncleaned: 
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls /apps/hbase/data/WALs
> Found 31 items
> drwxr-xr-x   - hbase hadoop          0 2015-06-05 01:14 /apps/hbase/data/WALs/hregion-58203265
> drwxr-xr-x   - hbase hadoop          0 2015-06-05 07:54 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433489308745-splitting
> drwxr-xr-x   - hbase hadoop          0 2015-06-05 09:28 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433494382959-splitting
> drwxr-xr-x   - hbase hadoop          0 2015-06-05 10:01 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433498252205-splitting
> ...
> {code}
> The directories contain WALs from meta: 
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting
> Found 2 items
> -rw-r--r--   3 hbase hadoop     201608 2015-06-05 03:15 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta
> -rw-r--r--   3 hbase hadoop      44420 2015-06-05 04:36 /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> The RS hosted the meta region for some time: 
> {code}
> 2015-06-05 03:14:28,692 INFO  [PostOpenDeployTasks:1588230740] zookeeper.MetaTableLocator: Setting hbase:meta region location in ZooKeeper as os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285
> ...
> 2015-06-05 03:15:17,302 INFO  [RS_CLOSE_META-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion: Closed hbase:meta,,1.1588230740
> {code}
> In between, a WAL is created: 
> {code}
> 2015-06-05 03:15:11,707 INFO  [RS_OPEN_META-os-enis-dal-test-jun-4-5:16020-0-MetaLogRoller] wal.FSHLog: Rolled WAL /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta with entries=385, filesize=196.88 KB; new WAL /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> When CM later killed the region server, the master did not see these WAL files: 
> {code}
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:46,075 INFO  [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0] master.SplitLogManager: started splitting 2 logs in [hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting] for [os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285]
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:47,300 INFO  [main-EventThread] wal.WALSplitter: Archived processed log hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475074436 to hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475074436
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,497 INFO  [main-EventThread] wal.WALSplitter: Archived processed log hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475175329 to hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475175329
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,507 WARN  [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,508 INFO  [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0] master.SplitLogManager: finished splitting (more than or equal to) 129135000 bytes in 2 log files in [hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting] in 4433ms
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
