[ https://issues.apache.org/jira/browse/HBASE-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977425#comment-14977425 ]

Ashu Pachauri commented on HBASE-14699:
---------------------------------------

Here is an extract from the logs (I added some extra logging to debug; you won't 
find it in the code): 
{code}
15/10/27 15:21:20 INFO wal.FSHLog: Rolled WAL /<wal_root>/<hostname>,16020,1445984296081/<hostname>%2C16020%2C1445984296081.null2.1445984298092 with entries=15, filesize=13.96 KB; new WAL /<wal_root>/<hostname>,16020,1445984296081/<hostname>%2C16020%2C1445984296081.null2.1445984480906
15/10/27 15:21:21 INFO regionserver.ReplicationSourceManager: Given key: <hostname>%2C16020%2C1445984296081.null2.1445984480906, Deleting log: <hostname>%2C16020%2C1445984296081.null0.1445984481007
15/10/27 15:21:21 INFO zookeeper.RecoverableZooKeeper: Deleting znode: /hbase/replication/rs/<hostname>,16020,1445984296081/1/<hostname>%2C16020%2C1445984296081.null0.1445984481007, version: -1
{code}

ReplicationSourceManager#cleanOldLogs removes every log older than a given key 
(a log name) purely by sorting the names. That works for DefaultWALProvider, 
where a regionserver writes a single WAL group and names differ only in the 
timestamp suffix. With multiwal it breaks, because the WAL group index comes 
before the timestamp in the name: log0.newtimestamp (the file just rolled) 
sorts before log2.oldtimestamp and is deleted as "older", which is exactly 
what the extract above shows. The cleanup should take the individual logs' 
WAL group and timestamp into account, not just the sorted order of their 
names; a sketch follows.
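
To make the ordering problem concrete, here is a minimal, self-contained Java 
sketch (not the actual HBase code; the walGroup helper and the sample file 
names are illustrative) of how a name-sorted set misclassifies a newer log 
from another WAL group, and how restricting the comparison to the key's own 
group avoids it:
{code}
import java.util.SortedSet;
import java.util.TreeSet;

public class MultiWalCleanupSketch {

  // Illustrative helper: strip the trailing timestamp, so
  // "host%2C16020%2C...null0.1445984481007" yields the wal group
  // prefix "host%2C16020%2C...null0".
  static String walGroup(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  public static void main(String[] args) {
    SortedSet<String> wals = new TreeSet<String>();
    wals.add("host%2C16020%2C1445984296081.null0.1445984481007"); // group 0, NEWER
    wals.add("host%2C16020%2C1445984296081.null2.1445984298092"); // group 2, older

    // The key: wal group 2's file that replication just finished with.
    String key = "host%2C16020%2C1445984296081.null2.1445984480906";

    // Name-only ordering: headSet(key) returns everything that sorts
    // strictly before the key, which wrongly includes group 0's newer
    // log because "null0..." < "null2..." lexicographically.
    System.out.println("deleted by name sort: " + wals.headSet(key));

    // Group-aware ordering: consider only logs from the key's own group;
    // within a single group, names differ only by the timestamp suffix,
    // so the name sort matches the timestamp order.
    SortedSet<String> sameGroup = new TreeSet<String>();
    for (String wal : wals) {
      if (walGroup(wal).equals(walGroup(key))) {
        sameGroup.add(wal);
      }
    }
    System.out.println("deleted group-aware: " + sameGroup.headSet(key));
  }
}
{code}
The first headSet(key) call returns both entries, including group 0's newer 
log, because "null0..." sorts before "null2..."; the group-aware version 
returns only group 2's genuinely older log. Within one group the timestamp 
suffixes have a fixed width, so lexicographic and numeric order agree.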

> Replication crashes regionservers when hbase.wal.provider is set to multiwal
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-14699
>                 URL: https://issues.apache.org/jira/browse/HBASE-14699
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>
> When the hbase.wal.provider is set to multiwal and replication is enabled, 
> the regionservers start crashing with the following exception:
> {code}
> <hostname>,16020,1445495411258: Failed to write replication wal position (filename=<hostname>%2C16020%2C1445495411258.null0.1445495898373, position=1322399)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/<hostname>,16020,1445495411258/1/<hostname>%2C16020%2C1445495411258.null0.1445495898373
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>       at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:429)
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:940)
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:990)
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:984)
>       at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:129)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:177)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:388)
> {code}



