[jira] [Updated] (HBASE-26120) New replication gets stuck or data loss when multiwal groups more than 10

Jasee Tao (Jira) Mon, 26 Jul 2021 18:40:06 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jasee Tao updated HBASE-26120:
------------------------------
    Description: 
{code:java}
void preLogRoll(Path newLog) throws IOException {
  recordLog(newLog);
  String logName = newLog.getName();
  String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
  synchronized (latestPaths) {
    Iterator<Path> iterator = latestPaths.iterator();
    while (iterator.hasNext()) {
      Path path = iterator.next();
      if (path.getName().contains(logPrefix)) {
        iterator.remove();
        break;
      }
    }
    this.latestPaths.add(newLog);
  }
}
{code}
ReplicationSourceManager uses _latestPaths_ to track the latest WAL of each WAL 
group; all of them are enqueued for replication when a new replication peer is 
added.

If we set hbase.wal.regiongrouping.numgroups > 10, say 12, the WAL names of the 
groups range from _regionserver.null0.timestamp_ to 
_regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ to 
replace the old log of the same group, so when _regionserver.null1.ts_ arrives, 
_regionserver.null11.ts_ may be replaced instead, and *_latestPaths_ grows with 
wrong logs*.
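
The collision can be reproduced without HBase. The following is only a minimal sketch: _walPrefix_ is a hypothetical stand-in for _DefaultWALProvider.getWALPrefixFromWALName_ that assumes a WAL name of the form prefix.timestamp, and the timestamps are made up. It contrasts the substring test with a comparison of whole prefixes, which is one possible stricter match and not necessarily the fix that will be committed.

```java
public class WalPrefixMatchDemo {
  // Hypothetical stand-in for DefaultWALProvider.getWALPrefixFromWALName:
  // assumes the WAL name is "<prefix>.<timestamp>" and drops the last component.
  static String walPrefix(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  // The test used in preLogRoll above: is the new log's prefix a substring
  // of the existing WAL name?
  static boolean matchesByContains(String existingWal, String newWal) {
    return existingWal.contains(walPrefix(newWal));
  }

  // Stricter alternative (an assumption, not the committed fix):
  // both names must have exactly the same prefix.
  static boolean matchesByPrefixEquality(String existingWal, String newWal) {
    return walPrefix(existingWal).equals(walPrefix(newWal));
  }

  public static void main(String[] args) {
    String group11 = "regionserver.null11.1627000000000";
    String newGroup1 = "regionserver.null1.1627000005000";

    // "regionserver.null1" is a substring of "regionserver.null11...."
    System.out.println(matchesByContains(group11, newGroup1));       // true
    // "regionserver.null11" is not equal to "regionserver.null1"
    System.out.println(matchesByPrefixEquality(group11, newGroup1)); // false
  }
}
```

With ten groups or fewer the suffixes are single digits, so no group prefix is a substring of another group's WAL name and the two tests agree.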

Replication then gets partly stuck because _regionserver.null1.ts_ does not 
exist on HDFS, and data may not be replicated to the slave because 
_regionserver.null11.ts_ is not in the replication queue at startup.
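
How _latestPaths_ ends up in this state can be simulated with plain strings. This is a rough sketch under stated assumptions: a _List<String>_ stands in for _latestPaths_, _walPrefix_ is a hypothetical helper that strips the trailing timestamp, and the iteration order is chosen so that the _null11_ entry is visited first (in the real collection the order may differ, which is why the wrong entry is only sometimes removed).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LatestPathsSimulation {
  // Hypothetical helper: assumes "<prefix>.<timestamp>" and drops the timestamp.
  static String walPrefix(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  // Mirrors the loop in preLogRoll: remove the first entry judged to be in the
  // same group as newLog, then add newLog. exactMatch switches between the
  // substring test (false) and whole-prefix equality (true).
  static void roll(List<String> latest, String newLog, boolean exactMatch) {
    String prefix = walPrefix(newLog);
    Iterator<String> it = latest.iterator();
    while (it.hasNext()) {
      String name = it.next();
      boolean sameGroup = exactMatch
          ? walPrefix(name).equals(prefix)
          : name.contains(prefix);
      if (sameGroup) {
        it.remove();
        break;
      }
    }
    latest.add(newLog);
  }

  // Start with one WAL for group 11 and one for group 1, then roll group 1.
  static List<String> simulate(boolean exactMatch) {
    List<String> latest = new ArrayList<>();
    latest.add("regionserver.null11.100");
    latest.add("regionserver.null1.100");
    roll(latest, "regionserver.null1.200", exactMatch);
    return latest;
  }

  public static void main(String[] args) {
    // Substring test: group 11 is dropped and the stale group 1 WAL is kept.
    System.out.println(simulate(false));
    // prints [regionserver.null1.100, regionserver.null1.200]

    // Whole-prefix equality: only the old group 1 WAL is replaced.
    System.out.println(simulate(true));
    // prints [regionserver.null11.100, regionserver.null1.200]
  }
}
```

In the first result the stale _regionserver.null1.100_ corresponds to the log that replication later fails to find, and the missing _regionserver.null11.100_ corresponds to the log that is never enqueued for a newly added peer.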

Because of [ZOOKEEPER-706|https://issues.apache.org/jira/browse/ZOOKEEPER-706], 
if there are too many logs under the ZooKeeper znode 
_/hbase/replication/rs/regionserver/peer_, remove_peer may fail to delete this 
znode, and other regionservers cannot pick up this queue for replication 
failover.

  was:
{code:java}
void preLogRoll(Path newLog) throws IOException {
  recordLog(newLog);
  String logName = newLog.getName();
  String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
  synchronized (latestPaths) {
    Iterator<Path> iterator = latestPaths.iterator();
    while (iterator.hasNext()) {
      Path path = iterator.next();
      if (path.getName().contains(logPrefix)) {
        iterator.remove();
        break;
      }
    }
    this.latestPaths.add(newLog);
  }
}
{code}
ReplicationSourceManager uses _latestPaths_ to track the latest WAL of each WAL 
group; all of them are enqueued for replication when a new replication peer is 
added.

If we set hbase.wal.regiongrouping.numgroups > 10, say 11, the WAL names of the 
groups range from _regionserver.null0.timestamp_ to 
_regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ to 
replace the old log of the same group, so when _regionserver.null1.ts_ arrives, 
_regionserver.null11.ts_ may be replaced instead, and *_latestPaths_ grows with 
wrong logs*.

Replication then gets partly stuck because _regionserver.null1.ts_ does not 
exist on HDFS, and data may not be replicated to the slave because 
_regionserver.null11.ts_ is not in the replication queue at startup.


> New replication gets stuck or data loss when multiwal groups more than 10
> -------------------------------------------------------------------------
>
>                 Key: HBASE-26120
>                 URL: https://issues.apache.org/jira/browse/HBASE-26120
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.7.1, 2.4.5
>            Reporter: Jasee Tao
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
