[jira] [Updated] (HBASE-26120) New replication gets stuck or data loss when multiwal groups more than 10

Jasee Tao (Jira) Sun, 25 Jul 2021 20:01:06 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jasee Tao updated HBASE-26120:
------------------------------
    Description: 
{code:java}
void preLogRoll(Path newLog) throws IOException {
  recordLog(newLog);
  String logName = newLog.getName();
  String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
  synchronized (latestPaths) {
    Iterator<Path> iterator = latestPaths.iterator();
    while (iterator.hasNext()) {
      Path path = iterator.next();
      if (path.getName().contains(logPrefix)) {
        iterator.remove();
        break;
      }
    }
    this.latestPaths.add(newLog);
  }
}
{code}
ReplicationSourceManager uses _latestPaths_ to track the latest WAL of each 
WAL group, and all of them are enqueued for replication when a new replication 
peer is added.

If we set hbase.wal.regiongrouping.numgroups > 10, say 11, the WAL names of 
the groups will range from _regionserver.null0.timestamp_ to 
_regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ to 
replace the old log of the same group, so when _regionserver.null1.ts_ comes, 
_regionserver.null11.ts_ may be removed instead, and *_latestPaths_ keeps 
growing with wrong logs*.
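
The collision can be reproduced outside HBase with plain strings. The sketch below is only an illustration: _prefixOf_ is a simplified, hypothetical stand-in for _getWALPrefixFromWALName_ (it just strips the trailing timestamp), a list replaces the real _latestPaths_ collection, and the names and timestamps are made up. The removal loop itself mirrors the one in _preLogRoll_.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ContainsCollisionDemo {
  // Hypothetical, simplified prefix extraction: drop the trailing ".<timestamp>".
  static String prefixOf(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  // Same loop shape as preLogRoll, on plain names instead of Path objects.
  static List<String> roll(List<String> latest, String newLog) {
    List<String> result = new ArrayList<>(latest);
    String prefix = prefixOf(newLog);
    Iterator<String> it = result.iterator();
    while (it.hasNext()) {
      // "regionserver.null11.100" contains "regionserver.null1", so it matches too.
      if (it.next().contains(prefix)) {
        it.remove();
        break;
      }
    }
    result.add(newLog);
    return result;
  }

  public static void main(String[] args) {
    // Iteration order is chosen here so that the null11 entry is visited first.
    List<String> latest =
        Arrays.asList("regionserver.null11.100", "regionserver.null1.100");
    // Prints [regionserver.null1.100, regionserver.null1.200]
    System.out.println(roll(latest, "regionserver.null1.200"));
  }
}
```

After the roll, the group null11 has no entry at all, while the group null1 has two, one of them stale. Whether this happens on a given region server depends on the iteration order of the real collection.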

Replication then gets partly stuck because _regionserver.null1.ts_ no longer 
exists on HDFS, and data may not be replicated to the slave because 
_regionserver.null11.ts_ is not in the replication queue at startup.
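
One possible way to avoid the collision is to compare the extracted prefixes for equality instead of using a substring match. This is a sketch under assumptions, not necessarily the fix adopted for this issue: _prefixOf_ is again a hypothetical, simplified stand-in for _getWALPrefixFromWALName_, and plain strings replace _Path_ objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PrefixEqualityDemo {
  // Hypothetical, simplified prefix extraction: drop the trailing ".<timestamp>".
  static String prefixOf(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  // Replace the previous log of the same group only when the prefixes are equal.
  static List<String> roll(List<String> latest, String newLog) {
    List<String> result = new ArrayList<>(latest);
    String prefix = prefixOf(newLog);
    Iterator<String> it = result.iterator();
    while (it.hasNext()) {
      // "regionserver.null11" does not equal "regionserver.null1", so it is kept.
      if (prefixOf(it.next()).equals(prefix)) {
        it.remove();
        break;
      }
    }
    result.add(newLog);
    return result;
  }

  public static void main(String[] args) {
    List<String> latest =
        Arrays.asList("regionserver.null11.100", "regionserver.null1.100");
    // Prints [regionserver.null11.100, regionserver.null1.200]
    System.out.println(roll(latest, "regionserver.null1.200"));
  }
}
```

With the equality check, each group keeps exactly one entry regardless of iteration order.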

  was:
{code:java}
void preLogRoll(Path newLog) throws IOException {
  recordLog(newLog);
  String logName = newLog.getName();
  String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
  synchronized (latestPaths) {
    Iterator<Path> iterator = latestPaths.iterator();
    while (iterator.hasNext()) {
      Path path = iterator.next();
      if (path.getName().contains(logPrefix)) {
        iterator.remove();
        break;
      }
    }
    this.latestPaths.add(newLog);
  }
}
{code}
ReplicationSourceManager use latestPaths to track each walgroup's last WALlog 
and all of them will be enqueue for replication when new replication  peer 
added。

If we set hbase.wal.regiongrouping.numgroups > 10, says 11, the name of WALlog 
group will be regionserver.null0.timestamp to 
regionserver.null1.timestamp。String.contains is used in preoLogRoll to replace 
old logs in same group, leads when regionserver.null1.ts comes, 
regionserver.null11.ts may be replaced, and latestPaths growing with wrong logs.

Replication then partly stuckd as regionsserver.null1.ts not exists on hdfs, 
and data may not be replicated to slave as regionserver.null11.ts not in 
replication queue at startup.


> New replication gets stuck or data loss when multiwal groups more than 10
> -------------------------------------------------------------------------
>
>                 Key: HBASE-26120
>                 URL: https://issues.apache.org/jira/browse/HBASE-26120
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.7.1, 2.4.5
>            Reporter: Jasee Tao
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
