Jasee Tao created HBASE-26120:
---------------------------------
Summary: New replication gets stuck or data loss when multiwal
groups more than 10
Key: HBASE-26120
URL: https://issues.apache.org/jira/browse/HBASE-26120
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.4.5, 1.7.1
Reporter: Jasee Tao
{code:java}
void preLogRoll(Path newLog) throws IOException {
  recordLog(newLog);
  String logName = newLog.getName();
  String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
  synchronized (latestPaths) {
    Iterator<Path> iterator = latestPaths.iterator();
    while (iterator.hasNext()) {
      Path path = iterator.next();
      if (path.getName().contains(logPrefix)) {
        iterator.remove();
        break;
      }
    }
    this.latestPaths.add(newLog);
  }
}
{code}
ReplicationSourceManager uses latestPaths to track the latest WAL of each WAL
group, and all of them are enqueued for replication when a new replication peer
is added.
If hbase.wal.regiongrouping.numgroups is set to more than 10, say 12, the WAL
group names run from regionserver.null0.timestamp to
regionserver.null11.timestamp. preLogRoll uses String.contains to find and
replace the old log of the same group, so when regionserver.null1.ts arrives,
regionserver.null11.ts may be removed instead, and latestPaths grows with the
wrong logs. Replication then gets partly stuck because regionserver.null1.ts
does not exist on HDFS, and data may not be replicated to the slave cluster
because regionserver.null11.ts is not in the replication queue at startup.
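
The collision can be reproduced without HBase. The sketch below is only an
illustration under stated assumptions: WAL names are modelled as plain strings
of the form prefix.timestamp, prefixOf is a hypothetical stand-in for
DefaultWALProvider.getWALPrefixFromWALName (assumed here to drop everything
after the last dot), and rollWithExactPrefix shows one possible way to avoid
the mismatch, not necessarily the fix that will be committed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class WalPrefixMatchSketch {

    // Hypothetical stand-in for getWALPrefixFromWALName: assumes the name is
    // "<prefix>.<timestamp>" and returns everything before the last dot.
    static String prefixOf(String walName) {
        return walName.substring(0, walName.lastIndexOf('.'));
    }

    // Mirrors the matching logic quoted in the report: substring match.
    static void rollWithContains(List<String> latest, String newLog) {
        String prefix = prefixOf(newLog);
        Iterator<String> it = latest.iterator();
        while (it.hasNext()) {
            if (it.next().contains(prefix)) {
                it.remove();
                break;
            }
        }
        latest.add(newLog);
    }

    // One possible alternative: compare the extracted prefixes for equality.
    static void rollWithExactPrefix(List<String> latest, String newLog) {
        String prefix = prefixOf(newLog);
        Iterator<String> it = latest.iterator();
        while (it.hasNext()) {
            if (prefixOf(it.next()).equals(prefix)) {
                it.remove();
                break;
            }
        }
        latest.add(newLog);
    }

    public static void main(String[] args) {
        List<String> initial =
            Arrays.asList("regionserver.null11.100", "regionserver.null1.100");

        // Substring match removes the null11 log and keeps the stale null1 log.
        List<String> buggy = new ArrayList<>(initial);
        rollWithContains(buggy, "regionserver.null1.200");
        System.out.println("contains: " + buggy);

        // Exact prefix match replaces only the log of the same group.
        List<String> exact = new ArrayList<>(initial);
        rollWithExactPrefix(exact, "regionserver.null1.200");
        System.out.println("equals:   " + exact);
    }
}
```

With the substring match, the list ends up holding two logs for group null1
and none for group null11, which corresponds to the stuck queue and the missing
data described above. With the equality check, each group keeps exactly one
entry.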