Apache9 commented on PR #5443:
URL: https://github.com/apache/hbase/pull/5443#issuecomment-1751996650
```
2023-10-08T06:42:20,699 INFO [RS:0;c85796445aa7:36275 {}] wal.AbstractFSWAL(980): New WAL /user/jenkins/test-data/daf7aa3e-9117-b174-d093-9e8787b5dfb7/WALs/c85796445aa7,36275,1696747321009/c85796445aa7%2C36275%2C1696747321009-1696747340579-1.1696747340583.syncrep
2023-10-08T06:42:20,708 DEBUG [RS:0;c85796445aa7:36275 {}] regionserver.ReplicationSourceManager(789): Start tracking logs for wal group c85796445aa7%2C36275%2C1696747321009-1696747340579-1 for peer 1
2023-10-08T06:42:20,709 DEBUG [RS:0;c85796445aa7:36275 {}] regionserver.ReplicationSource(369): peerId=1, starting shipping worker for walGroupId=c85796445aa7%2C36275%2C1696747321009-1696747340579-1
2023-10-08T06:42:20,709 INFO [RS:0;c85796445aa7:36275 {}] regionserver.ReplicationSourceWALReader(109): peerClusterZnode=1-c85796445aa7,36275,1696747321009, ReplicationSourceWALReaderThread : 1 inited, replicationBatchSizeCapacity=102400, replicationBatchCountCapacity=25000, replicationBatchQueueCapacity=1
2023-10-08T06:42:20,710 DEBUG [RS:0;c85796445aa7:36275.replicationSource.wal-reader.c85796445aa7%2C36275%2C1696747321009-1696747340579-1,1-c85796445aa7,36275,1696747321009 {}] regionserver.WALEntryStream(249): Creating new reader hdfs://localhost:32875/user/jenkins/test-data/daf7aa3e-9117-b174-d093-9e8787b5dfb7/WALs/c85796445aa7,36275,1696747321009/c85796445aa7%2C36275%2C1696747321009-1696747340579-1.1696747340583.syncrep, startPosition=0, beingWritten=false
```
As you can see, the beingWritten flag for a newly created WAL file is
false, which is obviously incorrect.
I think the problem is specific to sync replication:
```
private WAL getRemoteWAL(RegionInfo region, String peerId, String remoteWALDir)
  throws IOException {
  Optional<WAL> opt = peerId2WAL.get(peerId);
  if (opt != null) {
    return opt.orElse(null);
  }
  Lock lock = createLock.acquireLock(peerId);
  try {
    opt = peerId2WAL.get(peerId);
    if (opt != null) {
      return opt.orElse(null);
    }
    WAL wal = createRemoteWAL(region,
      ReplicationUtils.getRemoteWALFileSystem(conf, remoteWALDir),
      ReplicationUtils.getPeerRemoteWALDir(remoteWALDir, peerId),
      getRemoteWALPrefix(peerId),
      ReplicationUtils.SYNC_WAL_SUFFIX);
    initWAL(wal);
    peerId2WAL.put(peerId, Optional.of(wal));
    return wal;
  } finally {
    lock.unlock();
  }
}
```
Here we init the WAL before putting it into the peerId2WAL map. WAL.init
calls rollWriter, which inserts the new WAL file into the replication queue.
If we test whether the file is being written before the WAL has been put
into the peerId2WAL map, we get false, which is incorrect.
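The ordering problem can be reproduced in a small, single-threaded model. The sketch below is only an illustration under stated assumptions: `ToyWal`, `ToyProvider`, `isBeingWritten`, `initThenPublish` and `publishThenInit` are hypothetical names, not HBase classes or methods, and the "being written" check is run synchronously from the roll callback, whereas in the logs above it happens on the wal-reader thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

public class WalPublishOrderSketch {

  /** Hypothetical stand-in for a WAL; not the HBase class. */
  static final class ToyWal {
    private final String prefix;
    private final Consumer<String> onRoll;
    private String currentFile; // null until the first roll

    ToyWal(String prefix, Consumer<String> onRoll) {
      this.prefix = prefix;
      this.onRoll = onRoll;
    }

    /** Models "init calls rollWriter": create the first file and announce it. */
    void init() {
      currentFile = prefix + ".1";
      onRoll.accept(currentFile);
    }

    String currentFile() {
      return currentFile;
    }
  }

  /** Hypothetical stand-in for the provider that owns the peerId -> WAL map. */
  static final class ToyProvider {
    private final Map<String, ToyWal> peerId2Wal = new ConcurrentHashMap<>();
    /** What the replication side saw each time a new file was announced. */
    final List<Boolean> observed = new ArrayList<>();

    /** A file counts as being written only if a WAL in the map currently owns it. */
    boolean isBeingWritten(String file) {
      for (ToyWal wal : peerId2Wal.values()) {
        if (file.equals(wal.currentFile())) {
          return true;
        }
      }
      return false;
    }

    private ToyWal newWal(String peerId) {
      // On every roll, record the answer the replication side would get.
      return new ToyWal(peerId, file -> observed.add(isBeingWritten(file)));
    }

    /** Same order as the snippet above: init first, then put into the map. */
    ToyWal initThenPublish(String peerId) {
      ToyWal wal = newWal(peerId);
      wal.init();
      peerId2Wal.put(peerId, wal);
      return wal;
    }

    /** Reversed order, shown only for contrast. */
    ToyWal publishThenInit(String peerId) {
      ToyWal wal = newWal(peerId);
      peerId2Wal.put(peerId, wal);
      wal.init();
      return wal;
    }
  }

  public static void main(String[] args) {
    ToyProvider a = new ToyProvider();
    a.initThenPublish("1");
    System.out.println("init then publish: beingWritten=" + a.observed.get(0));

    ToyProvider b = new ToyProvider();
    b.publishThenInit("1");
    System.out.println("publish then init: beingWritten=" + b.observed.get(0));
  }
}
```

With init first, the callback runs while the map is still empty, so the model reports the new file as not being written. The reversed order is included only to show that the result depends on the ordering; it is not proposed as the fix, since publishing before init would let other callers see a WAL that is not yet initialized.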
For normal replication, in AbstractFSWAL, if the wal field is null, we
hold the createLock before returning nothing, so it is safe.
Let me think about how to fix this. It should probably be a separate issue.