Apache9 commented on PR #5443:
URL: https://github.com/apache/hbase/pull/5443#issuecomment-1751996650

   ```
   2023-10-08T06:42:20,699 INFO  [RS:0;c85796445aa7:36275 {}] wal.AbstractFSWAL(980): New WAL /user/jenkins/test-data/daf7aa3e-9117-b174-d093-9e8787b5dfb7/WALs/c85796445aa7,36275,1696747321009/c85796445aa7%2C36275%2C1696747321009-1696747340579-1.1696747340583.syncrep
   2023-10-08T06:42:20,708 DEBUG [RS:0;c85796445aa7:36275 {}] regionserver.ReplicationSourceManager(789): Start tracking logs for wal group c85796445aa7%2C36275%2C1696747321009-1696747340579-1 for peer 1
   2023-10-08T06:42:20,709 DEBUG [RS:0;c85796445aa7:36275 {}] regionserver.ReplicationSource(369): peerId=1, starting shipping worker for walGroupId=c85796445aa7%2C36275%2C1696747321009-1696747340579-1
   2023-10-08T06:42:20,709 INFO  [RS:0;c85796445aa7:36275 {}] regionserver.ReplicationSourceWALReader(109): peerClusterZnode=1-c85796445aa7,36275,1696747321009, ReplicationSourceWALReaderThread : 1 inited, replicationBatchSizeCapacity=102400, replicationBatchCountCapacity=25000, replicationBatchQueueCapacity=1
   2023-10-08T06:42:20,710 DEBUG [RS:0;c85796445aa7:36275.replicationSource.wal-reader.c85796445aa7%2C36275%2C1696747321009-1696747340579-1,1-c85796445aa7,36275,1696747321009 {}] regionserver.WALEntryStream(249): Creating new reader hdfs://localhost:32875/user/jenkins/test-data/daf7aa3e-9117-b174-d093-9e8787b5dfb7/WALs/c85796445aa7,36275,1696747321009/c85796445aa7%2C36275%2C1696747321009-1696747340579-1.1696747340583.syncrep, startPosition=0, beingWritten=false
   ```
   
   You can see that the beingWritten flag for a newly created WAL file is
false, which is obviously incorrect.
   I think the problem is specific to sync replication.
   
   ```
     private WAL getRemoteWAL(RegionInfo region, String peerId, String remoteWALDir)
       throws IOException {
       Optional<WAL> opt = peerId2WAL.get(peerId);
       if (opt != null) {
         return opt.orElse(null);
       }
       Lock lock = createLock.acquireLock(peerId);
       try {
         opt = peerId2WAL.get(peerId);
         if (opt != null) {
           return opt.orElse(null);
         }
         WAL wal = createRemoteWAL(region,
           ReplicationUtils.getRemoteWALFileSystem(conf, remoteWALDir),
           ReplicationUtils.getPeerRemoteWALDir(remoteWALDir, peerId),
           getRemoteWALPrefix(peerId), ReplicationUtils.SYNC_WAL_SUFFIX);
         initWAL(wal);
         peerId2WAL.put(peerId, Optional.of(wal));
         return wal;
       } finally {
         lock.unlock();
       }
     }
   ```
   
   Here, we init the WAL before putting it into the peerId2WAL map. WAL.init
calls rollWriter, which inserts the new WAL file into the replication queue.
If something tests whether the file is beingWritten before the WAL has been
put into the peerId2WAL map, it gets false, which is incorrect.
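   
   To make the ordering problem concrete, here is a minimal, self-contained
sketch. It is not HBase code and not a proposed fix: FakeWal, isBeingWritten
and observedDuringInit are hypothetical stand-ins, and only the peerId2WAL name
is borrowed from the method above. It only shows that a check made from inside
init sees different answers depending on whether the WAL was published to the
map first.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

public class PublishOrderSketch {
  /** Hypothetical stand-in for a WAL: init() announces its new file to a listener. */
  static final class FakeWal {
    private final String fileName;

    FakeWal(String fileName) {
      this.fileName = fileName;
    }

    String currentFile() {
      return fileName;
    }

    void init(Consumer<String> rollListener) {
      // Stands in for the roll during init: the new file is announced right away.
      rollListener.accept(fileName);
    }
  }

  // Assumption: models the provider's map of published WALs.
  static final Map<String, Optional<FakeWal>> peerId2WAL = new ConcurrentHashMap<>();

  /** Hypothetical check: a file counts as being written only if a published WAL owns it. */
  static boolean isBeingWritten(String file) {
    return peerId2WAL.values().stream()
      .anyMatch(o -> o.isPresent() && o.get().currentFile().equals(file));
  }

  /** Returns what the listener observed for beingWritten at the moment init announced the file. */
  static boolean observedDuringInit(boolean publishBeforeInit) {
    peerId2WAL.clear();
    FakeWal wal = new FakeWal("peer-1.syncrep");
    boolean[] observed = new boolean[1];
    Consumer<String> listener = f -> observed[0] = isBeingWritten(f);
    if (publishBeforeInit) {
      peerId2WAL.put("1", Optional.of(wal));
      wal.init(listener);
    } else {
      // Same order as getRemoteWAL above: init first, then put.
      wal.init(listener);
      peerId2WAL.put("1", Optional.of(wal));
    }
    return observed[0];
  }

  public static void main(String[] args) {
    System.out.println("init then put: beingWritten=" + observedDuringInit(false));
    System.out.println("put then init: beingWritten=" + observedDuringInit(true));
  }
}
```

   In this sketch the first line prints beingWritten=false and the second
prints beingWritten=true. Publishing before init has its own cost, since other
callers could then see a WAL that is not yet initialized, so the real fix may
well look different.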
   
   For normal replication, in AbstractFSWAL, if the wal field is null, we
hold the createLock before returning an empty result, so it is safe.
   
   Let me think about how to fix this; it should be handled in a separate issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
