shahrs87 commented on code in PR #5521:
URL: https://github.com/apache/hbase/pull/5521#discussion_r1456240779


##########
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java:
##########
@@ -259,10 +259,11 @@ private boolean readNextEntryAndRecordReaderPosition() throws IOException {
     Entry readEntry = reader.next();
     long readerPos = reader.getPosition();
     OptionalLong fileLength;
-    if (logQueue.getQueueSize(walGroupId) > 1) {
+    if (logQueue.getQueueSize(walGroupId) > 2) {

Review Comment:
   > Any updates here?

   @Apache9 Sorry, I could not get back to this thread for a long time; I got distracted and it fell off my radar.

   > Changing to greater than 2 can fix the failing tests? A bit strange, could you please explain more on this?

   Changing to greater than 2 does fix the failing test, but it does not look like the right fix.
   The test does the following (sketched below):
   1. Create a WAL named wal1
   2. Append some entries to wal1
   3. Call entryStream.next to read from wal1
   4. Roll the WAL to wal2
   5. Append some entries to wal2
   6. Call entryStream.next to read from wal2
   7. Assert that the uncleanlyClosedLogs metric was NOT incremented
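   
   A rough sketch of that flow in code (illustrative only; helpers like `appendEntries` and `getUncleanlyClosedWALs` are placeholders, not the exact test source):
   ```
   appendEntries(log, 3);                       // 1-2: create wal1 and append a few entries
   assertNotNull(entryStream.next());           // 3: read from wal1
   log.rollWriter();                            // 4: roll to wal2
   appendEntries(log, 3);                       // 5: append to wal2
   assertNotNull(entryStream.next());           // 6: read from wal2 -- this is where it fails
   assertEquals(0, getUncleanlyClosedWALs());   // 7: the metric must not have moved
   ```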
   
   The test fails at step 6 above. When entryStream.next is called on wal2, the replication code needs to switch the reader to the new WAL file. During rollWriter, we add the old WAL to the `AbstractFSWAL#inflightWALClosures` map and close it asynchronously [here](https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java#L396-L407).
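   
   A rough paraphrase of what that linked block does (not the exact FSHLog source; `asyncCloser` is just an illustrative name for the executor used there):
   ```
   // remember the old writer as "in flight", then close it on another thread
   inflightWALClosures.put(oldPath.getName(), oldWriter);
   asyncCloser.execute(() -> {
     try {
       // closeWriter: appends the WAL trailer, then closes the file
       // (IO errors are handled/logged inside the real method)
       closeWriter(oldWriter, oldPath);
     } finally {
       inflightWALClosures.remove(oldPath.getName());
     }
   });
   ```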
   
   In the closeWriter method, we append the trailer to the WAL and then close it. While closeWriter is executing there are 2 WALs in the logQueue. In WALEntryStream#next, after [this change](https://github.com/apache/hbase/pull/5505/files), we no longer read the file length when the logQueue size is greater than 1, so WALEntryStream is unaware of the trailer bytes, and while switching from wal1 to wal2 it hits the following exception:
   ```
   2024-01-17T10:05:47,247 DEBUG [Listener at localhost/52964] wal.ProtobufLogReader(447): Encountered a malformed edit, seeking back to last good position in file, from 218 to 210
   java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, currentPosition=218
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376) ~[classes/:?]
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) ~[classes/:?]
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82) ~[test-classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95) ~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) ~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) ~[test-classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94) ~[test-classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:726) ~[test-classes/:?]
   ```
   
   I think I know how to fix this.
   From [PR-5505](https://github.com/apache/hbase/pull/5505/files), we have the check below:
   ```
       OptionalLong fileLength;
       if (logQueue.getQueueSize(walGroupId) > 1) {
         fileLength = OptionalLong.empty();
       } else {
         // if there is only one file in queue, check whether it is still being written to
         fileLength = walFileLengthProvider.getLogFileSizeIfBeingWritten(currentPath);
       }
   ```
   Along with checking the queue size, we also have to check whether the WAL currently being replicated is in the AbstractFSWAL#inflightWALClosures map; only when it is NOT there is it safe to skip reading the file length.
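   
   Roughly what I have in mind (just a sketch; `isClosing` is a placeholder for whatever hook we end up exposing, since WALEntryStream cannot reach AbstractFSWAL today):
   ```
       OptionalLong fileLength;
       // skip reading the length only when there are more files behind us
       // AND the current WAL's close (trailer append) is not still in flight
       if (logQueue.getQueueSize(walGroupId) > 1 && !isClosing(currentPath)) {
         fileLength = OptionalLong.empty();
       } else {
         fileLength = walFileLengthProvider.getLogFileSizeIfBeingWritten(currentPath);
       }
   ```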
   
   But currently there is NO way to access the AbstractFSWAL object from WALEntryStream. I found a class named [WALFileLengthProvider](https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALFileLengthProvider.java), but it is a functional interface and, as the name suggests, it only provides the log file size if that file is currently being written.
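   
   For reference, that interface is essentially just (paraphrased):
   ```
   // WALFileLengthProvider, paraphrased: a single abstract method that only
   // reports the length of a WAL file that is still being written to
   @FunctionalInterface
   public interface WALFileLengthProvider {
     OptionalLong getLogFileSizeIfBeingWritten(Path path);
   }
   ```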
   
   @Apache9 Any ideas on how we can expose the AbstractFSWAL object to WALEntryStream?


