[
https://issues.apache.org/jira/browse/HBASE-29480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HBASE-29480:
-----------------------------------
Labels: pull-request-available (was: )
> testCleanClosedWALs fails on branch-2.5
> ---------------------------------------
>
> Key: HBASE-29480
> URL: https://issues.apache.org/jira/browse/HBASE-29480
> Project: HBase
> Issue Type: Bug
> Reporter: Junegunn Choi
> Priority: Minor
> Labels: pull-request-available
>
> h2. Problem
> I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5.
> {noformat}
> [ERROR] Failures:
> [ERROR]
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs
> [ERROR] Run 1:
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
> expected:<0> but was:<1>
> [ERROR] Run 2:
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
> expected:<0> but was:<1>
> [ERROR] Run 3:
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
> expected:<0> but was:<1>
> [ERROR] Run 4:
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
> expected:<0> but was:<1>
> [INFO]
> [ERROR]
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs
> [ERROR] Run 1:
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727
> expected:<0> but was:<1>
> [ERROR] Run 2:
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727
> expected:<0> but was:<1>
> [ERROR] Run 3:
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727
> expected:<0> but was:<1>
> [ERROR] Run 4:
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727
> expected:<0> but was:<1>
> {noformat}
> h2. Workaround
> If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932),
> the tests pass.
> {code:java}
> diff --git
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
>
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> index a997cd0a9d..7c47732259 100644
> ---
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> +++
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> @@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends
> WALEntryStreamTestBase {
> appendToLogAndSync();
> assertNotNull(entryStream.next());
> log.rollWriter();
> + Thread.sleep(10);
> appendToLogAndSync();
> assertNotNull(entryStream.next());
> assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
> {code}
> h2. Analysis
> When the test fails, the following log appears:
> {noformat}
> 2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447):
> Encountered a malformed edit, seeking back to last good position in file,
> from 218 to 210
> java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210,
> currentPosition=218
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376)
> ~[classes/:?]
> at
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104)
> ~[classes/:?]
> at
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92)
> ~[classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259)
> ~[classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181)
> ~[classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102)
> ~[classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111)
> ~[classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118)
> ~[classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82)
> ~[test-classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95)
> ~[test-classes/:?]
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184)
> ~[test-classes/:?]
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135)
> ~[test-classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94)
> ~[test-classes/:?]
> at
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727)
> ~[test-classes/:?]
> {noformat}
> Which led me to this code:
> [https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376]
> The logic here seems to contradict the comment ("only rethrow EOF if it
> indicates an EOF") and throw EOF even when it has reached the WAL trailer.
> (Or maybe I misread the comment?)
> If I change the code to _not_ throw the exception when the WAL trailer is
> found, the test passes.
> While I have a limited understanding of this area, this behavior seems
> reasonable: if we have reached the trailer, there's no more entry, so we can
> just return {{{}false{}}}.
> Any thoughts?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)