[
https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353331#comment-17353331
]
Viraj Jasani commented on HBASE-25932:
--------------------------------------
While closing WAL, AbstractProtobufLogWriter#writeWALTrailer uses
writeWALTrailerAndMagic which would internally go to
AsyncProtobufLogWriter#writeWALTrailerAndMagic and it is an async writer. For a
while, got confused as it uses addListener from our FutureUtils on
*_output.flush(false)_*, just realized that it is actually blocking on flush's
CompletableFuture to retrieve acked length after flushing.
{code:java}
private long write(Consumer<CompletableFuture<Long>> action) throws IOException
{
CompletableFuture<Long> future = new CompletableFuture<>();
action.accept(future);
try {
return future.get().longValue(); <=== blocking
} catch (InterruptedException e) {
InterruptedIOException ioe = new InterruptedIOException();
ioe.initCause(e);
throw ioe;
} catch (ExecutionException e) {
Throwables.propagateIfPossible(e.getCause(), IOException.class);
throw new RuntimeException(e.getCause());
}
}
{code}
I just wanted to ensure that at least _*writer.close()*_ is synchronous which
is executed by closeExecutor.
Looks safe to use inflightWALClosures map.
However, with the patch, few test case might fail. For instance,
testLogrollWhileStreaming() is consistently failing. It passes only when we
wait on entryStream.next() to be non-null.
Rough patch in local to resolve test failure:
{code:java}
diff --git
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestWALEntryStream.java
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestWALEntryStream.java
index 5507972ab4..c67f5f143d 100644
---
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestWALEntryStream.java
+++
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestWALEntryStream.java
@@ -266,7 +266,14 @@ public class TestWALEntryStream {
assertEquals("2", getRow(entryStream.next()));
assertEquals(2, getQueue().size()); // we should not have dequeued yet
since there's still an
// entry in first log
- assertEquals("3", getRow(entryStream.next())); // if implemented
improperly, this would be 4
+ Entry entry;
+ while (true){
+ entry=entryStream.next();
+ if(entry!=null){
+ break;
+ }
+ }
+ assertEquals("3", getRow(entry)); // if implemented improperly, this
would be 4
// and 3 would be skipped
assertEquals("4", getRow(entryStream.next())); // 4
{code}
And it seems for test, it should be safe to wait on entryStream.next() to be
non-null. Ultimately, same is followed by Replication source shipper.
> TestWALEntryStream#testCleanClosedWALs test is failing.
> -------------------------------------------------------
>
> Key: HBASE-25932
> URL: https://issues.apache.org/jira/browse/HBASE-25932
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4
> Reporter: Rushabh Shah
> Assignee: Rushabh Shah
> Priority: Major
> Attachments: HBASE-25932-test-approach.patch
>
>
> We are seeing the following test failure.
> TestWALEntryStream#testCleanClosedWALs
> This test was added in HBASE-25924. I don't think the test failure has
> anything to do with the patch in HBASE-25924.
> Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In
> all the branches, we were not parsing the wal trailer when we close the wal
> reader inside ReplicationSourceWALReader thread. The root cause was when we
> add active WAL to ReplicationSourceWALReader, we cache the file size when the
> wal was being actively written and once the wal was closed and replicated and
> removed from WALEntryStream, we did reset the ProtobufLogReader object but we
> didn't update the length of the wal and that was causing EOF errors since it
> can't find the WALTrailer with the stale wal file length.
> The fix applied nicely to branch-1 since we use FSHlog implementation which
> closes the WAL file sychronously.
> But in branch-2 and master, we use _AsyncFSWAL_ implementation and the
> closing of wal file is done asynchronously (as the name suggests). This is
> causing the test to fail. Below is the test.
> {code:java}
> @Test
> public void testCleanClosedWALs() throws Exception {
> try (WALEntryStream entryStream = new WALEntryStream(
> logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
> assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
> appendToLogAndSync();
> assertNotNull(entryStream.next());
> log.rollWriter(); =======> This does an asynchronous close of wal.
> appendToLogAndSync();
> assertNotNull(entryStream.next());
> assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
> }
> }
> {code}
> In the above code, when we roll writer, we don't close the old wal file
> immediately so the ReplicationReader thread is not able to get the updated
> wal file size and that is throwing EOF errors.
> If I add a sleep of few milliseconds (1 ms in my local env) between
> rollWriter and appendToLogAndSync statement then the test passes but this is
> *not* a proper fix since we are working around the race between
> ReplicationSourceWALReaderThread and closing of WAL file.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)