[jira] [Updated] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.

Rushabh Shah (Jira) Thu, 27 May 2021 06:36:04 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rushabh Shah updated HBASE-25932:
---------------------------------
    Description: 
We are seeing the following test failure. TestWALEntryStream#testCleanClosedWALs
 This test was added in HBASE-25924. I don't think the test failure has 
anything to do with the patch in HBASE-25924.
 Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In 
all the branches, we were not parsing the wal trailer when we close the wal 
reader inside ReplicationSourceWALReader thread. The root cause was when we add 
active WAL to ReplicationSourceWALReader, we cache the file size when the wal 
was being actively written and once the wal was closed and replicated and 
removed from WALEntryStream, we did reset the ProtobufLogReader object but we 
didn't update the length of the wal and that was causing EOF errors since it 
can't find the WALTrailer with the stale wal file length.

The fix applied nicely to branch-1 since we use FSHlog implementation which 
closes the WAL file sychronously.

But in branch-2 and master, we use _AsyncFSWAL_ implementation and the closing 
of wal file is done asynchronously (as the name suggests). This is causing the 
test to fail. Below is the test.
{code:java}
  @Test
  public void testCleanClosedWALs() throws Exception {
    try (WALEntryStream entryStream = new WALEntryStream(
      logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
      assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
      appendToLogAndSync();
      assertNotNull(entryStream.next());
      log.rollWriter();  =======> This does an asynchronous close of wal.
      appendToLogAndSync();
      assertNotNull(entryStream.next());
      assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
    }
  }
{code}
In the above code, when we roll writer, we don't close the old wal file 
immediately so the ReplicationReader thread is not able to get the updated wal 
file size and that is throwing EOF errors.
 If I add a sleep of few milliseconds (1 ms in my local env) between rollWriter 
and appendToLogAndSync statement then the test passes but this is *not* a 
proper fix since we are working around the race between 
ReplicationSourceWALReaderThread and closing of WAL file.

  was:
We are seeing the following test failure. TestWALEntryStream#testCleanClosedWALs
 This test was added in HBASE-25924. I don't think the test failure has 
anything to do with the patch in HBASE-25924.
 Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In 
all the branches, we were not parsing the wal trailer when we close the wal 
reader inside ReplicationSourceWALReader thread. The root cause was when we add 
active WAL to ReplicationSourceWALReader, we cache the file size when the wal 
was being actively written and once the wal was closed and replicated and 
removed from WALEntryStream, we did reset the ProtobufLogReader object but we 
didn't update the length of the wal and that was causing EOF errors since it 
can't find the WALTrailer with the stale wal file length.

The fix applied nicely to branch-1 since we use FSHlog implementation which 
closes the WAL file sychronously.

But in branch-2 and master, we use _AsyncFSWAL_ implementation and the closing 
of wal file is done asynchronously (as the name suggests). This is causing the 
test to fail. Below is the test.
{code:java}
  @Test
  public void testCleanClosedWALs() throws Exception {
    try (WALEntryStream entryStream = new WALEntryStream(
      logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
      assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
      appendToLogAndSync();
      assertNotNull(entryStream.next());
      log.rollWriter();  =======> This does an asynchronous close of wal.
      appendToLogAndSync();
      assertNotNull(entryStream.next());
      assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
    }
  }
{code}
In the above code, when we roll writer, we don't close the old wal file 
immediately so the ReplicationReader thread is not able to get the updated wal 
file size and that is throwing EOF errors.
 If I added a sleep of few milliseconds (1 ms in my local env) between 
rollWriter and appendToLogAndSync statement then the test passes but this is 
*not* a proper fix since we are working around the race between 
ReplicationSourceWALReaderThread and closing of WAL file.


> TestWALEntryStream#testCleanClosedWALs test is failing.
> -------------------------------------------------------
>
>                 Key: HBASE-25932
>                 URL: https://issues.apache.org/jira/browse/HBASE-25932
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>
> We are seeing the following test failure. 
> TestWALEntryStream#testCleanClosedWALs
>  This test was added in HBASE-25924. I don't think the test failure has 
> anything to do with the patch in HBASE-25924.
>  Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In 
> all the branches, we were not parsing the wal trailer when we close the wal 
> reader inside ReplicationSourceWALReader thread. The root cause was when we 
> add active WAL to ReplicationSourceWALReader, we cache the file size when the 
> wal was being actively written and once the wal was closed and replicated and 
> removed from WALEntryStream, we did reset the ProtobufLogReader object but we 
> didn't update the length of the wal and that was causing EOF errors since it 
> can't find the WALTrailer with the stale wal file length.
> The fix applied nicely to branch-1 since we use FSHlog implementation which 
> closes the WAL file sychronously.
> But in branch-2 and master, we use _AsyncFSWAL_ implementation and the 
> closing of wal file is done asynchronously (as the name suggests). This is 
> causing the test to fail. Below is the test.
> {code:java}
>   @Test
>   public void testCleanClosedWALs() throws Exception {
>     try (WALEntryStream entryStream = new WALEntryStream(
>       logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
>       assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
>       appendToLogAndSync();
>       assertNotNull(entryStream.next());
>       log.rollWriter();  =======> This does an asynchronous close of wal.
>       appendToLogAndSync();
>       assertNotNull(entryStream.next());
>       assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
>     }
>   }
> {code}
> In the above code, when we roll writer, we don't close the old wal file 
> immediately so the ReplicationReader thread is not able to get the updated 
> wal file size and that is throwing EOF errors.
>  If I add a sleep of few milliseconds (1 ms in my local env) between 
> rollWriter and appendToLogAndSync statement then the test passes but this is 
> *not* a proper fix since we are working around the race between 
> ReplicationSourceWALReaderThread and closing of WAL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.

Reply via email to