[
https://issues.apache.org/jira/browse/HBASE-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723527#comment-16723527
]
Sergey Shelukhin commented on HBASE-21564:
------------------------------------------
Looks like this test is sensitive to WAL roll timing. After the patch, for
whatever reason, one of the WALs is rolled while it already contains some data
(after the start of the test?), and some entries in it get replicated to
cluster "2", but not to the "test..." cluster. A variable number of entries
gets lost because cleanOldLogs cleans this WAL out before the test calls
addPeer.
I'm not sure yet whether this is a test issue or a replication code issue; I'm
still looking at some debug logs I've added.
E.g.
{noformat}
before (13 and 13 entries)
2018-12-16 13:17:00,224 INFO [regionserver/localhost:0.logRoller]
wal.AbstractFSWAL(677): Rolled WAL
/user/root/test-data/0c61bca6-2e0e-ca87-eda3-8ec33cf82797/WALs/localhost,44577,1544994976293/localhost%2C44577%2C1544994976293.1544994984170
with entries=13, filesize=3.36 KB; new WAL
/user/root/test-data/0c61bca6-2e0e-ca87-eda3-8ec33cf82797/WALs/localhost,44577,1544994976293/localhost%2C44577%2C1544994976293.1544995019967
2018-12-16 13:17:00,399 INFO [regionserver/localhost:0.logRoller]
wal.AbstractFSWAL(677): Rolled WAL
/user/root/test-data/0c61bca6-2e0e-ca87-eda3-8ec33cf82797/WALs/localhost,43414,1544994976487/localhost%2C43414%2C1544994976487.1544994984180
with entries=13, filesize=3.35 KB; new WAL
/user/root/test-data/0c61bca6-2e0e-ca87-eda3-8ec33cf82797/WALs/localhost,43414,1544994976487/localhost%2C43414%2C1544994976487.1544995020042
after (111 and 13 entries)
2018-12-16 14:07:07,832 INFO [regionserver/localhost:0.logRoller]
wal.AbstractFSWAL(677): Rolled WAL
/user/root/test-data/4595b84b-0ce4-55ad-2e80-4feeb237f853/WALs/localhost,40797,1544997970689/localhost%2C40797%2C1544997970689.1544997979748
with entries=111, filesize=12.25 KB; new WAL
/user/root/test-data/4595b84b-0ce4-55ad-2e80-4feeb237f853/WALs/localhost,40797,1544997970689/localhost%2C40797%2C1544997970689.1544998027469
2018-12-16 14:07:08,668 INFO [regionserver/localhost:0.logRoller]
wal.AbstractFSWAL(677): Rolled WAL
/user/root/test-data/4595b84b-0ce4-55ad-2e80-4feeb237f853/WALs/localhost,46656,1544997971250/localhost%2C46656%2C1544997971250.1544997979723
with entries=13, filesize=3.36 KB; new WAL
/user/root/test-data/4595b84b-0ce4-55ad-2e80-4feeb237f853/WALs/localhost,46656,1544997971250/localhost%2C46656%2C1544997971250.1544998028193
{noformat}
In the 2nd failed case, I see 1200 entries replicated to cluster "2" created
by the base test class (or leaked by some other test, not sure), but only 1102
replicated to "testInterClusterReplication" - the 111 minus 13 = 98 entries
from the rolled WAL are never replicated to it.
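To make the ordering concrete, here is a minimal, self-contained model of what
I think is happening: entries land in a WAL that is rolled and then removed by
cleanOldLogs before any peer references it, so a peer added afterwards never
sees them. The class and method names below (WalCleanBeforeAddPeerDemo,
rollWal, addPeer, cleanOldLogs, entriesVisibleTo) are simplified stand-ins,
not the actual HBase replication code.
{code:java}
// Stand-alone model of the ordering problem; not HBase's actual classes.
import java.util.*;

public class WalCleanBeforeAddPeerDemo {
  // WAL files still on disk, oldest first, each with an entry count.
  private final LinkedHashMap<String, Integer> walFiles = new LinkedHashMap<>();
  // Per-peer replication queue: WAL files the peer still has to ship.
  private final Map<String, List<String>> peerQueues = new HashMap<>();
  private String currentWal;

  void rollWal(String newWal) {
    walFiles.put(newWal, 0);
    // Existing peers pick up the new WAL; peers added later only see the
    // WAL that is current at addPeer time.
    for (List<String> queue : peerQueues.values()) {
      queue.add(newWal);
    }
    currentWal = newWal;
  }

  void append(int entries) {
    walFiles.merge(currentWal, entries, Integer::sum);
  }

  void addPeer(String peerId) {
    // Simplification: a new peer starts from the current WAL only.
    peerQueues.put(peerId, new ArrayList<>(List.of(currentWal)));
  }

  void cleanOldLogs() {
    // A WAL that is not current and is not referenced by any peer queue is
    // deleted -- together with the entries it still holds.
    walFiles.keySet().removeIf(wal ->
        !wal.equals(currentWal)
            && peerQueues.values().stream().noneMatch(q -> q.contains(wal)));
  }

  int entriesVisibleTo(String peerId) {
    return peerQueues.get(peerId).stream()
        .mapToInt(w -> walFiles.getOrDefault(w, 0)).sum();
  }

  public static void main(String[] args) {
    WalCleanBeforeAddPeerDemo wal = new WalCleanBeforeAddPeerDemo();
    wal.rollWal("wal-1");
    wal.append(98);          // entries written before the test adds its peer
    wal.rollWal("wal-2");    // roll with data still sitting in wal-1
    wal.cleanOldLogs();      // wal-1 removed: no peer references it yet
    wal.addPeer("testPeer"); // too late -- the 98 entries are gone for this peer
    wal.append(13);
    System.out.println("entries the peer can replicate: "
        + wal.entriesVisibleTo("testPeer")); // 13, not 111
  }
}
{code}
In this simplified model the entries written before addPeer are simply
invisible to the late-added peer, which matches the 111 minus 13 = 98 missing
entries above.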
> race condition in WAL rolling resulting in size-based rolling getting stuck
> ---------------------------------------------------------------------------
>
> Key: HBASE-21564
> URL: https://issues.apache.org/jira/browse/HBASE-21564
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Attachments: HBASE-21564.master.001.patch,
> HBASE-21564.master.002.patch, HBASE-21564.master.003.patch,
> HBASE-21564.master.004.patch
>
>
> Manifests at least with AsyncFsWriter.
> There's a window after LogRoller replaces the writer in the WAL, but before
> it sets the rollLog boolean back to false in the finally block, during which
> the WAL class can request another log roll (this can happen in particular
> when the logs are being archived in the LogRoller thread and a high write
> volume is causing the logs to roll quickly).
> LogRoller then blindly resets the rollLog flag in the finally block and
> "forgets" about this request.
> AsyncWAL in turn never requests the roll again, because its own rollRequested
> field is already set and it expects a callback. Logs don't get rolled until a
> periodic roll is triggered after that.
> The acknowledgment of roll requests by LogRoller should be atomic.
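For the fix direction in the description ("the acknowledgment of roll requests
by LogRoller should be atomic"), here is a minimal sketch of one way such an
acknowledgment can be made atomic, using a request counter instead of a plain
boolean. The names (RollRequests, requestRoll, rollIfRequested) are
illustrative only and are not the actual LogRoller/AbstractFSWAL code or the
attached patch.
{code:java}
// Illustrative sketch only -- not the actual LogRoller/AbstractFSWAL code.
// A monotonically increasing request counter makes the acknowledgment atomic:
// the roller only acknowledges requests up to the value it observed before
// rolling, so a request that arrives while the writer is being replaced is
// still pending afterwards instead of being wiped out by a blanket reset.
import java.util.concurrent.atomic.AtomicLong;

public class RollRequests {
  private final AtomicLong requested = new AtomicLong(); // bumped by the WAL
  private long acknowledged;                             // owned by the roller thread

  /** Called by the WAL whenever it wants a roll (e.g. size threshold reached). */
  public void requestRoll() {
    requested.incrementAndGet();
  }

  /** Called by the roller thread; returns true if a roll was performed. */
  public boolean rollIfRequested(Runnable doRoll) {
    long observed = requested.get();
    if (observed == acknowledged) {
      return false; // nothing pending
    }
    doRoll.run();            // replace the writer, archive old logs, etc.
    acknowledged = observed; // acknowledge only what was seen *before* rolling;
                             // a request made during doRoll keeps observed < requested
    return true;
  }

  public static void main(String[] args) {
    RollRequests rolls = new RollRequests();
    rolls.requestRoll();
    // A new request sneaks in while the roll is in progress -- the window
    // described in this issue. With a plain boolean reset in finally it would
    // be lost; with the counter it stays pending.
    boolean rolled = rolls.rollIfRequested(rolls::requestRoll);
    System.out.println("first roll performed: " + rolled);
    System.out.println("leftover request triggered another roll: "
        + rolls.rollIfRequested(() -> {}));
  }
}
{code}
Because the roller only acknowledges requests it observed before swapping the
writer, a request that arrives during the roll stays pending and triggers
another roll instead of being silently dropped.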
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)