[
https://issues.apache.org/jira/browse/HBASE-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723527#comment-16723527
]
Sergey Shelukhin commented on HBASE-21564:
------------------------------------------
Looks like this test is sensitive to WAL roll timing. After the patch, for
whatever reason, one of the WALs is rolled while it already contains some data
(after the start of the test?), and some entries in it get replicated to
cluster "2", but not to the "test..." cluster. A variable number of entries
gets lost because cleanOldLogs cleans this WAL out before the test calls
addPeer.
I'm not sure yet whether this is a test issue or a replication code issue; I'm
still looking at some debug logs I've added.
E.g.
{noformat}
before (13 and 13 entries)
2018-12-16 13:17:00,224 INFO [regionserver/localhost:0.logRoller]
wal.AbstractFSWAL(677): Rolled WAL
/user/root/test-data/0c61bca6-2e0e-ca87-eda3-8ec33cf82797/WALs/localhost,44577,1544994976293/localhost%2C44577%2C1544994976293.1544994984170
with entries=13, filesize=3.36 KB; new WAL
/user/root/test-data/0c61bca6-2e0e-ca87-eda3-8ec33cf82797/WALs/localhost,44577,1544994976293/localhost%2C44577%2C1544994976293.1544995019967
2018-12-16 13:17:00,399 INFO [regionserver/localhost:0.logRoller]
wal.AbstractFSWAL(677): Rolled WAL
/user/root/test-data/0c61bca6-2e0e-ca87-eda3-8ec33cf82797/WALs/localhost,43414,1544994976487/localhost%2C43414%2C1544994976487.1544994984180
with entries=13, filesize=3.35 KB; new WAL
/user/root/test-data/0c61bca6-2e0e-ca87-eda3-8ec33cf82797/WALs/localhost,43414,1544994976487/localhost%2C43414%2C1544994976487.1544995020042
after (111 and 13 entries)
2018-12-16 14:07:07,832 INFO [regionserver/localhost:0.logRoller]
wal.AbstractFSWAL(677): Rolled WAL
/user/root/test-data/4595b84b-0ce4-55ad-2e80-4feeb237f853/WALs/localhost,40797,1544997970689/localhost%2C40797%2C1544997970689.1544997979748
with entries=111, filesize=12.25 KB; new WAL
/user/root/test-data/4595b84b-0ce4-55ad-2e80-4feeb237f853/WALs/localhost,40797,1544997970689/localhost%2C40797%2C1544997970689.1544998027469
2018-12-16 14:07:08,668 INFO [regionserver/localhost:0.logRoller]
wal.AbstractFSWAL(677): Rolled WAL
/user/root/test-data/4595b84b-0ce4-55ad-2e80-4feeb237f853/WALs/localhost,46656,1544997971250/localhost%2C46656%2C1544997971250.1544997979723
with entries=13, filesize=3.36 KB; new WAL
/user/root/test-data/4595b84b-0ce4-55ad-2e80-4feeb237f853/WALs/localhost,46656,1544997971250/localhost%2C46656%2C1544997971250.1544998028193
{noformat}
In the 2nd failed case, I see 1200 entries replicated to cluster "2" created
by the base test class (or leaked by some other test, not sure), but only 1102
replicated to "testInterClusterReplication" - the 111 minus 13 = 98 entries
from the rolled WAL are never replicated to it.
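To make the ordering concrete, here is a minimal, self-contained model of what
I think is happening: entries land in a WAL that is rolled and then removed by
cleanOldLogs before any peer references it, so a peer added afterwards never
sees them. The class and method names below (WalCleanBeforeAddPeerDemo,
rollWal, addPeer, cleanOldLogs, entriesVisibleTo) are simplified stand-ins,
not the actual HBase replication code.
{code:java}
// Stand-alone model of the ordering problem; not HBase's actual classes.
import java.util.*;

public class WalCleanBeforeAddPeerDemo {
  // WAL files still on disk, oldest first, each with an entry count.
  private final LinkedHashMap<String, Integer> walFiles = new LinkedHashMap<>();
  // Per-peer replication queue: WAL files the peer still has to ship.
  private final Map<String, List<String>> peerQueues = new HashMap<>();
  private String currentWal;

  void rollWal(String newWal) {
    walFiles.put(newWal, 0);
    // Existing peers pick up the new WAL; peers added later only see the
    // WAL that is current at addPeer time.
    for (List<String> queue : peerQueues.values()) {
      queue.add(newWal);
    }
    currentWal = newWal;
  }

  void append(int entries) {
    walFiles.merge(currentWal, entries, Integer::sum);
  }

  void addPeer(String peerId) {
    // Simplification: a new peer starts from the current WAL only.
    peerQueues.put(peerId, new ArrayList<>(List.of(currentWal)));
  }

  void cleanOldLogs() {
    // A WAL that is not current and is not referenced by any peer queue is
    // deleted -- together with the entries it still holds.
    walFiles.keySet().removeIf(wal ->
        !wal.equals(currentWal)
            && peerQueues.values().stream().noneMatch(q -> q.contains(wal)));
  }

  int entriesVisibleTo(String peerId) {
    return peerQueues.get(peerId).stream()
        .mapToInt(w -> walFiles.getOrDefault(w, 0)).sum();
  }

  public static void main(String[] args) {
    WalCleanBeforeAddPeerDemo wal = new WalCleanBeforeAddPeerDemo();
    wal.rollWal("wal-1");
    wal.append(98);          // entries written before the test adds its peer
    wal.rollWal("wal-2");    // roll with data still sitting in wal-1
    wal.cleanOldLogs();      // wal-1 removed: no peer references it yet
    wal.addPeer("testPeer"); // too late -- the 98 entries are gone for this peer
    wal.append(13);
    System.out.println("entries the peer can replicate: "
        + wal.entriesVisibleTo("testPeer")); // 13, not 111
  }
}
{code}
In this simplified model the entries written before addPeer are simply
invisible to the late-added peer, which matches the 111 minus 13 = 98 missing
entries above.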
> race condition in WAL rolling resulting in size-based rolling getting stuck
> ---------------------------------------------------------------------------
>
> Key: HBASE-21564
> URL: https://issues.apache.org/jira/browse/HBASE-21564
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Attachments: HBASE-21564.master.001.patch,
> HBASE-21564.master.002.patch, HBASE-21564.master.003.patch,
> HBASE-21564.master.004.patch
>
>
> Manifests at least with AsyncFsWriter.
> There's a window after LogRoller replaces the writer in the WAL, but before
> it sets the rollLog boolean back to false in the finally block, during which
> the WAL class can request another log roll (this can happen in particular
> when the logs are being archived in the LogRoller thread and a high write
> volume is causing the logs to roll quickly).
> LogRoller then blindly resets the rollLog flag in the finally block and
> "forgets" about this request.
> AsyncWAL in turn never requests the roll again, because its own rollRequested
> field is already set and it expects a callback. Logs don't get rolled until a
> periodic roll is triggered after that.
> The acknowledgment of roll requests by LogRoller should be atomic.
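For the fix direction in the description ("the acknowledgment of roll requests
by LogRoller should be atomic"), here is a minimal sketch of one way such an
acknowledgment can be made atomic, using a request counter instead of a plain
boolean. The names (RollRequests, requestRoll, rollIfRequested) are
illustrative only and are not the actual LogRoller/AbstractFSWAL code or the
attached patch.
{code:java}
// Illustrative sketch only -- not the actual LogRoller/AbstractFSWAL code.
// A monotonically increasing request counter makes the acknowledgment atomic:
// the roller only acknowledges requests up to the value it observed before
// rolling, so a request that arrives while the writer is being replaced is
// still pending afterwards instead of being wiped out by a blanket reset.
import java.util.concurrent.atomic.AtomicLong;

public class RollRequests {
  private final AtomicLong requested = new AtomicLong(); // bumped by the WAL
  private long acknowledged;                             // owned by the roller thread

  /** Called by the WAL whenever it wants a roll (e.g. size threshold reached). */
  public void requestRoll() {
    requested.incrementAndGet();
  }

  /** Called by the roller thread; returns true if a roll was performed. */
  public boolean rollIfRequested(Runnable doRoll) {
    long observed = requested.get();
    if (observed == acknowledged) {
      return false; // nothing pending
    }
    doRoll.run();            // replace the writer, archive old logs, etc.
    acknowledged = observed; // acknowledge only what was seen *before* rolling;
                             // a request made during doRoll keeps observed < requested
    return true;
  }

  public static void main(String[] args) {
    RollRequests rolls = new RollRequests();
    rolls.requestRoll();
    // A new request sneaks in while the roll is in progress -- the window
    // described in this issue. With a plain boolean reset in finally it would
    // be lost; with the counter it stays pending.
    boolean rolled = rolls.rollIfRequested(rolls::requestRoll);
    System.out.println("first roll performed: " + rolled);
    System.out.println("leftover request triggered another roll: "
        + rolls.rollIfRequested(() -> {}));
  }
}
{code}
Because the roller only acknowledges requests it observed before swapping the
writer, a request that arrives during the roll stays pending and triggers
another roll instead of being silently dropped.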
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)