[ https://issues.apache.org/jira/browse/HBASE-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398357#comment-17398357 ]

Andrew Kyle Purtell commented on HBASE-26195:
---------------------------------------------

Approach 3 is the correct approach. The other approaches are insufficient and 
should not be considered. By rolling back the memstore we move away from the 
WAL as the source of truth. 

Append and sync are two separate operations. Without a successful sync we 
cannot know whether the append has been made durable and visible to replication 
WAL tailers. Maybe it has, maybe it has not. We have seen the source and sink 
sides of a replication edge become inconsistent because of this scenario, 
triggered by a namenode failover. However, it seems possible, if rare, in any 
high-load scenario. 
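
To illustrate the failure window, here is a toy model of the append/sync split 
(not the actual FSHLog code; all names are made up, and the durations are 
scaled down to seconds from the roughly 6-minute sync versus 5-minute timeout 
reported below):

{code:java}
import java.util.concurrent.*;

// Toy model, not HBase code: "append" hands the edit to an asynchronous
// writer and returns; only a completed sync proves durability. A timed-out
// sync is not a failed sync, so the edit may still reach the WAL later and
// be picked up by a replication tailer.
public class AppendSyncSketch {
    private static final ExecutorService SYNC_POOL =
        Executors.newSingleThreadExecutor();

    // The returned future completes only when the stand-in for hflush/hsync
    // finishes, however long that takes.
    static CompletableFuture<Void> appendAndSync(String edit) {
        return CompletableFuture.runAsync(() -> {
            try {
                TimeUnit.SECONDS.sleep(6); // a slow HDFS write pipeline
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }, SYNC_POOL);
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> sync = appendAndSync("row-1");
        try {
            sync.get(5, TimeUnit.SECONDS); // stand-in for the sync timeout
            System.out.println("sync succeeded: the append is durable");
        } catch (TimeoutException te) {
            // Neither success nor failure is known here: the write can still
            // become durable (and replicate) after the caller has given up.
            System.out.println("sync timed out: durability unknown");
        } finally {
            SYNC_POOL.shutdownNow();
        }
    }
}
{code}

The key point is that a timed-out sync is not a cancelled write: the underlying 
hflush/hsync keeps running and can still succeed after the handler has given up.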

We are only guaranteed to be fully consistent on both the source and sink sides 
of a replication edge if we see both the append and the sync succeed, or if we 
abort the regionserver when either the append or the sync fails. When we abort, 
we cause the system to fall back to the WAL as the source of truth, for both 
the source and the sink sides.
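
For concreteness, a minimal sketch of what approach 3 amounts to; every 
interface and name below is a hypothetical stand-in, not the actual HRegion or 
WAL code:

{code:java}
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Minimal sketch of "approach 3": if either the WAL append or the WAL sync
// fails (or times out), do not roll back the memstore; abort the
// regionserver so that WAL replay restores the WAL as the source of truth
// on both the source and the sink side of replication.
// Every name below is a hypothetical stand-in, not actual HBase internals.
public class AbortOnSyncFailureSketch {

    interface Wal {
        long append(byte[] edit) throws IOException;
        void sync(long txid) throws IOException, TimeoutException;
    }

    interface Server {
        void abort(String why, Throwable cause);
    }

    static void applyMutation(Wal wal, Server server, byte[] edit) {
        try {
            long txid = wal.append(edit);
            wal.sync(txid);
            // Only now is it safe to ack the client: the edit is durable,
            // and the replication tailer will ship exactly what the client
            // was told was written.
        } catch (IOException | TimeoutException e) {
            // Do NOT roll back the memstore here. Rolling back silently
            // diverges from a WAL entry that may still become durable and
            // get replicated. Aborting instead forces WAL replay, keeping
            // the WAL as the single source of truth.
            server.abort("WAL append/sync failed; aborting to keep the WAL "
                + "as the source of truth", e);
        }
    }
}
{code}

The abort is deliberately coarse: replaying the WAL on region reopen rebuilds 
the memstore from exactly the edits a replication tailer can see.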

> Data is present in replicated cluster but not present in primary cluster.
> -------------------------------------------------------------------------
>
>                 Key: HBASE-26195
>                 URL: https://issues.apache.org/jira/browse/HBASE-26195
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>
> We encountered a case where some rows are visible (via Phoenix) in the 
> replicated cluster but are not present in the source/active cluster.
> Triaging further, we found memstore rollback logs in a few of the region servers.
> {noformat}
> 2021-07-28 14:17:59,353 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,353 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,355 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,355 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,356 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> {noformat}
> Looking more into the logs, we found that there were some HDFS-layer issues 
> syncing the WAL to HDFS.
> It was taking around 6 minutes to sync the WAL. Logs below:
> {noformat}
> 2021-07-28 14:19:30,511 WARN  [sync.0] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391210ms (threshold=30000ms). File being written: 
> /hbase/WALs/<rs-node>,60020,1626191371499/<rs-node>%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176-<namenode>-1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,589 WARN  [sync.1] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391148ms (threshold=30000ms). File being written: 
> /hbase/WALs/<rs-node>,60020,1626191371499/<rs-node>%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176-<namenode>-1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,589 WARN  [sync.2] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391147ms (threshold=30000ms). File being written: 
> /hbase/WALs/<rs-node>,60020,1626191371499/<rs-node>%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176-<namenode>-1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,591 INFO  [sync.0] wal.FSHLog - Slow sync cost: 391289 
> ms, current pipeline: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 INFO  [sync.1] wal.FSHLog - Slow sync cost: 391227 
> ms, current pipeline: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 WARN  [sync.1] wal.FSHLog - Requesting log roll 
> because we exceeded slow sync threshold; time=391227 ms, threshold=10000 ms, 
> current pipeline: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 INFO  [sync.2] wal.FSHLog - Slow sync cost: 391227 
> ms, current pipeline: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 WARN  [sync.2] wal.FSHLog - Requesting log roll 
> because we exceeded slow sync threshold; time=391227 ms, threshold=10000 ms, 
> current pipeline: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 WARN  [sync.3] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391217ms (threshold=30000ms). File being written: 
> /hbase/WALs/<rs-node>,60020,1626191371499/<rs-node>%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176-<namenode>-1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,591 INFO  [sync.3] wal.FSHLog - Slow sync cost: 391217 
> ms, current pipeline: 
> [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> {noformat}
> Just to add more detail, we also found that there was a namenode failover. It 
> might be a contributing factor to why the syncs were slow, but we are not sure.
> {noformat}
> 2021-07-28 14:19:30,390 WARN  [<rs-host>:60020-3] 
> retry.RetryInvocationHandler - A failover has occurred since the start of 
> call #8491727 ClientNamenodeProtocolTranslatorPB.mkdirs over 
> <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,390 WARN  [<rs-host>:60020-7] 
> retry.RetryInvocationHandler - A failover has occurred since the start of 
> call #8491726 ClientNamenodeProtocolTranslatorPB.mkdirs over 
> <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [20%2C1626191371499,1] 
> retry.RetryInvocationHandler - A failover has occurred since the start of 
> call #8491723 ClientNamenodeProtocolTranslatorPB.getBlockLocations over 
> <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [0.40:60020.logRoller] 
> retry.RetryInvocationHandler - A failover has occurred since the start of 
> call #8491729 ClientNamenodeProtocolTranslatorPB.getFileInfo over 
> <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [MemStoreFlusher.1] 
> retry.RetryInvocationHandler - A failover has occurred since the start of 
> call #8491730 ClientNamenodeProtocolTranslatorPB.getFileInfo over 
> <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [371499.1627480615620] 
> retry.RetryInvocationHandler - A failover has occurred since the start of 
> call #8491731 ClientNamenodeProtocolTranslatorPB.addBlock over 
> <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [c@prod] retry.RetryInvocationHandler - A 
> failover has occurred since the start of call #8491725 
> ClientNamenodeProtocolTranslatorPB.renewLease over 
> <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [91371499@<Replicatonsource-thread>] 
> retry.RetryInvocationHandler - A failover has occurred since the start of 
> call #8491724 ClientNamenodeProtocolTranslatorPB.getBlockLocations over 
> <namenode-hostname>/<namenode-ip>:8020
> {noformat}
> We use the default value of hbase.regionserver.wal.sync.timeout, which is 5 
> minutes.
> Since the sync took more than 6 minutes (longer than 
> hbase.regionserver.wal.sync.timeout), the client received an exception and we 
> rolled back the mutations in the memstore on the primary cluster.
> But the HDFS sync +eventually+ succeeded, so the row was present in the WAL 
> and was eligible for replication.
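> To make the timeout concrete, here is a minimal sketch (hypothetical class 
> name; assumes a Hadoop client on the classpath) of how that default would be 
> read:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> // Hypothetical sketch: reading hbase.regionserver.wal.sync.timeout with its
> // 5-minute (300000 ms) default. In this incident the sync took over 6
> // minutes, so the handler gave up and the memstore was rolled back while
> // the HDFS sync was still in flight and later succeeded.
> public class WalSyncTimeoutSketch {
>     public static void main(String[] args) {
>         Configuration conf = new Configuration();
>         long syncTimeoutMs =
>             conf.getLong("hbase.regionserver.wal.sync.timeout", 5 * 60 * 1000L);
>         System.out.println("WAL sync timeout (ms): " + syncTimeoutMs);
>     }
> }
> {code}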
> [~bharathv] [[email protected]] [~gjacoby] helped brainstorm a possible 
> solution. Will update the possible approaches in the comments section.


