[jira] [Commented] (HBASE-18137) Replication gets stuck for empty WALs

Vincent Poon (JIRA) Wed, 07 Jun 2017 22:54:02 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042239#comment-16042239
 ]


Vincent Poon commented on HBASE-18137:
--------------------------------------

[~anoop.hbase] The master code is completely different, as the logic has been 
refactored and moved to ReplicationSourceWALReaderThread. We would basically 
need to do something similar there: if sleepMultiplier hits the max due to 
EOFException, we can force the WALEntryStream to move on to the next log in 
the queue.

Thinking on this some more, we can add an additional check of the file length, 
and only consider dumping the log if the length is 0. Is it possible for DFS to 
report a length of 0 when there's actually data somewhere?
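
To make the proposed condition concrete, here is a minimal sketch of the decision only, not the actual patch. The class name, the method shouldSkipWal, the moreLogsQueued flag and the cap value are all hypothetical; in HBase the cap would come from configuration and the length from the DFS file status. Whether a reported length of 0 can be trusted is exactly the open question above, so the sketch treats that length as an input rather than as ground truth.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical helper; none of these names exist in HBase.
final class EmptyWalSkipSketch {

    // Placeholder cap on the backoff multiplier (assumption; the real value is configured).
    static final int MAX_SLEEP_MULTIPLIER = 300;

    /**
     * Returns true only when every condition discussed in the comment holds:
     * the open failed with EOFException, the backoff multiplier has reached
     * its cap, the file system reports a length of 0, and (an added
     * assumption) there is a later log in the queue to move on to.
     */
    static boolean shouldSkipWal(IOException cause, int sleepMultiplier,
                                 long reportedLength, boolean moreLogsQueued) {
        boolean eof = cause instanceof EOFException;
        boolean backoffExhausted = sleepMultiplier >= MAX_SLEEP_MULTIPLIER;
        boolean reportedEmpty = reportedLength == 0L;
        return eof && backoffExhausted && reportedEmpty && moreLogsQueued;
    }

    public static void main(String[] args) {
        // Empty WAL in the middle of a recovered queue, retries exhausted.
        System.out.println(shouldSkipWal(new EOFException(), MAX_SLEEP_MULTIPLIER, 0L, true));
        // Same failure, but the file reports data: keep retrying instead of skipping.
        System.out.println(shouldSkipWal(new EOFException(), MAX_SLEEP_MULTIPLIER, 128L, true));
    }
}
```

Requiring all four conditions keeps the skip conservative: a non-empty WAL that fails with EOFException for some other reason is never dropped by this check.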

> Replication gets stuck for empty WALs
> -------------------------------------
>
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.1.11, 1.2.7
>
>         Attachments: HBASE-18137.branch-1.3.v1.patch
>
>
> Replication assumes that only the last WAL of a recovered queue can be empty. 
> But intermittent DFS issues may cause empty WALs to be created (without the 
> PWAL magic), and a WAL roll to happen without a regionserver crash. This 
> will cause recovered queues to have empty WALs in the middle. This causes 
> replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got: java.io.EOFException
>       at java.io.DataInputStream.readFully(DataInputStream.java:197)
>       at java.io.DataInputStream.readFully(DataInputStream.java:169)
>       at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
>       at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>       at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty but there were other WALs in the 
> recovered queue which were newer and non-empty.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
