[
https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vincent Poon updated HBASE-18137:
---------------------------------
Release Note: 0-length WAL files can potentially cause the replication
queue to get stuck. A new config "replication.source.eof.autorecovery" has
been added: if set to true (default is false), the 0-length WAL file will be
skipped after 1) the max number of retries has been hit, and 2) there are more
WAL files in the queue. The risk of enabling this is that there is a chance
the 0-length WAL file actually has some data (e.g. block went missing and will
come back once a datanode is recovered). (was: 0-length WAL files can
potentially cause the replication queue to get stuck. A new config
"replication.source.eof.autorecovery" has been added, if set to true (default
is false), the 0-length WAL file will be skipped after 1) the max number of
retries has been hit, and 2) there are more WAL files in the queue.)
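
The two skip conditions in the release note can be sketched as a single predicate. This is a minimal illustration, not the code from the attached patches: the class EofAutoRecoverySketch, the helper shouldSkipEmptyWal, and its parameters are hypothetical names, and it assumes the queue size counts the WAL currently being read, so "more WAL files in the queue" means a size greater than 1. Only the config key string comes from the release note.

```java
final class EofAutoRecoverySketch {
    // Config key named in the release note; the default is false.
    static final String EOF_AUTORECOVERY_KEY = "replication.source.eof.autorecovery";

    /**
     * Hypothetical helper: decides whether a WAL may be skipped.
     * Assumes queueSize includes the WAL currently being read.
     */
    static boolean shouldSkipEmptyWal(boolean autoRecoveryEnabled,
                                      long walLength,
                                      int retries,
                                      int maxRetries,
                                      int queueSize) {
        return autoRecoveryEnabled       // opt-in via the config key above
            && walLength == 0            // only 0-length WALs are candidates
            && retries >= maxRetries     // 1) the max number of retries has been hit
            && queueSize > 1;            // 2) there are more WAL files in the queue
    }

    public static void main(String[] args) {
        // Enabled, empty WAL, retries exhausted, other WALs queued behind it.
        System.out.println(shouldSkipEmptyWal(true, 0L, 10, 10, 3));
        // Same situation with the default setting (disabled).
        System.out.println(shouldSkipEmptyWal(false, 0L, 10, 10, 3));
    }
}
```

Requiring both conditions keeps the last WAL in a queue from being skipped, which matters because of the risk noted above that a 0-length file may regain its data once a datanode recovers.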
> Replication gets stuck for empty WALs
> -------------------------------------
>
> Key: HBASE-18137
> URL: https://issues.apache.org/jira/browse/HBASE-18137
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.3.1
> Reporter: Ashu Pachauri
> Assignee: Vincent Poon
> Priority: Critical
> Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
> Attachments: HBASE-18137.branch-1.3.v1.patch,
> HBASE-18137.branch-1.3.v2.patch, HBASE-18137.branch-1.3.v3.patch,
> HBASE-18137.branch-1.v1.patch, HBASE-18137.branch-1.v2.patch,
> HBASE-18137.master.v1.patch
>
>
> Replication assumes that only the last WAL of a recovered queue can be empty.
> However, intermittent DFS issues may cause empty WALs to be created (without
> the PWAL magic) and a WAL roll to happen without a regionserver crash. This
> leaves empty WALs in the middle of recovered queues, which causes
> replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got:
> java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
> at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
> at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
> at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty but there were other WALs in the
> recovered queue which were newer and non-empty.
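
The EOFException at the top of the trace is raised by DataInputStream.readFully, which throws when the stream ends before the requested number of bytes has been read. The sketch below reproduces that behavior with an in-memory empty byte array standing in for a 0-length WAL. The class and method names and the 4-byte header length are illustrative assumptions, not HBase or Hadoop code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

final class EmptyWalEofSketch {
    /**
     * Hypothetical helper: returns true if reading a fixed-size header
     * from the given bytes runs into end-of-file.
     */
    static boolean headerReadHitsEof(byte[] fileContents, int headerLength) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(fileContents))) {
            // readFully throws EOFException if fewer than headerLength bytes remain.
            in.readFully(new byte[headerLength]);
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A 0-length "file": the header read cannot be satisfied.
        System.out.println(headerReadHitsEof(new byte[0], 4));
        // A file with at least a full header: the read succeeds.
        System.out.println(headerReadHitsEof(new byte[] {1, 2, 3, 4}, 4));
    }
}
```

Because the failure happens while the reader is still being opened, retrying the same file gives the same result for as long as the file stays empty, which matches the stuck queue described above.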