[ https://issues.apache.org/jira/browse/HBASE-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275673#comment-17275673 ]
Hudson commented on HBASE-25536:
--------------------------------
Results for branch branch-2.3
[build #161 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> Remove 0 length wal file from logQueue if it belongs to old sources.
> --------------------------------------------------------------------
>
> Key: HBASE-25536
> URL: https://issues.apache.org/jira/browse/HBASE-25536
> Project: HBase
> Issue Type: Improvement
> Components: Replication
> Affects Versions: 1.6.0
> Reporter: Rushabh Shah
> Assignee: Rushabh Shah
> Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.2.7, 2.5.0, 2.3.5, 2.4.2
>
>
> In our production clusters, we found a case where the RS does not remove a
> 0-length file from the replication queue (the in-memory queue, not the zk
> replication queue) if the logQueue size is 1.
> Stack trace below:
> {noformat}
> 2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147)
> Caused by: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
>     at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>     at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
>     at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198)
>     at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
>     ... 1 more
> {noformat}
> The WAL in question has length 0 (verified via the hadoop ls command) and
> comes from a recovered source. There is just 1 log file in the queue
> (verified via heap dump).
> We have logic to remove a 0-length log file from the queue when we encounter
> an EOFException and logQueue#size is greater than 1. Code snippet below.
> {code:java|title=ReplicationSourceWALReader.java|borderStyle=solid}
> // if we get an EOF due to a zero-length log, and there are other logs in queue
> // (highly likely we've closed the current log), we've hit the max retries, and autorecovery is
> // enabled, then dump the log
> private void handleEofException(IOException e) {
>   if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>     logQueue.size() > 1 && this.eofAutoRecovery) {
>     try {
>       if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
>         LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
>         logQueue.remove();
>         currentPosition = 0;
>       }
>     } catch (IOException ioe) {
>       LOG.warn("Couldn't get file length information about log " + logQueue.peek());
>     }
>   }
> }
> {code}
> This size check is valid for active sources, where the queue must hold at
> least one WAL file (the current WAL file). For recovered sources, where we
> don't add the current WAL file to the queue, we can skip the logQueue#size
> check.
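
The change described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the committed patch: the class `EofRecoverySketch`, the `WalFile` type, and the `recovered` flag are hypothetical names introduced for illustration, and an in-memory length field stands in for the `fs.getFileStatus(...).getLen()` call so the example runs without HDFS.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;

class EofRecoverySketch {

    /** Hypothetical stand-in for a WAL file: a name plus its length in bytes. */
    static final class WalFile {
        final String name;
        final long length;

        WalFile(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /** Same EOF test as the snippet in the issue: the exception or its direct cause. */
    static boolean isEof(IOException e) {
        return e instanceof EOFException || e.getCause() instanceof EOFException;
    }

    /**
     * Removes the head of the queue if it is a 0-length log.
     * Active sources keep the original "size > 1" guard; recovered sources
     * (assumption: signalled by the hypothetical 'recovered' flag) only need
     * a non-empty queue, since the current WAL is never added to them.
     * Returns true if a log was removed.
     */
    static boolean handleEofException(IOException e, Queue<WalFile> logQueue,
                                      boolean recovered, boolean eofAutoRecovery) {
        boolean queueAllowsRemoval = recovered ? !logQueue.isEmpty() : logQueue.size() > 1;
        if (!isEof(e) || !eofAutoRecovery || !queueAllowsRemoval) {
            return false;
        }
        if (logQueue.peek().length == 0) {
            logQueue.remove();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Queue<WalFile> queue = new ArrayDeque<>();
        queue.add(new WalFile("recovered-wal", 0));
        IOException eof = new EOFException("not a SequenceFile");

        // Single 0-length log: kept for an active source, removed for a recovered one.
        System.out.println("active source removed: "
            + handleEofException(eof, queue, false, true));
        System.out.println("recovered source removed: "
            + handleEofException(eof, queue, true, true));
    }
}
```

With a single 0-length log in the queue, the active-source call leaves the queue untouched while the recovered-source call removes the file, which is the behavior the description asks for.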