[jira] [Commented] (HBASE-25536) Remove 0 length wal file from logQueue if it belongs to old sources.

Hudson (Jira) Sat, 30 Jan 2021 09:18:07 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-25536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275673#comment-17275673
 ]


Hudson commented on HBASE-25536:
--------------------------------

Results for branch branch-2.3
        [build #161 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/]:
 (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/161/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Remove 0 length wal file from logQueue if it belongs to old sources.
> --------------------------------------------------------------------
>
>                 Key: HBASE-25536
>                 URL: https://issues.apache.org/jira/browse/HBASE-25536
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.2.7, 2.5.0, 2.3.5, 2.4.2
>
>
> In our production clusters, we found a case where the RS does not remove a 
> 0-length file from the replication queue (the in-memory queue, not the zk 
> replication queue) when the logQueue size is 1.
>  Stack trace below:
> {noformat}
> 2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] 
> regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
> replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
>  java.io.EOFException: 
> hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a 
> SequenceFile
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147)
> Caused by: java.io.EOFException: 
> hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a 
> SequenceFile
>       at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
>       at 
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
>       at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
>       at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
>       at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
>       ... 1 more
> {noformat}
> The WAL in question has length 0 (verified via the hadoop ls command) and is 
> from recovered sources. There is just 1 log file in the queue (verified via 
> heap dump).
>  We have logic to remove a 0-length log file from the queue when we encounter 
> an EOFException and logQueue#size is greater than 1. Code snippet below.
> {code:java|title=ReplicationSourceWALReader.java|borderStyle=solid}
>   // if we get an EOF due to a zero-length log, and there are other logs in queue
>   // (highly likely we've closed the current log), we've hit the max retries, and
>   // autorecovery is enabled, then dump the log
>   private void handleEofException(IOException e) {
>     if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>        logQueue.size() > 1 && this.eofAutoRecovery) {
>       try {
>         if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
>           LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
>           logQueue.remove();
>           currentPosition = 0;
>         }
>       } catch (IOException ioe) {
>         LOG.warn("Couldn't get file length information about log " + logQueue.peek());
>       }
>     }
>   }
> {code}
> This size check is valid for active sources, where the queue must hold at 
> least one WAL file (the current one). For recovered sources, where we don't 
> add the current WAL file to the queue, we can skip the logQueue#size check.
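
The proposal above could be sketched roughly as follows. This is a simplified, self-contained model and not the actual HBase patch: the class name ZeroLengthWalSketch, the recovered flag, the enqueue and queueSize helpers, and the in-memory fileLengths map (standing in for fs.getFileStatus(path).getLen()) are all assumptions made for illustration.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

/** Hypothetical stand-in for the reader's EOF handling; not the HBase class. */
final class ZeroLengthWalSketch {
  private final Deque<String> logQueue = new ArrayDeque<>();
  // Assumption: a map of path -> length replaces the real FileSystem lookup.
  private final Map<String, Long> fileLengths;
  private final boolean eofAutoRecovery;
  // Assumption: true when this source was recovered from another region server.
  private final boolean recovered;
  long currentPosition;

  ZeroLengthWalSketch(Map<String, Long> fileLengths, boolean eofAutoRecovery,
      boolean recovered) {
    this.fileLengths = fileLengths;
    this.eofAutoRecovery = eofAutoRecovery;
    this.recovered = recovered;
  }

  void enqueue(String walPath) {
    logQueue.add(walPath);
  }

  int queueSize() {
    return logQueue.size();
  }

  /** Returns true if a zero-length WAL was dropped from the head of the queue. */
  boolean handleEofException(IOException e) {
    boolean isEof = e instanceof EOFException || e.getCause() instanceof EOFException;
    // Active sources keep the size check, because the head may be the WAL
    // that is currently being written. Recovered sources skip it.
    boolean sizeOk = recovered || logQueue.size() > 1;
    if (!isEof || !sizeOk || !eofAutoRecovery) {
      return false;
    }
    String head = logQueue.peek();
    if (head == null) {
      return false;
    }
    Long len = fileLengths.get(head);
    if (len != null && len == 0L) {
      logQueue.remove();
      currentPosition = 0;
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    ZeroLengthWalSketch source =
        new ZeroLengthWalSketch(Map.of("wal-0", 0L), true, true);
    source.enqueue("wal-0");
    boolean removed = source.handleEofException(new EOFException("not a SequenceFile"));
    System.out.println("removed=" + removed + ", queueSize=" + source.queueSize());
  }
}
```

In this sketch an active source with a single queued WAL is left untouched, which preserves the existing behavior, while a recovered source drops the empty file even when it is the only entry.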



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
