[ https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677318#comment-16677318 ]
Sean Busbey commented on HBASE-20604:
-------------------------------------
{code}
@@ -416,7 +420,15 @@ public class ProtobufLogReader extends ReaderBase {
if (LOG.isTraceEnabled()) {
LOG.trace("Encountered a malformed edit, seeking back to last good position in file, from "+ inputStream.getPos()+" to " + originalPosition, eof);
}
- seekOnFs(originalPosition);
+ // If stuck at the same place and we got and exception, lets go back at the beginning.
+ if (inputStream.getPos() == originalPosition && resetPosition) {
+ if (LOG.isTraceEnabled()) {
+ LOG.trace("Seeking to the beginning of the WAL, current position " + originalPosition + " is the same as the original position.");
+ }
+ seekOnFs(0);
+ } else {
+ seekOnFs(originalPosition);
+ }
{code}
The {{LOG.trace}} block just before this addition should be inside of the
{{else}} clause that's added, because currently in the "reset to start" case
we're effectively duplicating the TRACE messages.
After the above, the {{LOG.trace}} message emitted when we seek to the start
should state, along with the why ("original and current positions match"), that
we got a malformed edit.
With those two changes and the long line from checkstyle corrected, I'm +1.
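
For illustration, the two logging changes requested above could be sketched roughly as follows. This is a standalone model, not the actual patch: the class {{SeekBackSketch}}, the method {{chooseSeekTarget}}, the list standing in for the logger, and the exact wording of the "reset to start" message are all assumptions made for the example. The point is only that each branch emits exactly one trace message, and that the "reset to start" message mentions the malformed edit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the seek-back logic; not the real ProtobufLogReader. */
final class SeekBackSketch {

  /**
   * Picks the position to seek to after a malformed edit and records exactly
   * one trace message for the chosen branch. The trace list is an assumed
   * substitute for LOG.trace so the sketch runs without a logging library.
   */
  static long chooseSeekTarget(long currentPos, long originalPosition,
      boolean resetPosition, List<String> trace) {
    if (currentPos == originalPosition && resetPosition) {
      // Stuck at the same place: say so, and say that a malformed edit caused it.
      trace.add("Encountered a malformed edit, seeking to the beginning of the WAL since "
          + "current position and original position match at " + originalPosition);
      return 0L;
    }
    // Normal case: the original message, now only logged in the else branch.
    trace.add("Encountered a malformed edit, seeking back to last good position in file, "
        + "from " + currentPos + " to " + originalPosition);
    return originalPosition;
  }

  public static void main(String[] args) {
    List<String> trace = new ArrayList<>();
    System.out.println(chooseSeekTarget(128L, 128L, true, trace));
    System.out.println(chooseSeekTarget(256L, 128L, true, trace));
    trace.forEach(System.out::println);
  }
}
```

In the real reader the returned value would be passed to {{seekOnFs}}, and each message would stay guarded by {{LOG.isTraceEnabled()}}.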
> ProtobufLogReader#readNext can incorrectly loop to the same position in the
> stream until the WAL is rolled
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-20604
> URL: https://issues.apache.org/jira/browse/HBASE-20604
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal
> Affects Versions: 3.0.0
> Reporter: Esteban Gutierrez
> Assignee: Esteban Gutierrez
> Priority: Critical
> Attachments: HBASE-20604.002.patch, HBASE-20604.patch
>
>
> Every time we call {{ProtobufLogReader#readNext}} we consume the input stream
> associated with the {{FSDataInputStream}} from the WAL that we are reading.
> Under certain conditions, e.g. when using encryption at rest
> ({{CryptoInputStream}}), the stream can return partial data, which can cause a
> premature EOF that causes {{inputStream.getPos()}} to return to the same
> original position, causing {{ProtobufLogReader#readNext}} to retry the
> reads until the WAL is rolled.
> The side effect of this issue is that {{ReplicationSource}} can get stuck
> until the WAL is rolled, causing replication delays of up to an hour in some
> cases.