[ https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685419#comment-16685419 ]
Sean Busbey commented on HBASE-20604:
-------------------------------------
My understanding is that a good portion of our issues in this code are caused
by Hadoop not really defining what to expect when a client has a concurrent
read open on a file that's still open for write. This is usually where the
problems in our WAL reading code come up; our replication system relies on
assumptions that aren't really documented anywhere.
AFAIK there's no UT because we have never been able to isolate the problem(s)
that poke up in production around this, and trying to mock out the various
levels of abstraction went poorly when last I tried.
> ProtobufLogReader#readNext can incorrectly loop to the same position in the
> stream until the WAL is rolled
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-20604
> URL: https://issues.apache.org/jira/browse/HBASE-20604
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal
> Affects Versions: 3.0.0
> Reporter: Esteban Gutierrez
> Assignee: Esteban Gutierrez
> Priority: Critical
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch,
> HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch
>
>
> Every time we call {{ProtobufLogReader#readNext}} we consume the input stream
> associated with the {{FSDataInputStream}} of the WAL that we are reading.
> Under certain conditions, e.g. when using encryption at rest
> ({{CryptoInputStream}}), the stream can return partial data, which can cause a
> premature EOF that causes {{inputStream.getPos()}} to return to the same
> original position, causing {{ProtobufLogReader#readNext}} to retry the
> reads until the WAL is rolled.
> The side effect of this issue is that {{ReplicationSource}} can get stuck
> until the WAL is rolled, causing replication delays of up to an hour in some
> cases.
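
For illustration only, here is a minimal JDK-only sketch of the failure mode described above. It is not taken from any of the attached patches and does not use HBase or Hadoop classes. {{ChunkedStream}} is a hypothetical stand-in for a stream that, like the reported {{CryptoInputStream}} behavior, may return fewer bytes than requested. {{naiveRead}} and {{readFully}} are likewise made-up names. The point is only that a short read is not the same thing as EOF.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class PartialReadSketch {

    /** Hypothetical stream that hands back at most `chunk` bytes per bulk read. */
    static final class ChunkedStream extends InputStream {
        private final InputStream in;
        private final int chunk;

        ChunkedStream(byte[] data, int chunk) {
            this.in = new ByteArrayInputStream(data);
            this.chunk = chunk;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Deliberately return partial data even when more is available.
            return in.read(b, off, Math.min(len, chunk));
        }
    }

    /** Naive reader: a single read call, so a short read is misreported as EOF. */
    static boolean naiveRead(InputStream in, byte[] buf) throws IOException {
        int n = in.read(buf, 0, buf.length);
        return n == buf.length;
    }

    /** Keeps reading until the buffer is full; only a real -1 counts as EOF. */
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException("stream ended after " + off + " bytes");
            }
            off += n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] entry = new byte[10]; // pretend this is one serialized WAL entry

        // The data is all there, but the naive reader sees only 4 bytes.
        System.out.println(naiveRead(new ChunkedStream(entry, 4), new byte[10])); // prints false

        // Looping over short reads recovers the full entry without an exception.
        readFully(new ChunkedStream(entry, 4), new byte[10]);
        System.out.println("readFully succeeded");
    }
}
```

A reader that treats the short read as EOF and seeks back to its starting position would see the same result on every retry, which matches the looping described in the issue. Whether the actual fix takes this shape is not stated here.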