[ 
https://issues.apache.org/jira/browse/HBASE-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319379#comment-15319379
 ] 

Sean Busbey commented on HBASE-15983:
-------------------------------------

The proximal failure that brought this to my attention is an error in handling 
offsets (but I don't know exactly what the root cause is yet). Here's a summary:

During our attempts to tail an in-progress WAL, at some point we mishandle some 
underlying error condition and get to a (saved) offset that is not a valid 
beginning of a message. The Reader properly gets and propagates an 
InvalidProtobufException and the ReplicationSource effectively treats this as 
"something happened at the end of the file, rewind." The problem is that the 
saved offset is bad, so rewinding just puts us back at the same location. We 
loop indefinitely so long as the WAL is the active one, then once it rolls we 
treat this failure as an end of file and dump the remainder of the file. In the 
particular deployment where this happened the result was 40-60% row loss.
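
To make the loop concrete, here is a toy model of the behaviour described 
above. It is only a sketch and not the actual ReplicationSource code: the class 
TailLoopSketch, its methods, and the retriesUntilRoll parameter (standing in 
for "the WAL eventually rolls") are all hypothetical names made up for 
illustration.

```java
// Toy model only: TailLoopSketch and its methods are hypothetical, not HBase code.
final class TailLoopSketch {

    /**
     * Returns the offset just past the entry that starts at pos, or -1 when
     * pos is not the beginning of an entry (stands in for the parse failure).
     */
    static long parseEntryAt(int[] entryLengths, long pos) {
        long start = 0;
        for (int len : entryLengths) {
            if (start == pos) {
                return start + len;
            }
            start += len;
        }
        return -1;
    }

    /** Returns how many bytes are shipped when tailing starts at savedOffset. */
    static long bytesShipped(int[] entryLengths, long savedOffset, int retriesUntilRoll) {
        long fileLength = 0;
        for (int len : entryLengths) {
            fileLength += len;
        }
        long shipped = 0;
        long pos = savedOffset;
        int failures = 0;
        while (pos < fileLength) {
            long next = parseEntryAt(entryLengths, pos);
            if (next < 0) {
                failures++;
                if (failures > retriesUntilRoll) {
                    // The WAL has rolled: the failure is treated as end of
                    // file and the remainder of the file is dropped.
                    break;
                }
                // "Rewind" to the saved offset, which is the same bad spot.
                pos = savedOffset;
                continue;
            }
            shipped += next - pos;
            pos = next;
            savedOffset = pos; // only advanced after a successful parse
        }
        return shipped;
    }
}
```

With entries of length 10, 20 and 30, starting from the valid offset 10 ships 
the remaining 50 bytes, while starting from offset 15 (not an entry boundary) 
ships nothing: every retry fails at the same position until the simulated roll, 
and then the rest of the file is skipped.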

I don't have a root cause yet, but I have a general workaround that doesn't 
violate our current promises for replication (though it does make them more 
pronounced and more likely to be noticed). I plan to handle this in three 
subtasks:

# the workaround to ensure that in the case of a cleanly closed WAL file we are 
parsing all of the bytes that we expect to be present
# a docs update that makes our promises around replication more precise 
(namely that we are at-least-once delivery, with no order guarantees)
# solving the proximal error on parsing while tailing the end of the active WAL
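
The idea behind the first subtask can be sketched roughly as below. This is an 
illustration under assumptions, not the patch itself: CleanCloseCheckSketch and 
safeToFinishWal are hypothetical names, and fileLength is assumed to mean the 
number of bytes we expect to be able to parse from the closed file. What the 
caller does when the check fails (retry, reopen the reader, and so on) is 
deliberately left out.

```java
// Hypothetical sketch of the check; these names are not HBase APIs.
final class CleanCloseCheckSketch {

    /**
     * Decides whether an apparent end-of-file may be trusted.
     *
     * @param cleanlyClosed true when the WAL was closed cleanly, so its length is final
     * @param parsedUpTo    offset just past the last entry that parsed successfully
     * @param fileLength    bytes expected to be present in the file (assumption)
     */
    static boolean safeToFinishWal(boolean cleanlyClosed, long parsedUpTo, long fileLength) {
        if (!cleanlyClosed) {
            // The file may still grow, so an EOF here says nothing final.
            return false;
        }
        // Only move on once every expected byte has been parsed.
        return parsedUpTo == fileLength;
    }
}
```

In this sketch, a reader that reports EOF at offset 40 of a cleanly closed 
100-byte file would not be allowed to move on to the next WAL, which is the 
case that currently loses data.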

> Replication improperly discards data from end-of-wal in some cases.
> -------------------------------------------------------------------
>
>                 Key: HBASE-15983
>                 URL: https://issues.apache.org/jira/browse/HBASE-15983
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.98.0, 1.0.0, 1.1.0, 1.2.0
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.0.4, 1.4.0, 1.2.2, 0.98.20, 1.1.6
>
>
> In some particular deployments, the Replication code believes it has
> reached EOF for a WAL prior to successfully parsing all bytes known to
> exist in a cleanly closed file.
> The underlying issue is that several different underlying problems with a WAL 
> reader are all treated as end-of-file by the code in ReplicationSource that 
> decides if a given WAL is completed or needs to be retried.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
