[
https://issues.apache.org/jira/browse/HBASE-15252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142304#comment-15142304
]
Duo Zhang commented on HBASE-15252:
-----------------------------------
[~stack] Oh yes, it is used for replication...
{code:title=ReplicationWALReaderManager.java}
  /**
   * Get the next entry, returned and also added in the array
   * @return a new entry or null
   * @throws IOException
   */
  public Entry readNextAndSetPosition() throws IOException {
    Entry entry = this.reader.next();
    // Store the position so that in the future the reader can start
    // reading from here. If the above call to next() throws an
    // exception, the position won't be changed and retry will happen
    // from the last known good position
    this.position = this.reader.getPosition();
    // We need to set the CC to null else it will be compressed when sent to
    // the sink
    if (entry != null) {
      entry.setCompressionContext(null);
    }
    return entry;
  }
{code}
Here we set the position no matter what reader.next returns, so the reader
must always be left at a valid position, otherwise the
{{ReplicationWALReaderManager}} will seek to a wrong position next time...
I think at least we need to change the comment. As written it suggests that we
will reuse the reader itself, but actually we only need the position, not the reader.
And since the only thing we need is a valid position, what about adding a new
method that returns the last valid position? That would be less confusing, I
think. Or alternatively, if reader.next returns null, we just skip updating the
position of the {{ReplicationWALReaderManager}}?
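The "skip updating on null" alternative could look roughly like this. This is a hypothetical sketch, not HBase code: the class, the stub {{Reader}} interface, and {{getLastValidPosition}} are illustrative names, and {{String}} stands in for {{WAL.Entry}}.

```java
import java.io.IOException;

class PositionTrackingReader {
    // Stand-in for the underlying WAL reader; real code returns WAL.Entry.
    interface Reader {
        String next() throws IOException;
        long getPosition() throws IOException;
    }

    private final Reader reader;
    private long position;

    PositionTrackingReader(Reader reader) {
        this.reader = reader;
    }

    /** Returns the next entry, or null; position advances only on success. */
    String readNextAndSetPosition() throws IOException {
        String entry = reader.next();
        if (entry != null) {
            // Only remember positions known to follow a real entry, so a
            // null return can never move us past the last good point.
            position = reader.getPosition();
        }
        return entry;
    }

    /** The last position that is known to be valid to seek back to. */
    long getLastValidPosition() {
        return position;
    }
}
```

With this shape, a null from {{next()}} leaves the stored position untouched, so a later seek always lands on a point that followed a successfully read entry.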
Thanks.
> Data loss when replaying wal if HDFS timeout
> --------------------------------------------
>
> Key: HBASE-15252
> URL: https://issues.apache.org/jira/browse/HBASE-15252
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Attachments: HBASE-15252-testcase.patch
>
>
> This is a problem introduced by HBASE-13825 where we change the exception
> type in catch block in {{readNext}} method of {{ProtobufLogReader}}.
> {code:title=ProtobufLogReader.java}
>       try {
>         ......
>         ProtobufUtil.mergeFrom(builder,
>             new LimitInputStream(this.inputStream, size), (int) size);
>       } catch (IOException ipbe) { // <------ used to be InvalidProtocolBufferException
>         throw (EOFException) new EOFException("Invalid PB, EOF? Ignoring; originalPosition=" +
>             originalPosition + ", currentPosition=" + this.inputStream.getPos() +
>             ", messageSize=" + size + ", currentAvailable=" + available).initCause(ipbe);
>       }
> {code}
> Here, if the {{inputStream}} throws an {{IOException}} due to a timeout or
> something similar, we just convert it to an {{EOFException}}, and at the
> bottom of this method we ignore {{EOFException}} and return false. This
> causes the upper layer to think we have reached the end of file. So when
> replaying we will treat the HDFS timeout error as a normal end of file and
> lose data.
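The distinction the description draws can be sketched as follows. This is a hypothetical, simplified example, not the actual {{ProtobufLogReader}} code: {{WalReadSketch}}, {{RecordSource}}, and {{InvalidMessageException}} (standing in for protobuf's {{InvalidProtocolBufferException}}) are illustrative names. The point is that only a parse failure is a legitimate EOF candidate; any other {{IOException}} (e.g. an HDFS timeout) must propagate so the caller retries instead of assuming end of file.

```java
import java.io.EOFException;
import java.io.IOException;

class WalReadSketch {
    // Stand-in for InvalidProtocolBufferException: a parse-level failure.
    static class InvalidMessageException extends IOException {
        InvalidMessageException(String m) { super(m); }
    }

    interface RecordSource {
        void parseNext() throws IOException;
    }

    /** Returns true if a record was read; throws EOFException only on a parse failure. */
    static boolean readRecord(RecordSource source) throws IOException {
        try {
            source.parseNext();
            return true;
        } catch (InvalidMessageException e) {
            // A truncated trailing record: plausibly a clean end of file.
            throw (EOFException) new EOFException("Invalid record, EOF?").initCause(e);
        }
        // Any other IOException (timeout, connection reset) is deliberately
        // NOT caught here, so it surfaces to the caller instead of being
        // mistaken for end of file.
    }
}
```

Catching the broad {{IOException}} in that spot, as HBASE-13825 did, collapses both cases into the "looks like EOF" path, which is exactly the reported data-loss scenario.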
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)