[
https://issues.apache.org/jira/browse/HBASE-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sun Xin updated HBASE-27354:
----------------------------
Description:
In
[WALEntryStream#readNextEntryAndRecordReaderPosition|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L257],
it is possible that we read uncommitted data. If we read beyond the committed
file length, we reopen the inputStream and seek back.
In our usage, we found that the position we seek back to may be exactly the
length of the file being written, which can cause an EOFException.
The thrown EOFException is finally caught in
[ReplicationSourceWALReader.run|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L158],
but
[totalBufferUsed|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L78]
is not cleaned up.
After a long run, all peers slow down and eventually block completely.
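A minimal, self-contained sketch (plain JDK I/O, not the actual HBase reader) of why seeking to a position equal to the current file length and then reading throws an EOFException:

```java
import java.io.EOFException;
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

public class SeekToEofDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for a WAL file whose committed length is 100 bytes.
        File f = File.createTempFile("wal", ".tmp");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[100]);
        }
        boolean gotEof = false;
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            in.seek(in.length()); // seek back lands exactly at the file length
            in.readByte();        // nothing left to read -> EOFException
        } catch (EOFException e) {
            gotEof = true;
        }
        System.out.println("EOF at position == length: " + gotEof);
    }
}
```

The seek itself succeeds (it is a valid position), so the failure only surfaces on the subsequent read, matching the description above.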
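A toy model (hypothetical names, not the real ReplicationSourceWALReader logic) of how catching the exception without releasing the buffer quota eventually blocks everything: each failed batch leaks its reservation from a shared counter until no source can acquire quota.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BufferQuotaLeakDemo {
    // Stand-ins for replication.total.buffer.quota and totalBufferUsed.
    static final long QUOTA = 256;
    static final AtomicLong totalBufferUsed = new AtomicLong();

    // Reserve quota for a batch; refuse once the global quota is exhausted.
    static boolean acquire(long size) {
        if (totalBufferUsed.get() + size > QUOTA) {
            return false;
        }
        totalBufferUsed.addAndGet(size);
        return true;
    }

    // Simulates a batch read that fails after reserving quota;
    // the catch site swallows the error but never decrements the counter.
    static void readBatchLeaky(long size) {
        if (!acquire(size)) {
            throw new IllegalStateException("reader blocked: quota exhausted");
        }
        // Simulated EOFException mid-read; quota is never released.
        throw new RuntimeException("EOF while reading");
    }

    public static void main(String[] args) {
        int completed = 0;
        while (true) {
            try {
                readBatchLeaky(64);
            } catch (IllegalStateException blocked) {
                break; // all sources now stall although nothing is buffered
            } catch (RuntimeException eof) {
                completed++; // error caught, but the reservation leaked
            }
        }
        System.out.println("batches before blocking: " + completed);
    }
}
```

With a 256-byte quota and 64-byte batches, four leaked reservations exhaust the quota and the fifth attempt blocks, which mirrors the gradual slowdown and eventual full stop described above.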
> EOF thrown by WALEntryStream causes replication blocking
> --------------------------------------------------------
>
> Key: HBASE-27354
> URL: https://issues.apache.org/jira/browse/HBASE-27354
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.14
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)