[
https://issues.apache.org/jira/browse/HBASE-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sun Xin updated HBASE-27354:
----------------------------
Description:
In
[WALEntryStream#readNextEntryAndRecordReaderPosition|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L257],
it is possible that we read uncommitted data. If we read beyond the committed
file length, we reopen the inputStream and seek back.
In our usage, we found that the position we seek back to may be exactly the
length of the file being written, which can cause an EOFException.
The thrown EOFException is finally caught in
[ReplicationSourceWALReader.run|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L158],
but
[totalBufferUsed|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L78]
is not cleaned up.
After a long run, all peers slow down and eventually block completely.
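A minimal, self-contained sketch (plain JDK I/O, not the actual HBase reader) of why seeking to a position equal to the current file length and then reading throws an EOFException:

```java
import java.io.EOFException;
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

public class SeekToEofDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for a WAL file whose committed length is 100 bytes.
        File f = File.createTempFile("wal", ".tmp");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[100]);
        }
        boolean gotEof = false;
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            in.seek(in.length()); // seek back lands exactly at the file length
            in.readByte();        // nothing left to read -> EOFException
        } catch (EOFException e) {
            gotEof = true;
        }
        System.out.println("EOF at position == length: " + gotEof);
    }
}
```

The seek itself succeeds (it is a valid position), so the failure only surfaces on the subsequent read, matching the description above.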
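A toy model (hypothetical names, not the real ReplicationSourceWALReader logic) of how catching the exception without releasing the buffer quota eventually blocks everything: each failed batch leaks its reservation from a shared counter until no source can acquire quota.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BufferQuotaLeakDemo {
    // Stand-ins for replication.total.buffer.quota and totalBufferUsed.
    static final long QUOTA = 256;
    static final AtomicLong totalBufferUsed = new AtomicLong();

    // Reserve quota for a batch; refuse once the global quota is exhausted.
    static boolean acquire(long size) {
        if (totalBufferUsed.get() + size > QUOTA) {
            return false;
        }
        totalBufferUsed.addAndGet(size);
        return true;
    }

    // Simulates a batch read that fails after reserving quota;
    // the catch site swallows the error but never decrements the counter.
    static void readBatchLeaky(long size) {
        if (!acquire(size)) {
            throw new IllegalStateException("reader blocked: quota exhausted");
        }
        // Simulated EOFException mid-read; quota is never released.
        throw new RuntimeException("EOF while reading");
    }

    public static void main(String[] args) {
        int completed = 0;
        while (true) {
            try {
                readBatchLeaky(64);
            } catch (IllegalStateException blocked) {
                break; // all sources now stall although nothing is buffered
            } catch (RuntimeException eof) {
                completed++; // error caught, but the reservation leaked
            }
        }
        System.out.println("batches before blocking: " + completed);
    }
}
```

With a 256-byte quota and 64-byte batches, four leaked reservations exhaust the quota and the fifth attempt blocks, which mirrors the gradual slowdown and eventual full stop described above.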
> EOF thrown by WALEntryStream causes replication blocking
> --------------------------------------------------------
>
> Key: HBASE-27354
> URL: https://issues.apache.org/jira/browse/HBASE-27354
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.14
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)