Josh Elser created HBASE-24779:
----------------------------------
Summary: Improve insight into replication WAL readers hung on
checkQuota
Key: HBASE-24779
URL: https://issues.apache.org/jira/browse/HBASE-24779
Project: HBase
Issue Type: Task
Reporter: Josh Elser
Assignee: Josh Elser
Helped a customer this past weekend who, in a cluster with a large number of
RegionServers, had some RegionServers that replicated data to a peer without
issue while other RegionServers did not.
The number of queued logs varied over the past 24hrs in the same manner: some
spikes into the 100's of queued logs, but at other times only 1's-10's of logs
were queued.
We were able to validate that there were "good" and "bad" RegionServers by
creating a test table, assigning it to a RegionServer, enabling replication on
that table, and checking whether the local puts were replicated to a peer. On a
good RS, data was replicated immediately. On a bad RS, data was never
replicated (at least not within the 10's of minutes that we waited).
On a "bad RS", we were able to observe that the {{wal-reader}} thread(s) on
that RS were spending time in a Thread.sleep() call in a different location
than on the good RegionServers. Specifically, they were sitting in the sleep
call inside {{ReplicationSourceWALReader#checkQuota()}}, _not_ in the
{{handleEmptyWALBatch()}} method on the same class.
My only assumption is that, somehow, these RegionServers got into a situation
where they "allocated" memory from the replication buffer quota but never freed
it. Because the WAL reader then believes it has no free memory, it blocks
indefinitely, and there are no pending edits to ship that would ever free that
memory. A cursory glance at the code gives me a _lot_ of anxiety around places
where we don't properly clean up the accounting (e.g. batches that fail to
ship, dropping a peer). As a first stab, let me add some more debugging so we
can actually track this state properly, for the operators and their sanity.
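For illustration, the suspected failure mode boils down to a usage counter that
is incremented when a batch is read but, on some failure path, never
decremented, so the reader loops in its sleep because usage never drops back
under the quota. A minimal sketch of that accounting pattern (all names here
are hypothetical, for illustration only; this is not the actual HBase code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal model of buffer-quota accounting that can "leak" if a
// failure path skips release(). Hypothetical names, not HBase classes.
public class QuotaSketch {
    static final long QUOTA = 100;                   // total quota in bytes
    static final AtomicLong used = new AtomicLong(); // bytes currently reserved

    // Reader side: reserve space for a batch; the caller is responsible
    // for calling release() once the batch is shipped or dropped.
    static boolean tryAcquire(long batchSize) {
        long newUsed = used.addAndGet(batchSize);
        return newUsed <= QUOTA; // over quota -> the reader sleeps and retries
    }

    // Shipper side: free the space after the batch is handled.
    static void release(long batchSize) {
        used.addAndGet(-batchSize);
    }

    public static void main(String[] args) {
        // Happy path: acquire, ship, release; usage returns to zero.
        tryAcquire(60);
        release(60);
        System.out.println("after ship: used=" + used.get());

        // Leaky path: the batch fails to ship and release() is never called.
        tryAcquire(60);
        // ... ship fails, no release ...

        // The reader is now stuck: the next batch pushes usage over quota,
        // and nothing pending will ever free the leaked reservation.
        System.out.println("can read next batch? " + tryAcquire(60));
    }
}
```

The debugging proposed above would surface exactly this: exposing the current
"used" value so an operator can see it is pinned at or above the quota even
though nothing is being shipped.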
--
This message was sent by Atlassian Jira
(v8.3.4#803005)