Wellington Chevreuil created HBASE-21461:
--------------------------------------------
Summary: Region CoProcessor for splitting large WAL entries into
smaller batches, to handle situations where faulty ingestion has created too many
mutations for the same cell in a single batch
Key: HBASE-21461
URL: https://issues.apache.org/jira/browse/HBASE-21461
Project: HBase
Issue Type: New Feature
Components: hbase-operator-tools, Replication
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
In deployments with replication enabled, faulty ingestion clients may produce
a single WalEntry containing too many edits for the same cell. This causes
*ReplicationSink*, in the target cluster, to attempt a single batch mutation
with too many operations, which in turn can lead to very large RPC requests
that may not fit in the target RS RPC call queue. In this case, the messages
below are seen on the target RS trying to perform the sink:
{noformat}
WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME,
attempt=4/4 failed=2ops, last exception:
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size
too small? on regionserver01.example.com,60020,1524334173359, tracking started
Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
2018-09-07 10:40:59,506 ERROR
org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to
accept edit because:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2
actions: RemoteWithExtrasException: 2 times,
at
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
at
org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
at
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
When this problem manifests, replication gets stuck and WAL files pile up in
the source cluster's WALs/oldWALs folders. The typical workaround requires
manual cleanup of replication znodes in ZK, and manual WAL replay for the WAL
files containing the large entry.
This CP would handle the issue by checking for large WAL entries and splitting
them into smaller batches in the *preReplicateLogEntries* method hook.
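As a rough illustration of the splitting step only, the sketch below chunks a list of edits into batches of bounded size. It is plain Java with no HBase dependencies: the class name BatchSplitter, the method split, and the maxBatchSize parameter are hypothetical names chosen for this example, not part of HBase or of the proposed CP. A real coprocessor would apply the same idea to the cells of a WAL entry and would likely also bound batches by byte size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not an HBase class: splits one oversized list of
// edits into consecutive batches of at most maxBatchSize elements.
public class BatchSplitter {

    public static <T> List<List<T>> split(List<T> edits, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < edits.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, edits.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(edits.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Five edits with a limit of two per batch yield three batches,
        // preserving the original order of the edits.
        List<String> edits = Arrays.asList("e1", "e2", "e3", "e4", "e5");
        for (List<String> batch : split(edits, 2)) {
            System.out.println(batch);
        }
    }
}
```

Order is preserved across batches, which matters here because all the edits target the same cell and must be applied in sequence.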
*Additional Note*: HBASE-18027 introduced some safeguards against such large RPC
requests, which may already help avoid this scenario. That is not available in
1.2 releases, though, so this CP tool may still be relevant for 1.2 clusters.
It may also still be worth having as a workaround for any other, as yet
unknown, large RPC scenarios.