[
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HBASE-21461 started by Wellington Chevreuil.
----------------------------------------------------
> Region CoProcessor for splitting large WAL entries into smaller batches, to
> handle situations where faulty ingestion has created too many mutations for
> the same cell in a single batch
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21461
> URL: https://issues.apache.org/jira/browse/HBASE-21461
> Project: HBase
> Issue Type: New Feature
> Components: hbase-operator-tools, Replication
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Minor
> Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt,
> HBASE-21461-master.001.txt
>
>
> On deployments with replication enabled, faulty ingestion clients may
> produce a single WALEntry containing too many edits for the same cell.
> This causes *ReplicationSink* in the target cluster to attempt a single
> batch mutation with too many operations, which in turn can produce very
> large RPC requests that may not fit in the target RS's RPC queue. In this
> case, the messages below are seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME,
> attempt=4/4 failed=2ops, last exception:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
> Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size
> too small? on regionserver01.example.com,60020,1524334173359, tracking
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2
> actions: RemoteWithExtrasException: 2 times,
> at
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication gets stuck and WAL files pile up
> in the source cluster's WALs/oldWALs directories. The typical workaround
> requires manual cleanup of the replication znodes in ZK, plus manual replay
> of the WAL files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and
> splitting them into smaller batches in the *preReplicateLogEntries* method
> hook.
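> At its core, the CP only needs to partition an oversized entry's edit list
> into bounded sub-batches before they reach the sink as a single mutation
> batch. A minimal, self-contained sketch of that partitioning step (the
> `splitIntoBatches` helper and the batch-size limit are illustrative names,
> not the actual patch code):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {

    /**
     * Splits a list of edits into sub-batches of at most maxBatchSize
     * elements, so that each sub-batch can be applied as its own batch
     * mutation and stays within the RPC queue size limits.
     */
    public static <T> List<List<T>> splitIntoBatches(List<T> edits, int maxBatchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < edits.size(); i += maxBatchSize) {
            // Copy the sublist view so each batch is independent of the original list.
            int end = Math.min(i + maxBatchSize, edits.size());
            batches.add(new ArrayList<>(edits.subList(i, end)));
        }
        return batches;
    }
}
```

> In the real coprocessor this logic would run inside the replication hook,
> applying each sub-batch separately instead of a single oversized call.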
> *Additional Note*: HBASE-18027 introduced some safeguards against such
> large RPC requests, which may already help avoid this scenario. Those are
> not available in the 1.2 releases, though, so this CP tool may still be
> relevant for 1.2 clusters. It may also be worth having as a workaround for
> any potential unknown large-RPC issue scenarios.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)