[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682722#comment-16682722
 ] 

stack commented on HBASE-21461:
-------------------------------

Let's do option #2. I can help. It's too early (or too late -- smile) for #3. Should I
try adding the patch attached here over in the tools repo?

> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21461
>                 URL: https://issues.apache.org/jira/browse/HBASE-21461
>             Project: HBase
>          Issue Type: New Feature
>          Components: hbase-operator-tools, Replication
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> On replication-enabled deployments, faulty ingestion clients can lead to a single 
> WALEntry containing too many edits for the same cell. This causes *ReplicationSink* 
> in the target cluster to attempt a single batch mutation with too many operations, 
> which in turn can produce a very large RPC request that may not fit in the target 
> RS's RPC queue. In this case, the messages below are seen on the target RS while it 
> tries to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication is stuck and WAL files pile up in the 
> source cluster's WALs/oldWALs folders. The typical workaround requires manual 
> cleanup of the replication znodes in ZK, plus a manual replay of the WAL files 
> containing the large entry.
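> For the replay step, the bundled WALPlayer MapReduce tool can be used; the 
> invocation below is only illustrative, with the input directory and table name as 
> placeholders:
> {noformat}
> hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /path/to/stuck/wal/files TABLE_NAME
> {noformat}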
> This CP handles the issue by checking for large WAL entries in the 
> *preReplicateLogEntries* method hook and splitting them into smaller batches, as 
> sketched below.
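> To illustrate the approach, a minimal sketch of such an observer against the 1.2 
> coprocessor API follows. This is not the attached patch; the class name, thresholds 
> and chunking logic are made up for illustration:
> {code:java}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> 
> import org.apache.hadoop.hbase.Cell;
> import org.apache.hadoop.hbase.CellScanner;
> import org.apache.hadoop.hbase.CellUtil;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Delete;
> import org.apache.hadoop.hbase.client.Mutation;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Row;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.coprocessor.BaseRegionServerObserver;
> import org.apache.hadoop.hbase.coprocessor.ObserverContext;
> import org.apache.hadoop.hbase.coprocessor.RegionServerCoprocessorEnvironment;
> import org.apache.hadoop.hbase.protobuf.generated.AdminProtos.WALEntry;
> 
> /** Sketch only: splits oversized replication batches into smaller Table.batch() calls. */
> public class WalEntrySplitterObserver extends BaseRegionServerObserver {
> 
>   // Illustrative values; a real implementation would read these from configuration.
>   private static final int MAX_CELLS_PER_ENTRY = 10000;
>   private static final int CHUNK_SIZE = 1000;
> 
>   @Override
>   public void preReplicateLogEntries(ObserverContext<RegionServerCoprocessorEnvironment> ctx,
>       List<WALEntry> entries, CellScanner cells) throws IOException {
>     boolean oversized = false;
>     for (WALEntry entry : entries) {
>       if (entry.getAssociatedCellCount() > MAX_CELLS_PER_ENTRY) {
>         oversized = true;
>         break;
>       }
>     }
>     if (!oversized) {
>       return; // small batch, let the default ReplicationSink apply it
>     }
>     // Apply the edits ourselves in smaller chunks, then skip the default sink.
>     for (WALEntry entry : entries) {
>       TableName table = TableName.valueOf(entry.getKey().getTableName().toByteArray());
>       applyInChunks(ctx.getEnvironment(), table, cells, entry.getAssociatedCellCount());
>     }
>     ctx.bypass();
>   }
> 
>   private void applyInChunks(RegionServerCoprocessorEnvironment env, TableName tableName,
>       CellScanner cells, int cellCount) throws IOException {
>     List<Row> chunk = new ArrayList<Row>(CHUNK_SIZE);
>     try (Table table = env.getTable(tableName)) {
>       for (int i = 0; i < cellCount && cells.advance(); i++) {
>         Cell cell = cells.current();
>         Mutation m;
>         if (CellUtil.isDelete(cell)) {
>           Delete d = new Delete(CellUtil.cloneRow(cell));
>           d.addDeleteMarker(cell);
>           m = d;
>         } else {
>           Put p = new Put(CellUtil.cloneRow(cell));
>           p.add(cell);
>           m = p;
>         }
>         chunk.add(m);
>         if (chunk.size() >= CHUNK_SIZE) {
>           flush(table, chunk);
>         }
>       }
>       flush(table, chunk); // remaining cells of this entry
>     }
>   }
> 
>   private void flush(Table table, List<Row> chunk) throws IOException {
>     if (chunk.isEmpty()) {
>       return;
>     }
>     try {
>       table.batch(chunk, new Object[chunk.size()]);
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>       throw new IOException(e);
>     }
>     chunk.clear();
>   }
> }
> {code}
> Such an observer would be loaded on the sink cluster's RegionServers through 
> *hbase.coprocessor.regionserver.classes* in hbase-site.xml.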
> *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC 
> requests, which may already help avoid this scenario. That change is not available 
> in the 1.2 releases, though, so this CP tool is still relevant for 1.2 clusters. It 
> may also be worth keeping as a workaround for any other, yet unknown, large-RPC 
> scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
