[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681697#comment-16681697 ]

Josh Elser commented on HBASE-21461:
------------------------------------

{quote}It will still replicate in the same sequence, just in several batches 
instead of a single large one. This is currently done synchronously. Also, it 
preserves each op's original timestamp from the source, which I think is the 
most critical thing here to maintain the correct state.
{quote}
Ok, cool. When you put it that way, I agree :). My brain is still sputtering to 
get started.
{quote}This CP, however, is intended more as an admin tool (that's why I 
propose it as part of operator-tools)
{quote}
Gotcha. I don't think we have a well-defined "measure" of what we want to put 
into operator-tools yet. My only concern is that this may be pigeonholed into 
only having relevance for a small number of deployments. However, even if one 
person finds value in it, it's probably worth it.

[~stack] or [~busbey], any thoughts on including such a tool into 
operator-tools?
{quote}Yeah, definitely worth trying. I haven't actually evaluated such a 
backport; I was trying to integrate it into our own distribution, which is 
based on 1.2 (with some divergences), but couldn't manage to get it working 
properly. I can try a "pure" branch-1.2, though.
{quote}
Cool, that's definitely a parallel thread for us to keep an eye on. Making 
sure our upstream has the necessary changes when we can make them is important.

Thanks for the great info, Wellington. Making my life easy :)

> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21461
>                 URL: https://issues.apache.org/jira/browse/HBASE-21461
>             Project: HBase
>          Issue Type: New Feature
>          Components: hbase-operator-tools, Replication
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> On deployments with replication enabled, faulty ingestion clients can 
> produce a single WalEntry containing too many edits for the same cell. This 
> causes *ReplicationSink,* on the target cluster, to attempt a single batch 
> mutation with too many operations, which in turn can produce very large RPC 
> requests that may not fit in the final target RS RPC queue. In this case, 
> the messages below are seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication gets stuck and WAL files pile up in 
> the source cluster's WALs/oldWALs folders. The typical workaround requires 
> manual cleanup of the replication znodes in ZK, plus manual replay of the 
> WAL files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and 
> splitting them into smaller batches in the *preReplicateLogEntries* method 
> hook.
> *Additional Note*: HBASE-18027 introduced safeguards against such large RPC 
> requests, which may already avoid this scenario. That change is not 
> available in 1.2 releases, though, so this CP tool may still be relevant for 
> 1.2 clusters. It may also be worth having as a workaround for any potential 
> unknown large-RPC issue scenarios.
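The splitting approach the description outlines (carve an oversized batch of edits into ordered sub-batches before handing them to the sink) can be sketched independently of the HBase coprocessor API. A minimal sketch follows; the class name, method name, and batch size are illustrative and not taken from the attached patch, where the limit would come from configuration:

```java
import java.util.ArrayList;
import java.util.List;

public class WalEntrySplitter {
  /**
   * Splits a large list of edits into ordered sub-batches of at most
   * maxBatchSize elements each. Order is preserved, so replaying the
   * sub-batches sequentially is equivalent to replaying the original
   * batch, and each edit keeps whatever timestamp it already carries.
   */
  public static <E> List<List<E>> split(List<E> edits, int maxBatchSize) {
    List<List<E>> batches = new ArrayList<>();
    for (int from = 0; from < edits.size(); from += maxBatchSize) {
      int to = Math.min(from + maxBatchSize, edits.size());
      // Copy the view so each batch is independent of the source list.
      batches.add(new ArrayList<>(edits.subList(from, to)));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<Integer> edits = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      edits.add(i);
    }
    List<List<Integer>> batches = split(edits, 4);
    System.out.println(batches.size());   // 3 batches: 4 + 4 + 2 edits
    System.out.println(batches.get(2));   // [8, 9]
  }
}
```

In the actual CP this splitting would run inside the replication hook, so the sink submits several smaller batch mutations instead of one RPC that overflows the call queue.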



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
