[
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681697#comment-16681697
]
Josh Elser commented on HBASE-21461:
------------------------------------
{quote}It will still replicate in same sequence, however in several batches,
instead of a single large one. This is currently done synchronously. Also, it
preserves the OP original timestamp from source, which I think is the most
critical here to maintain the correct state.
{quote}
Ok, cool. When you put it that way, I agree :). My brain is still sputtering to
get started.
{quote}This CP, however, is thought more as an admin tool (that's why I propose
it as part of operators tools)
{quote}
Gotcha. I don't think we have a well-defined "measure" of what we want to put
into operator-tools yet. My only concern is that this may be pigeonholed as
only being relevant to a small number of deployments. However, even if one
person finds value in it, it's probably worth it.
[~stack] or [~busbey], any thoughts on including such a tool into
operator-tools?
{quote}Yeah, definitely worth trying. I haven't actually evaluated such a backport;
I was trying to integrate it into our own distribution, which is based on
1.2 (with some divergences), but couldn't manage to get it working properly. I
can try a "pure" branch-1.2, though.
{quote}
Cool, that's definitely a parallel thread for us to keep a finger on. Making
sure our upstream has the necessary changes when we can make them is important.
Thanks for the great info, Wellington. Making my life easy :)
> Region CoProcessor for splitting large WAL entries in smaller batches, to
> handle situation when faulty ingestion had created too many mutations for
> same cell in single batch
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21461
> URL: https://issues.apache.org/jira/browse/HBASE-21461
> Project: HBase
> Issue Type: New Feature
> Components: hbase-operator-tools, Replication
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Minor
> Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> In deployments with replication enabled, a faulty ingestion client may
> produce a single WalEntry containing too many edits for the same cell.
> This causes *ReplicationSink*, in the target cluster, to attempt a single
> batch mutation with too many operations, which in turn can lead to very large
> RPC requests that may not fit in the target RS RPC call queue. In this
> case, the messages below are seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME,
> attempt=4/4 failed=2ops, last exception:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
> Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size
> too small? on regionserver01.example.com,60020,1524334173359, tracking
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2
> actions: RemoteWithExtrasException: 2 times,
> at
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication gets stuck and WAL files pile up
> in the source cluster's WALs/oldWALs folders. The typical workaround requires
> manual cleanup of replication znodes in ZK and manual WAL replay for the WAL
> files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and
> splitting them into smaller batches in the *preReplicateLogEntries* method
> hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large
> RPC requests, which may already help avoid this scenario. That is not available
> for 1.2 releases, though, and this CP tool may still be relevant for 1.2
> clusters. It may also still be worth having to work around any other,
> as yet unknown, large RPC scenarios.
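
For illustration only, the splitting idea described above can be sketched in plain Java. This is not the attached patch and does not use any HBase API: the class name WalEntryBatchSplitter, the split method, and the maxBatchSize parameter are all hypothetical. It only shows how one oversized list of edits could be cut into consecutive, order-preserving sub-batches, so that each one can be applied as a smaller request.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase or of the attached patch.
public class WalEntryBatchSplitter {

    /**
     * Splits the given edits into consecutive sub-batches of at most
     * maxBatchSize elements. The original order is preserved, so applying
     * the batches one after another replays the edits in the same sequence.
     */
    public static <T> List<List<T>> split(List<T> edits, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < edits.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, edits.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(edits.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-ins for the edits of one oversized WAL entry (assumption).
        List<String> edits = Arrays.asList("e1", "e2", "e3", "e4", "e5");
        for (List<String> batch : split(edits, 2)) {
            System.out.println(batch);
        }
    }
}
```

In an actual coprocessor, the elements would presumably be the cells of the replicated WAL entry, and each sub-batch would be applied synchronously to the target table with the original source timestamps kept, as discussed in the comment thread. How that maps onto the hook is left to the patch itself.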