[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681697#comment-16681697 ]

Josh Elser commented on HBASE-21461:
------------------------------------

{quote}It will still replicate in the same sequence, just in several batches 
instead of a single large one. This is currently done synchronously. Also, it 
preserves each op's original timestamp from the source, which I think is the 
most critical thing here to maintain the correct state.
{quote}
Ok, cool. When you put it that way, I agree :). My brain is still sputtering to 
get started.
{quote}This CP, however, is intended more as an admin tool (that's why I 
propose it as part of operator-tools)
{quote}
Gotcha. I don't think we have a well-defined "measure" of what we want to put 
into operator-tools yet. My only concern is that this may be pigeonholed into 
only having relevance for a small number of deployments. However, even if one 
person finds value in it, it's probably worth it.

[~stack] or [~busbey], any thoughts on including such a tool into 
operator-tools?
{quote}Yeah, definitely worth trying. I haven't actually evaluated such a 
backport; I was trying to integrate it into our own distribution, which is 
based on 1.2 (with some divergences), but couldn't manage to get it working 
properly. I can try a "pure" branch-1.2, though.
{quote}
Cool, that's definitely a parallel thread for us to keep an eye on. Making 
sure our upstream has the necessary changes when we can make them is important.

Thanks for the great info, Wellington. Making my life easy :)

> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21461
>                 URL: https://issues.apache.org/jira/browse/HBASE-21461
>             Project: HBase
>          Issue Type: New Feature
>          Components: hbase-operator-tools, Replication
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> On deployments with replication enabled, faulty ingestion clients can 
> produce a single WalEntry containing too many edits for the same cell. This 
> causes *ReplicationSink,* on the target cluster, to attempt a single batch 
> mutation with too many operations, which in turn can produce very large RPC 
> requests that may not fit in the final target RS RPC queue. In this case, 
> the messages below are seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication gets stuck and WAL files pile up in 
> the source cluster's WALs/oldWALs folders. The typical workaround requires 
> manual cleanup of the replication znodes in ZK, plus manual replay of the 
> WAL files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and 
> splitting them into smaller batches in the *preReplicateLogEntries* method 
> hook.
> *Additional Note*: HBASE-18027 introduced safeguards against such large RPC 
> requests, which may already avoid this scenario. That change is not 
> available in 1.2 releases, though, so this CP tool may still be relevant for 
> 1.2 clusters. It may also be worth having as a workaround for any potential 
> unknown large-RPC issue scenarios.
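The splitting approach the description outlines (carve an oversized batch of edits into ordered sub-batches before handing them to the sink) can be sketched independently of the HBase coprocessor API. A minimal sketch follows; the class name, method name, and batch size are illustrative and not taken from the attached patch, where the limit would come from configuration:

```java
import java.util.ArrayList;
import java.util.List;

public class WalEntrySplitter {
  /**
   * Splits a large list of edits into ordered sub-batches of at most
   * maxBatchSize elements each. Order is preserved, so replaying the
   * sub-batches sequentially is equivalent to replaying the original
   * batch, and each edit keeps whatever timestamp it already carries.
   */
  public static <E> List<List<E>> split(List<E> edits, int maxBatchSize) {
    List<List<E>> batches = new ArrayList<>();
    for (int from = 0; from < edits.size(); from += maxBatchSize) {
      int to = Math.min(from + maxBatchSize, edits.size());
      // Copy the view so each batch is independent of the source list.
      batches.add(new ArrayList<>(edits.subList(from, to)));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<Integer> edits = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      edits.add(i);
    }
    List<List<Integer>> batches = split(edits, 4);
    System.out.println(batches.size());   // 3 batches: 4 + 4 + 2 edits
    System.out.println(batches.get(2));   // [8, 9]
  }
}
```

In the actual CP this splitting would run inside the replication hook, so the sink submits several smaller batch mutations instead of one RPC that overflows the call queue.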



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
