[ https://issues.apache.org/jira/browse/HBASE-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066370#comment-15066370 ]
Ashish Singhi commented on HBASE-14937:
---------------------------------------
The rpc call timeout value is increased only when we get a
CallTimeoutException; for all other exception types the code remains the
same.
Now suppose that, due to some issue, we were able to connect to the peer
cluster but could not replicate the data, and after a lot of retries the
calculated timeout value has grown to, say, 5 hours. If the peer cluster comes
back after two hours during this call, the call will resume and succeed, so
there is no blocking of replication activity as such.
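To illustrate the behaviour described above, here is a minimal sketch. It is
not the code from HBASE-14937.patch: the class AdaptiveRpcTimeout, its method
names, the doubling factor and the upper cap are all assumptions made for
illustration, and CallTimeoutException is declared locally as a stand-in so
the example compiles on its own.

```java
// Hypothetical stand-in for HBase's CallTimeoutException, declared here only
// so that the sketch is self-contained.
class CallTimeoutException extends java.io.IOException {
    CallTimeoutException(String msg) {
        super(msg);
    }
}

// Illustrative only: tracks the rpc timeout to use for the next replication call.
class AdaptiveRpcTimeout {
    private final int maxMs;   // assumed upper bound on the timeout
    private int currentMs;     // timeout to use for the next call

    AdaptiveRpcTimeout(int baseMs, int maxMs) {
        this.maxMs = maxMs;
        this.currentMs = baseMs;
    }

    int currentMs() {
        return currentMs;
    }

    // Called after a failed replication attempt.
    void onFailure(Exception e) {
        if (e instanceof CallTimeoutException) {
            // Assumed policy: double the timeout, capped at maxMs.
            currentMs = (int) Math.min((long) currentMs * 2, (long) maxMs);
        }
        // Any other exception type (for example ConnectException) leaves the
        // timeout unchanged.
    }
}
```

With a base of 1000 ms and a cap of 3000 ms, a ConnectException leaves the
timeout at 1000 ms, while two consecutive CallTimeoutExceptions raise it to
2000 ms and then 3000 ms.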
I tried to simulate this on my local cluster in two ways. First, I brought the
HBase service down on the peer cluster, so the client was getting
ConnectException and hence there was no increase in the rpc timeout value.
Second, I kept a debug point in the replication flow in the peer cluster so
that the replication activity could not complete within the set rpc timeout
value. The client got CallTimeoutException 2-3 times and, as per the patch,
the rpc timeout was increased. Then, on a new call, after the peer cluster had
received the call, I released the debug point after some time and the
replication activity began immediately.
Please let me know if this addresses your concerns, or if there is anything
else you would like me to check.
> Make rpc call timeout for replication adaptive
> ----------------------------------------------
>
> Key: HBASE-14937
> URL: https://issues.apache.org/jira/browse/HBASE-14937
> Project: HBase
> Issue Type: Improvement
> Reporter: Ashish Singhi
> Assignee: Ashish Singhi
> Labels: replication
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-14937.patch
>
>
> When peer cluster replication is disabled and a lot of writes are happening
> in the active cluster, and peer cluster replication is later enabled, there
> is a chance that replication requests to the peer cluster may time out.
> This is possible after HBASE-13153, and it can also happen when a large
> amount of WAL data is still pending replication.
> Approach to this problem will be discussed in the comments.