[ 
https://issues.apache.org/jira/browse/HBASE-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066370#comment-15066370
 ] 

Ashish Singhi commented on HBASE-14937:
---------------------------------------

The rpc call timeout value is increased only when we get a 
CallTimeoutException. For all other exception types the behaviour remains the 
same.
Now suppose that, due to some issue, we are able to connect to the peer cluster 
but cannot replicate the data, and after a lot of retries the calculated 
timeout value has grown to, say, 5 hours. If the peer cluster comes back after 
two hours during this call, the call will resume and succeed at that point, so 
there is no blocking of replication activity as such.
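
To make the rule above concrete, here is a minimal, self-contained sketch. It 
is not the code from the attached patch: the class AdaptiveRpcTimeout, its 
method names, the linear growth and the upper cap are all assumptions made for 
illustration, and CallTimeoutException is declared locally as a stand-in for 
the HBase class so that the example compiles on its own.

```java
import java.io.IOException;
import java.net.ConnectException;

// Stand-in for HBase's CallTimeoutException (assumption: declared here only
// so that the sketch is self-contained).
class CallTimeoutException extends IOException {
    CallTimeoutException(String msg) { super(msg); }
}

// Hypothetical helper, not part of HBase: tracks the rpc timeout to use for
// the next replication call.
class AdaptiveRpcTimeout {
    private final int baseTimeoutMs;
    private final int maxTimeoutMs;
    private int multiplier = 1; // assumed linear growth: base * multiplier

    AdaptiveRpcTimeout(int baseTimeoutMs, int maxTimeoutMs) {
        this.baseTimeoutMs = baseTimeoutMs;
        this.maxTimeoutMs = maxTimeoutMs;
    }

    // Timeout for the next call, capped at maxTimeoutMs.
    int currentTimeoutMs() {
        long timeout = (long) baseTimeoutMs * multiplier;
        return (int) Math.min(timeout, (long) maxTimeoutMs);
    }

    // Grow the timeout only for call timeouts; every other exception type
    // (for example ConnectException) leaves it unchanged.
    void onFailure(IOException e) {
        if (e instanceof CallTimeoutException
                && (long) baseTimeoutMs * multiplier < maxTimeoutMs) {
            multiplier++;
        }
    }

    public static void main(String[] args) {
        AdaptiveRpcTimeout timeout = new AdaptiveRpcTimeout(1000, 5000);
        timeout.onFailure(new ConnectException("peer is down"));
        System.out.println(timeout.currentTimeoutMs()); // still the base value
        timeout.onFailure(new CallTimeoutException("call timed out"));
        System.out.println(timeout.currentTimeoutMs()); // grown by one step
    }
}
```

The timeout is only an upper bound on how long a single call may wait, so a 
call that completes earlier returns as soon as the peer responds.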

I tried to simulate this on my local cluster in two ways. First, I brought the 
HBase service on the peer cluster down, so the client was getting 
ConnectException and hence there was no increase in the rpc timeout value. 
Second, I kept a debug point in the replication flow in the peer cluster so 
that replication activity could not complete within the set rpc timeout value. 
The client got CallTimeoutException 2-3 times and, as per the patch, the rpc 
timeout was increased. Then, on a new call, after the call had been received 
in the peer cluster, I released the debug point after some time and 
replication activity began immediately.

Please let me know if this addresses your concerns, or if there is anything 
else you would like me to check.

> Make rpc call timeout for replication adaptive
> ----------------------------------------------
>
>                 Key: HBASE-14937
>                 URL: https://issues.apache.org/jira/browse/HBASE-14937
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Ashish Singhi
>            Assignee: Ashish Singhi
>              Labels: replication
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-14937.patch
>
>
> When peer cluster replication is disabled while a lot of writes are happening 
> in the active cluster, and peer cluster replication is later enabled, there 
> is a chance that replication requests to the peer cluster may time out.
> This is possible after HBASE-13153, and it can also happen when a large 
> amount of WAL data is still pending replication.
> The approach to this problem will be discussed in the comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
