[ 
https://issues.apache.org/jira/browse/HBASE-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864295#comment-15864295
 ] 

Ashu Pachauri commented on HBASE-12770:
---------------------------------------

[~yangzhe1991] [~Apache9] [~mantonov] The underlying issue is quite painful at 
times leading to accumulation of large queues on only one region server. The 
change also seems pretty non-intrusive to me and it'll nice to have it in our 
prod environment. I checked and the patch for branch-1 cleanly applies on 1.3; 
any objections to a backport to 1.3 for 1.3.1 point release?

> Don't transfer all the queued hlogs of a dead server to the same alive server
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-12770
>                 URL: https://issues.apache.org/jira/browse/HBASE-12770
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 2.0.0, 1.4.0
>            Reporter: Jianwei Cui
>            Assignee: Phil Yang
>            Priority: Minor
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: HBASE-12770-branch-1-v1.patch, 
> HBASE-12770-branch-1-v2.patch, HBASE-12770-branch-1-v3.patch, 
> HBASE-12770-branch-1-v3.patch, HBASE-12770-branch-1-v3.patch, 
> HBASE-12770-branch-1-v3.patch, HBASE-12770-trunk.patch, HBASE-12770-v1.patch, 
> HBASE-12770-v2.patch, HBASE-12770-v3.patch, HBASE-12770-v3.patch
>
>
> When a region server is down(or the cluster restart), all the hlog queues 
> will be transferred by the same alive region server. In a shared cluster, we 
> might create several peers replicating data to different peer clusters. There 
> might be lots of hlogs queued for these peers caused by several reasons, such 
> as some peers might be disabled, or errors from peer cluster might prevent 
> the replication, or the replication sources may fail to read some hlog 
> because of hdfs problem. Then, if the server is down or restarted, another 
> alive server will take all the replication jobs of the dead server, this 
> might bring a big pressure to resources(network/disk read) of the alive 
> server and also is not fast enough to replicate the queued hlogs. And if the 
> alive server is down, all the replication jobs including that takes from 
> other dead servers will once again be totally transferred to another alive 
> server, this might cause a server have a large number of queued hlogs(in our 
> shared cluster, we find one server might have thousands of queued hlogs for 
> replication). As an optional way, is it reasonable that the alive server only 
> transfer one peer's hlogs from the dead server one time? Then, other alive 
> region servers might have the opportunity to transfer the hlogs of rest 
> peers. This may also help the queued hlogs be processed more fast. Any 
> discussion is welcome.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to