[
https://issues.apache.org/jira/browse/HBASE-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260345#comment-14260345
]
Jean-Daniel Cryans commented on HBASE-12770:
--------------------------------------------
bq. is it reasonable that the alive server only transfer one peer's hlogs from
the dead server
Just to make sure I understand you correctly: you're saying that in the case
where we have many peers, instead of one server grabbing all the queues, the
servers should each grab only one so that the load is spread out. If so, this
is reasonable, but I would change the JIRA's title to reflect that (it's kind
of vague right now, unless the goal was to spur discussion).
bq. Regionservers could publish their queue depths and if there is a
substantial difference detected, one RS could transfer a queue to the less
loaded peer.
Yeah, but the details might be hard to get right. There might be a reason why
a particular queue is growing (blocked on whatever), so it would just start
bouncing around. We could at least have admin tools to move queues; right now
we'd have to move all of them at the same time, but with the above work it
might become possible to move them at a finer granularity.
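The per-peer spreading discussed above can be sketched roughly as follows. This is a minimal, self-contained illustration, not HBase's actual failover code: the class `QueueFailoverSketch`, the `distribute` method, and the string queue names are all hypothetical, and the real implementation would claim queues through ZooKeeper rather than assign them centrally.

```java
import java.util.*;

// Hypothetical sketch: spread a dead server's per-peer replication
// queues across the alive servers, instead of letting a single alive
// server claim all of them at once.
public class QueueFailoverSketch {

    // Assign each peer's queue from the dead server to an alive server
    // round-robin, so no single server inherits every queue.
    static Map<String, List<String>> distribute(List<String> peerQueues,
                                                List<String> aliveServers) {
        Map<String, List<String>> assignment = new HashMap<>();
        for (String server : aliveServers) {
            assignment.put(server, new ArrayList<>());
        }
        int i = 0;
        for (String peerQueue : peerQueues) {
            String server = aliveServers.get(i % aliveServers.size());
            assignment.get(server).add(peerQueue);
            i++;
        }
        return assignment;
    }

    public static void main(String[] args) {
        // The dead server had queues for three peers; two servers remain.
        List<String> peerQueues = Arrays.asList("peer1", "peer2", "peer3");
        List<String> alive = Arrays.asList("rs1", "rs2");
        // Each alive server ends up with at most two of the three queues,
        // rather than one server taking all three.
        System.out.println(distribute(peerQueues, alive));
    }
}
```

Note this deliberately ignores the hard part JD mentions: deciding *when* to move a queue based on depth, which risks queues bouncing around if one peer is blocked for an unrelated reason.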
> Don't transfer all the queued hlogs of a dead server to the same alive server
> -----------------------------------------------------------------------------
>
> Key: HBASE-12770
> URL: https://issues.apache.org/jira/browse/HBASE-12770
> Project: HBase
> Issue Type: Improvement
> Components: Replication
> Reporter: cuijianwei
> Priority: Minor
>
> When a region server is down (or the cluster restarts), all of its queued
> hlogs are transferred to the same alive region server. In a shared cluster,
> we might create several peers replicating data to different peer clusters,
> and lots of hlogs can queue up for these peers for several reasons: some
> peers might be disabled, errors from a peer cluster might block replication,
> or the replication sources might fail to read some hlogs because of an hdfs
> problem. If the server is then down or restarted, a single alive server
> takes over all the replication jobs of the dead server, which can put heavy
> pressure on that server's resources (network/disk reads) and is not fast
> enough to drain the queued hlogs. And if that alive server also dies, all of
> its replication jobs, including the ones it took from other dead servers,
> are once again transferred wholesale to another alive server; this can leave
> one server with a very large number of queued hlogs (in our shared cluster,
> we have seen a single server with thousands of hlogs queued for
> replication). As an alternative, is it reasonable for an alive server to
> transfer only one peer's hlogs from the dead server at a time? Other alive
> region servers would then have the opportunity to take the hlogs of the
> remaining peers, which should also help the queued hlogs be processed
> faster. Any discussion is welcome.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)