[
https://issues.apache.org/jira/browse/HBASE-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jerry He resolved HBASE-16581.
------------------------------
Resolution: Duplicate
> Optimize Replication queue transfers after server fail over
> -----------------------------------------------------------
>
> Key: HBASE-16581
> URL: https://issues.apache.org/jira/browse/HBASE-16581
> Project: HBase
> Issue Type: Improvement
> Reporter: Jerry He
> Assignee: Jerry He
>
> Currently if a region server fails, the replication queue of this server will
> be picked up by another region server. The problem is this queue can
> possibly be huge and contains queues from other multiple or cascading server
> failures.
> We had such a case in production. From zk_dump:
> {code}
> ...
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467603735059:
> 18748267
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471723778060:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468258960080:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468204958990:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469701010649:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1470409989238:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471838985073:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467142915090:
> 57804890
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1472181000614:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471464567365:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469486466965:
>
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467787339841:
> 47812951
> ...
> {code}
> There were hundreds of wals hanging under this queue, coming from diferent
> region servers, which took a long time to replicate.
> We should have a better strategy which lets live region servers each grep
> part of this nested queue, and replicate in parallel.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)