Jerry He created HBASE-16581:
--------------------------------

             Summary: Optimize Replication queue transfers after server fail 
over
                 Key: HBASE-16581
                 URL: https://issues.apache.org/jira/browse/HBASE-16581
             Project: HBase
          Issue Type: Improvement
            Reporter: Jerry He
            Assignee: Jerry He


Currently if a region server fails, the replication queue of this server will 
be picked up by another region server.  The problem is this queue can possibly 
be huge and contains queues from other multiple or cascading server failures.

We had such a case in production.  From zk_dump:

{code}
...
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467603735059:
 18748267

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471723778060:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468258960080:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468204958990:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469701010649:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1470409989238:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471838985073:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467142915090:
 57804890

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1472181000614:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471464567365:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469486466965:
 

/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467787339841:
 47812951
...
{code}

There were hundreds of wals hanging under this queue, coming from diferent 
region servers, which took a long time to replicate.

We should have a better strategy which lets live region servers each grep part 
of this nested queue, and replicate in parallel.  




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to