[ 
https://issues.apache.org/jira/browse/HBASE-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926392#action_12926392
 ] 

Jonathan Gray commented on HBASE-3168:
--------------------------------------

Okay, saw your message on the list.  Please do go for it, but we should get it 
into 0.90 because this can cause more trouble now than it did previously.

I'll be running this with a maximum skew of something like 5 seconds, though it 
should definitely be configurable.
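
A minimal sketch of what such a check could look like, assuming the RS reports 
its current time when it checks in and the master compares that against its own 
clock.  The class, method, and parameter names are hypothetical (only the 
"ClockOutOfSync" exception name comes from this issue), and the default is just 
the 5 seconds mentioned above.  This is not the actual ServerManager code.

```java
// Hypothetical sketch; none of these names are taken from the HBase source.
final class ClockSkewCheck {

    /** Thrown when a region server's clock is too far from the master's. */
    static final class ClockOutOfSyncException extends Exception {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /** Assumed default; in practice this would be read from configuration. */
    static final long DEFAULT_MAX_SKEW_MS = 5_000L;

    /**
     * Rejects a joining server whose reported time differs from the master's
     * time by more than maxSkewMs, in either direction.
     */
    static void checkClockSkew(String serverName, long masterTimeMs,
                               long serverTimeMs, long maxSkewMs)
            throws ClockOutOfSyncException {
        long skewMs = Math.abs(masterTimeMs - serverTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: clock skew " + skewMs + "ms exceeds max "
                + maxSkewMs + "ms");
        }
    }
}
```

The master would call this before recording the new server, and the RS would 
log the exception message and shut down.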

The issue that messed up my cluster a bit would have happened if the RS and 
master were more than 30 seconds out of sync at the time.  This 30 seconds is 
related to the configurable timeout of region-in-transition operations.  The 
master always thought the RS had been opening a region for more than 30 seconds, 
so it kept timing the operation out.
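
To make the arithmetic concrete, here is a rough sketch, assuming the 
transition is stamped using the RS's clock and the master measures elapsed time 
against its own clock.  The names are hypothetical and the 30 second value is 
the timeout mentioned above; the real master logic is more involved.

```java
// Hypothetical sketch of the failure mode; not the actual master code.
final class TransitionTimeoutSketch {

    static final long TIMEOUT_MS = 30_000L;

    /** Elapsed time as the master sees it: its own clock minus the RS stamp. */
    static boolean looksTimedOut(long masterNowMs, long stampFromRsMs) {
        return masterNowMs - stampFromRsMs > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        long masterNowMs = 1_000_000L;
        long rsClockLagMs = 35_000L;                      // RS clock is 35s behind
        long stampFromRsMs = masterNowMs - rsClockLagMs;  // RS stamps "now" with its own clock

        // No real time has passed, yet the master already sees 35s elapsed.
        System.out.println(looksTimedOut(masterNowMs, stampFromRsMs)); // true
    }
}
```

With a lag larger than the timeout, every fresh transition looks expired the 
moment the master sees it, which matches the continuous timeouts described 
above.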

In the real world, a simple NTP setup keeps clocks well aligned, to within 1 
second.  When you're talking many hundreds or thousands of nodes, there are 
always exceptions.  They can be caused by human/ops error, hardware error, 
software, whatever.  So long-term we may need to think about better design 
around how we use timestamps across the cluster.  Well beyond the scope of this 
jira though :)

> Sanity date and time check when a region server joins the cluster
> -----------------------------------------------------------------
>
>                 Key: HBASE-3168
>                 URL: https://issues.apache.org/jira/browse/HBASE-3168
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.89.20100924
>         Environment: RHEL 5.5 64bit, 1 Master 4 Region Servers
>            Reporter: Jeff Whiting
>
> Introduce a sanity check when a RS joins the cluster to make sure its clock 
> isn't too far out of sync with the rest of the cluster.  If the RS's clock is 
> skewed too far, the master would prevent it from joining, and the RS would 
> log the error and die. 
> Having a RS with even small differences in time can cause huge problems due 
> to how HBase stores values with timestamps.
> According to J-D, in ServerManager we are already doing: 
> {code}
>     HServerInfo info = new HServerInfo(serverInfo);
>     checkIsDead(info.getServerName(), "STARTUP");
>     checkAlreadySameHostPort(info);
>     recordNewServer(info, false, null);
> {code}
> The new check would fit in nicely there.
> JG suggests we add a "ClockOutOfSync-like exception".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
