[ https://issues.apache.org/jira/browse/HBASE-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926392#action_12926392 ]
Jonathan Gray commented on HBASE-3168:
--------------------------------------

Okay, saw your message up on list. Please do go for it, but we should get it in to 0.90 because this can cause more trouble now than it did previously. I'll be running this at something like 5 seconds, though it should definitely be configurable.

The issue that messed up my cluster a bit would have happened if the RS and master were more than 30 seconds out of sync at the time. This 30 seconds is related to the configurable timeout of region-in-transition operations. The master always thought the RS had been opening a region for more than 30 seconds, so it timed it out continuously.

In the real world, a simple ntp setup keeps things well aligned and under 1 second. When you're talking many hundreds or thousands of nodes, there are always exceptions. They can be caused by human/ops error, hardware error, software, whatever. So long-term we may need to think about better design around how we use timestamps across the cluster. Well beyond the scope of this jira though :)

> Sanity date and time check when a region server joins the cluster
> -----------------------------------------------------------------
>
> Key: HBASE-3168
> URL: https://issues.apache.org/jira/browse/HBASE-3168
> Project: HBase
> Issue Type: Improvement
> Components: regionserver
> Affects Versions: 0.89.20100924
> Environment: RHEL 5.5 64bit, 1 Master 4 Region Servers
> Reporter: Jeff Whiting
>
> Introduce a sanity check when a RS joins the cluster to make sure its clock
> isn't too far out of skew with the rest of the cluster. If the RS's time is
> too far out of skew then the master would prevent it from joining and RS
> would die and log the error.
> Having a RS with even small differences in time can cause huge problems due
> to how HBase stores values with timestamps.
> According to J-D in ServerManager we are already doing:
> {code}
> HServerInfo info = new HServerInfo(serverInfo);
> checkIsDead(info.getServerName(), "STARTUP");
> checkAlreadySameHostPort(info);
> recordNewServer(info, false, null);
> {code}
> And that the new check would fit in nicely there.
> JG suggests we add a "ClockOutOfSync-like exception"
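
For illustration only, the check proposed in the issue could look roughly like the sketch below. Nothing here is taken from the HBase code base: the class name, the `checkClockSkew` method, its parameters and the `ClockOutOfSyncException` type are all hypothetical, and the sketch assumes the region server reports its current time to the master at startup. The 5 second limit is just the value mentioned in the comment, not an established default. In ServerManager the call would presumably sit before `recordNewServer(...)`.

```java
import java.io.IOException;

// Hypothetical sketch; none of these names come from the HBase code base.
class ClockSkewSketch {

    /** Hypothetical exception in the spirit of the "ClockOutOfSync-like exception". */
    static class ClockOutOfSyncException extends IOException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /**
     * Rejects a region server whose reported clock differs from the master's
     * clock by more than maxSkewMs milliseconds, in either direction.
     * Assumes the region server sends its current time when it starts up.
     */
    static void checkClockSkew(String serverName, long serverTimeMs,
                               long masterTimeMs, long maxSkewMs)
            throws ClockOutOfSyncException {
        long skewMs = Math.abs(masterTimeMs - serverTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: clock skew of " + skewMs
                + "ms exceeds the allowed " + maxSkewMs + "ms");
        }
    }

    public static void main(String[] args) throws Exception {
        long maxSkewMs = 5_000L; // assumed limit; should come from configuration
        long masterNow = System.currentTimeMillis();

        // Within the limit: returns normally and startup would continue.
        checkClockSkew("rs1", masterNow - 800L, masterNow, maxSkewMs);

        // 31 seconds behind the master: the join is refused.
        try {
            checkClockSkew("rs2", masterNow - 31_000L, masterNow, maxSkewMs);
        } catch (ClockOutOfSyncException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The absolute difference is used so that a region server running ahead of the master is refused just like one running behind it. Reading the limit from configuration would cover the "should definitely be configurable" point above.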