[ 
https://issues.apache.org/jira/browse/HBASE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998857#comment-12998857
 ] 

Sean Sechrist commented on HBASE-3558:
--------------------------------------

We had no idea what was causing the problems until we saw messages like 
this in /var/log/syslog on all of our nodes, around the same time our region 
servers started going down:
{noformat}
Feb 22 22:50:22 datanode-a7 ntpd[2754]: time reset -69.993786 s
Feb 22 23:07:12 datanode-a7 ntpd[2754]: time reset -69.988776 s
{noformat}
We then confirmed that the system times on our nodes were out of sync and 
fluctuating. We fixed it by running our own local NTP server.
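
For illustration only, the kind of check proposed in the issue description could look roughly like the sketch below. It is not existing HBaseFsck code: the class name, the method, and the idea of passing in each region server's reported clock as a map are all assumptions, and the 30-second threshold used in the usage note is an arbitrary example value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HBase: flags servers whose reported
// clock differs from a reference clock by more than an allowed skew.
final class ClockSkewCheck {
    private ClockSkewCheck() {}

    // reportedMillis: server name -> that server's current time in epoch millis
    //                 (how the times are collected is left out of this sketch).
    // referenceMillis: the clock to compare against, e.g. the checker's own time.
    // maxSkewMillis: largest absolute difference that is still tolerated.
    static List<String> findSkewed(Map<String, Long> reportedMillis,
                                   long referenceMillis,
                                   long maxSkewMillis) {
        List<String> skewed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : reportedMillis.entrySet()) {
            long skew = Math.abs(entry.getValue() - referenceMillis);
            if (skew > maxSkewMillis) {
                skewed.add(entry.getKey());
            }
        }
        // Sort so the warning output is stable between runs.
        Collections.sort(skewed);
        return skewed;
    }
}
```

With a tolerance of 30000 ms, a node whose clock had just been reset by about 70 s, as in the syslog lines above, would be returned by `findSkewed`, while a node a few seconds off would not.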

> Warnings if RS times are out of sync
> ------------------------------------
>
>                 Key: HBASE-3558
>                 URL: https://issues.apache.org/jira/browse/HBASE-3558
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.89.20100924
>            Reporter: Sean Sechrist
>            Priority: Minor
>
> Last night we ran into a problem with the times on RSs being out of sync by 
> about 1 minute. The clocks were frequently being reset by ~70s because we 
> were getting different responses from pool.ntp.org.
> This caused lost ZK sessions and problems writing to datanodes, so all the 
> RSs kept shutting down.
> I think it would be useful to have HBaseFsck check whether the times on the 
> region servers are out of sync, or perhaps to show a warning on the master 
> web UI.
> This seems related to HBASE-3168, but applies when region servers drift out 
> of sync after they have already joined the cluster (due to NTP issues or 
> something else).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
