Re: [CentOS] Strange NTP problem

2008-05-22 Thread Alfred von Campe
I'm still having this issue.  Here is another update.  I noticed that  
the drift file for the system with the problem contained 0.000.  On  
most other systems this contains a positive number (and on two a  
negative number).  I deleted the drift file, resynch'ed the time with  
ntpdate hostname, restarted the NTP daemon, and waited for the  
drift file to be recreated.  It again contained 0.000 and the  
output of ntpq -np looked like this:


   remote   refid  st t when poll reach   delay
offset  jitter
   
 
==
   10.101.32.104   67.128.71.65 3 u   49   64  3770.611   
1871.40 987.132
  *127.127.1.0 .LOCL.  10 l   48   64  3770.000 
0.000   0.001


I replaced the drift file with the contents of the file before the  
upgrade to CentOS, resynch'ed the time, and restarted the NTP  
daemon.  But after a little while, the system is bound to itself  
again.


BTW, the output of the cron job running ntpdate once an hour showed  
that the system has a very steady drift of 14.9 seconds every hour.   
Does that seem excessive?  NTP should be able to handle this, right?


Alfred

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Strange NTP problem

2008-05-22 Thread Jason Clark
I had a similar problem on a different server that I fixed last night.
Evidently it had a BIOS level feature that tried to modify the CPU clock
rate, much like cpu-freq does within the kernel, and was doing so by
messing with the system clock impacting the RTC.   I was drifting all
over the place until I found and disabled that feature (foxcon board,
something like foxstep I believe is what it is called in BIOS). Not sure
if your lenovo boards have that feature, but i know that some ASUS
boards do.


Jason
www.cyborgworkshop.org


Paul Heinlein wrote:
 On Tue, 20 May 2008, Alfred von Campe wrote:
 
 I have 30 identical Lenovo desktop systems running CentOS 5.1.  On one
 of those systems the clock is running slow (5+ minutes from yesterday
 to this morning and another minute since this morning) despite the
 fact that NTP is running on all of them and they all have the exact
 same /etc/ntp.conf file (I compared the MD5 sums of that file on all
 the systems).  Here is the output of grep ntp /var/log messages on
 the system with the problem since I restarted the NTP daemon earlier
 today:
 
 A slew of 5 min/24 hrs should be in the range of fixable.
 
 May 20 11:35:38 hepdsw03 ntpd[31792]: frequency initialized 0.000 PPM
 from /var/lib/ntp/drift
 
 This is very suspect. Are there any SELinux or other log messages
 suggesting that ntpd isn't able to write to its drift file? Your local
 clock is definitely drifting, so a 0.000 value is bogus. It may indicate
 that there's a disconnect between ntpd and the filesystem.
 
 I'd be interested in the output of ntpdc -c kerninfo; on most systems
 the 'pll frequency' value is a close match to the figure in the drift file.
 
 May 20 11:38:55 hepdsw03 ntpd[31792]: synchronized to LOCAL(0),
 stratum 10
 May 20 11:38:55 hepdsw03 ntpd[31792]: kernel time sync disabled 0001
 May 20 11:39:59 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104,
 stratum 3
 
 This is ungood. Sync-ing to local before your network time server means
 that your machine doesn't want to believe your server -- and you should
 see a kernel time sync enabled message once the machine has sync-ed
 with the time server.
 
 You said the machines are identical. Could there be any variation in the
 BIOS revision level or its settings? Sometimes ACPI stuff can mess up ntp.
 
 Also -- the log messages you provide have no step time server
 reference. Do you have a valid /etc/ntp/step-tickers file?
 
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Strange NTP problem

2008-05-21 Thread Alfred von Campe

On May 21, 2008, at 0:55, Paul Heinlein wrote:


Yeah, with an offset of 54 seconds, it's a bad system. :-)

Try this (assuming 10.101.32.104 is your preferred local NTP server):

  service ntpd stop
  echo 10.101.32.104  /etc/ntp/step-tickers
  service ntpd start

Adding a server to the step-tickers file will tell the ntpd init  
script to do an ntpdate sync against that host before starting ntpd.


In effect, that's what I did manually.  When I found out that about  
this problem, I did the following:


  # service ntpd stop
  # ntpdate ntp
  # service ntpd start

(ntp is a CNAME record for our local NTP server).  But after 24  
hours, the system was off my more than 5 minutes.  Why would it not  
bind to the NTP server.  Here is some output from tshark while ntpd  
was running:


# tshark udp port 123
Running as user root and group root. This could be dangerous.
Capturing on eth0
  0.00 10.66.42.109 - 10.101.32.104 NTP NTP client
  0.000683 10.101.32.104 - 10.66.42.109 NTP NTP server
1025.000547 10.66.42.109 - 10.101.32.104 NTP NTP client
1025.002017 10.101.32.104 - 10.66.42.109 NTP NTP server
2050.000145 10.66.42.109 - 10.101.32.104 NTP NTP client
2050.000812 10.101.32.104 - 10.66.42.109 NTP NTP server
3076.000485 10.66.42.109 - 10.101.32.104 NTP NTP client
3076.001917 10.101.32.104 - 10.66.42.109 NTP NTP server
4101.001065 10.66.42.109 - 10.101.32.104 NTP NTP client
4101.002016 10.101.32.104 - 10.66.42.109 NTP NTP server
5126.000671 10.66.42.109 - 10.101.32.104 NTP NTP client
5126.001551 10.101.32.104 - 10.66.42.109 NTP NTP server
6151.001270 10.66.42.109 - 10.101.32.104 NTP NTP client
6151.001868 10.101.32.104 - 10.66.42.109 NTP NTP server
7175.999866 10.66.42.109 - 10.101.32.104 NTP NTP client
7176.000482 10.101.32.104 - 10.66.42.109 NTP NTP server
8201.999206 10.66.42.109 - 10.101.32.104 NTP NTP client
8202.000563 10.101.32.104 - 10.66.42.109 NTP NTP server
9227.000796 10.66.42.109 - 10.101.32.104 NTP NTP client
9227.001462 10.101.32.104 - 10.66.42.109 NTP NTP server

Does this make any sense?

Alfred

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


[CentOS] Strange NTP problem

2008-05-20 Thread Alfred von Campe
I have 30 identical Lenovo desktop systems running CentOS 5.1.  On  
one of those systems the clock is running slow (5+ minutes from  
yesterday to this morning and another minute since this morning)  
despite the fact that NTP is running on all of them and they all have  
the exact same /etc/ntp.conf file (I compared the MD5 sums of that  
file on all the systems).  Here is the output of grep ntp /var/log  
messages on the system with the problem since I restarted the NTP  
daemon earlier today:


May 20 11:35:38 hepdsw03 ntpd[31791]: ntpd [EMAIL PROTECTED] Sat Nov  
10 12:33:50 UTC 2007 (1)

May 20 11:35:38 hepdsw03 ntpd[31792]: precision = 1.000 usec
May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface  
wildcard, 0.0.0.0#123 Disabled
May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface  
wildcard, ::#123 Disabled
May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface lo, :: 
1#123 Enabled
May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface eth0,  
fe80::210:c6ff:feab:dd92#123 Enabled
May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface lo,  
127.0.0.1#123 Enabled
May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface eth0,  
10.66.42.109#123 Enabled

May 20 11:35:38 hepdsw03 ntpd[31792]: kernel time sync status 0040
May 20 11:35:38 hepdsw03 ntpd[31792]: frequency initialized 0.000 PPM  
from /var/lib/ntp/drift
May 20 11:38:55 hepdsw03 ntpd[31792]: synchronized to LOCAL(0),  
stratum 10

May 20 11:38:55 hepdsw03 ntpd[31792]: kernel time sync disabled 0001
May 20 11:39:59 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104,  
stratum 3
May 20 11:40:58 hepdsw03 ntpd[31792]: synchronized to LOCAL(0),  
stratum 10
May 20 11:42:09 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104,  
stratum 3
May 20 11:47:26 hepdsw03 ntpd[31792]: synchronized to LOCAL(0),  
stratum 10
May 20 11:49:31 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104,  
stratum 3
May 20 11:52:48 hepdsw03 ntpd[31792]: synchronized to LOCAL(0),  
stratum 10
May 20 11:54:54 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104,  
stratum 3
May 20 12:01:26 hepdsw03 ntpd[31792]: synchronized to LOCAL(0),  
stratum 10


Any idea what could be causing this?  For now I have disabled the NTP  
daemon and am running ntpdate once an hour.  What complicates the  
matter is that we are using Kerberos for authentication, and after a  
day or so the user can not log into his system anymore because of the  
time skew.


Alfred

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Strange NTP problem

2008-05-20 Thread Paul Heinlein

On Tue, 20 May 2008, Alfred von Campe wrote:

I have 30 identical Lenovo desktop systems running CentOS 5.1.  On 
one of those systems the clock is running slow (5+ minutes from 
yesterday to this morning and another minute since this morning) 
despite the fact that NTP is running on all of them and they all 
have the exact same /etc/ntp.conf file (I compared the MD5 sums of 
that file on all the systems).  Here is the output of grep ntp 
/var/log messages on the system with the problem since I restarted 
the NTP daemon earlier today:


A slew of 5 min/24 hrs should be in the range of fixable.

May 20 11:35:38 hepdsw03 ntpd[31792]: frequency initialized 0.000 
PPM from /var/lib/ntp/drift


This is very suspect. Are there any SELinux or other log messages 
suggesting that ntpd isn't able to write to its drift file? Your local 
clock is definitely drifting, so a 0.000 value is bogus. It may 
indicate that there's a disconnect between ntpd and the filesystem.


I'd be interested in the output of ntpdc -c kerninfo; on most 
systems the 'pll frequency' value is a close match to the figure in 
the drift file.


May 20 11:38:55 hepdsw03 ntpd[31792]: synchronized to LOCAL(0), 
stratum 10

May 20 11:38:55 hepdsw03 ntpd[31792]: kernel time sync disabled 0001
May 20 11:39:59 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104, 
stratum 3


This is ungood. Sync-ing to local before your network time server 
means that your machine doesn't want to believe your server -- and you 
should see a kernel time sync enabled message once the machine has 
sync-ed with the time server.


You said the machines are identical. Could there be any variation in 
the BIOS revision level or its settings? Sometimes ACPI stuff can mess 
up ntp.


Also -- the log messages you provide have no step time server 
reference. Do you have a valid /etc/ntp/step-tickers file?


--
Paul Heinlein  [EMAIL PROTECTED]  http://www.madboa.com/
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Strange NTP problem

2008-05-20 Thread Filipe Brandenburger
On Tue, May 20, 2008 at 3:46 PM, Alfred von Campe [EMAIL PROTECTED] wrote:
 I have 30 identical Lenovo desktop systems running CentOS 5.1.  On one of
 those systems the clock is running slow (5+ minutes from yesterday to this
 morning and another minute since this morning) despite the fact that NTP is
 running on all of them and they all have the exact same /etc/ntp.conf file
 (I compared the MD5 sums of that file on all the systems).  Here is the
 output of grep ntp /var/log messages on the system with the problem since
 I restarted the NTP daemon earlier today:

Hi Alfred,

What is the output of ntpq -np ? You should have a line with a star
(*), otherwise it is not synchronizing. Start running NTP again, wait
for half an hour and issue that command to see what your output is. It
could be a problem related to filtering UDP traffic, either with
iptables on the machine itself, or some other router in the network
your machine is in.

HTH,
Filipe
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Strange NTP problem

2008-05-20 Thread Alfred von Campe

On May 20, 2008, at 16:56, Paul Heinlein wrote:


A slew of 5 min/24 hrs should be in the range of fixable.


If the NTP daemon was doing its job :-).

This is very suspect. Are there any SELinux or other log messages  
suggesting that ntpd isn't able to write to its drift file? Your  
local clock is definitely drifting, so a 0.000 value is bogus. It  
may indicate that there's a disconnect between ntpd and the  
filesystem.


I grep'ed for ntp in the /var/log/messages file and there were no  
other instances of ntp.  SELinux is disabled, and all my systems are  
built with Kickstart from the same template file.  I'm fairly  
confident that the other 29 systems are configured identically to  
each other.


I'd be interested in the output of ntpdc -c kerninfo; on most  
systems the 'pll frequency' value is a close match to the figure in  
the drift file.


# ntpdc -c kerninfo
pll offset:   0 s
pll frequency:0.000 ppm
maximum error:0.02724 s
estimated error:  0 s
status:   0001  pll
pll time constant:6
precision:1e-06 s
frequency tolerance:  512 ppm

On a system where it's working, it looks like this:

# ntpdc -c kerninfo
pll offset:   0.011131 s
pll frequency:81.440 ppm
maximum error:0.241978 s
estimated error:  0.005287 s
status:   0001  pll
pll time constant:4
precision:1e-06 s
frequency tolerance:  512 ppm

I re-started the NTP daemon around 5:30 this afternoon, and the clock  
is already off by 45 seconds.  Here are the latest entries from /var/ 
log/messages:


May 20 17:25:25 hepdsw03 ntpd[4225]: ntpd [EMAIL PROTECTED] Sat Nov 10  
12:33:50 UTC 2007 (1)

May 20 17:25:25 hepdsw03 ntpd[4226]: precision = 1.000 usec
May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface wildcard,  
0.0.0.0#123 Disabled
May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface  
wildcard, ::#123 Disabled
May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface lo, :: 
1#123 Enabled
May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface eth0,  
fe80::210:c6ff:feab:dd92#123 Enabled
May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface lo,  
127.0.0.1#123 Enabled
May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface eth0,  
10.66.42.109#123 Enabled

May 20 17:25:25 hepdsw03 ntpd[4226]: kernel time sync status 0040
May 20 17:25:25 hepdsw03 ntpd[4226]: frequency initialized 0.000 PPM  
from /var/lib/ntp/drift
May 20 17:28:37 hepdsw03 ntpd[4226]: synchronized to LOCAL(0),  
stratum 10

May 20 17:28:37 hepdsw03 ntpd[4226]: kernel time sync disabled 0001
May 20 17:30:47 hepdsw03 ntpd[4226]: synchronized to 10.101.32.104,  
stratum 3
May 20 17:35:04 hepdsw03 ntpd[4226]: synchronized to LOCAL(0),  
stratum 10
May 20 17:37:13 hepdsw03 ntpd[4226]: synchronized to 10.101.32.104,  
stratum 3
May 20 17:38:18 hepdsw03 ntpd[4226]: synchronized to LOCAL(0),  
stratum 10


In comparison, here are the entries from a system where NTP is  
working as expected:


May 20 14:36:51 balboa01 ntpd[3374]: ntpd [EMAIL PROTECTED] Sat Nov 10  
12:33:50 UTC 2007 (1)

May 20 14:36:51 balboa01 ntpd[3375]: precision = 1.000 usec
May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface wildcard,  
0.0.0.0#123 Disabled
May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface  
wildcard, ::#123 Disabled
May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface lo, :: 
1#123 Enabled
May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface eth0,  
fe80::21a:6bff:fe46:33d1#123 Enabled
May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface lo,  
127.0.0.1#123 Enabled
May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface eth0,  
10.66.43.100#123 Enabled

May 20 14:36:51 balboa01 ntpd[3375]: kernel time sync status 0040
May 20 14:40:06 balboa01 ntpd[3375]: synchronized to LOCAL(0),  
stratum 10
May 20 14:41:10 balboa01 ntpd[3375]: synchronized to 10.101.32.104,  
stratum 3

May 20 15:00:26 balboa01 ntpd[3375]: time reset -1.533233 s
May 20 15:00:26 balboa01 ntpd[3375]: kernel time sync enabled 0001
May 20 15:03:45 balboa01 ntpd[3375]: synchronized to LOCAL(0),  
stratum 10
May 20 15:05:20 balboa01 ntpd[3375]: synchronized to 10.101.32.104,  
stratum 3


You said the machines are identical. Could there be any variation  
in the BIOS revision level or its settings? Sometimes ACPI stuff  
can mess up ntp.


Yes, there may be some BIOS revision differences, but I'm pretty sure  
that at least one of the other 29 systems has an identical BIOS  
revision.


Also -- the log messages you provide have no step time server  
reference. Do you have a valid /etc/ntp/step-tickers file?


The files exists but is empty on all my systems.

Alfred

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Strange NTP problem

2008-05-20 Thread Filipe Brandenburger
On Tue, May 20, 2008 at 8:53 PM, Alfred von Campe [EMAIL PROTECTED] wrote:
 # ntpq -np
 remote   refid  st t when poll reach   delay   offset
  jitter
 ==
  10.101.32.104   67.128.71.65 3 u  689 1024  3770.659  54095.7
 4263.68
 *127.127.1.0 .LOCL.  10 l   28   64  3770.0000.000
 0.001

 The bad system is synch'ing with itself (which was apparent from the
 messages in /var/log/messages).  The question is why?

When the offset is too big, NTP will discard the server as a source of
synchronization. Try stoping ntpd, setting the date with ntpdate, then
starting ntpq -np again to see what happens.

I saw that the jitter appears to be really big too. That might be
caused by your server going too slow or too fast in a rate that NTP
couldn't keep up to. Once it happened to me, somehow (IRQ problems?)
one machine started going so fast that NTP couldn't keep up with it,
so I had to put periodic ntpdate's to keep it at the right time. I
rebooted it off-hours and the problem went away. Did you try rebooting
this machine and seeing if it fixes the problem?

One good side is the reach = 377, that means your network is good
and all NTP packets are being exchanged correctly. 377 is octal for
binary , which means the last 8 NTP packet exchanges were
successful.

Good luck!
Filipe
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Strange NTP problem

2008-05-20 Thread Paul Heinlein

On Tue, 20 May 2008, Alfred von Campe wrote:


Bad system:

# ntpq -np
   remote   refid  st t when poll reach   delay   offset  jitter
==
10.101.32.104   67.128.71.65 3 u  689 1024  3770.659  54095.7 4263.68


Yeah, with an offset of 54 seconds, it's a bad system. :-)

Try this (assuming 10.101.32.104 is your preferred local NTP server):

  service ntpd stop
  echo 10.101.32.104  /etc/ntp/step-tickers
  service ntpd start

Adding a server to the step-tickers file will tell the ntpd init 
script to do an ntpdate sync against that host before starting ntpd.


--
Paul Heinlein  [EMAIL PROTECTED]  www.madboa.com
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos