I've tried setting HZ down to 100, but it didn't help -- and it seems to have caused other timing problems, though I may revisit that at some point. I do believe the problem is lost interrupts, but I don't know how to be sure without hacking the firewire driver to determine how long it spends in its interrupt handler. And even if I found out, I'm not sure what I could do about it.
At the moment I'm trying to get it to simply step/adjust more often. I've adjusted minpoll/maxpoll to 4 (to get 16 secs) and I understand that the default step threshold is 128ms. However I still see offsets that are larger than that, and the 'steps' only occur every 20 or 30 minutes. How can I have it happen more often?

other comments below:

On Fri, 02 Dec 2005 15:13:31 -0600
[EMAIL PROTECTED] (Hal Murray) wrote:

> In article <[EMAIL PROTECTED]>,
> [EMAIL PROTECTED] (Bob Robison) writes:
> >I'm running a moderate number (around 50) dual-opterons that are
> >diskless booting a Linux 2.6.12 smp kernel and trying to synch with a
> >Symmetricon XLI-GPS stratum-1 NTP server on an isolated network.
> >
> >The problem I have is that when I run "ntpq -c peers" on a number of
> >these machines to check the status of the ntp synchronization, I see
> >offsets ranging over almost 1000 msecs. If I grep through
> >the /var/log/messages file, I see that there are often messages
> >around every 20 minutes like this:
> >
> >Dec 1 20:30:28 (none) ntpd[27203]: time reset 0.613771 s
> >Dec 1 20:30:28 (none) ntpd[27203]: synchronisation lost
> >Dec 1 20:50:45 (none) ntpd[27203]: time reset 0.931388 s
> >Dec 1 20:50:45 (none) ntpd[27203]: synchronisation lost
> >Dec 1 21:19:23 (none) ntpd[27203]: time reset 0.451491 s
> >Dec 1 21:19:23 (none) ntpd[27203]: synchronisation lost
> >Dec 1 21:36:24 (none) ntpd[27203]: time reset 0.391510 s
> >Dec 1 21:36:24 (none) ntpd[27203]: synchronisation lost
>
> Somebody else suggested lost interrupts. That would be pretty
> high on my list.
>
> What happens if you let one of the systems just sit there without
> doing anything? If it keeps good time your problem is
> probably caused by your normal workload.

I need to try this... haven't done that yet because of coordination with other things going on in the system. Will move this up on the priority list.
--->>>> Tried this before sending email: Still gets off, even with nothing happening on system.... more confused now....

> >
> > Probably the main issue is the CPU and I/O loading on these opteron
> > machines. They are each handling streaming data from a firewire
> > card (IEEE-1394a) and the CPUs stay fairly busy handling that data
> > -- though they are not pegged at 100% or anything.
>
> The issue is not so much if you are using all the CPU, but if the
> clock update interrupt routine is being locked out long enough to
> miss an interrupt because the second one comes in before the first
> one has been processed.

Yes.. I understand. However, I've seen references to 'too many lost ticks' error messages in the kernel logs, but I never see these. So, I'm not sure why not.

>
> If I was trying to understand this, I'd consider patching the
> firewire interrupt routine to turn on a printer port bit at the
> start and turn it off at the end, and then put a scope on that
> pin to see how long it was on. Most modern (digital) scopes
> have a trigger on X longer than Y mode that will show you the
> bad cases.
>
> Or do it all in software by grabbing the cycle counter and making
> a histogram.

I may have to do this... but holding off if I can.

bob
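
The software approach suggested above (read a counter at the start and end of the handler, then build a histogram of the durations) could be prototyped along these lines. This is only a minimal user-space sketch of the bookkeeping, not driver code: `time.perf_counter_ns()` stands in for the CPU cycle counter, and the names `DurationHistogram`, `bucket_index`, `timed`, and the `tick_ns` parameter are all made up for illustration. The 1 ms tick in the usage lines assumes HZ=1000; with HZ=100 it would be 10 ms.

```python
import time

NUM_BUCKETS = 32  # bucket 0 holds 0 ns; bucket i holds [2**(i-1), 2**i) ns


def bucket_index(duration_ns: int) -> int:
    """Map a duration to a power-of-two bucket, clamping to the last bucket."""
    return min(max(duration_ns, 0).bit_length(), NUM_BUCKETS - 1)


class DurationHistogram:
    """Counts handler durations and flags those longer than one timer tick."""

    def __init__(self, tick_ns: int) -> None:
        self.tick_ns = tick_ns      # assumed timer tick period (1e9 / HZ)
        self.counts = [0] * NUM_BUCKETS
        self.over_tick = 0          # durations long enough to risk a lost tick
        self.worst_ns = 0

    def record(self, duration_ns: int) -> None:
        self.counts[bucket_index(duration_ns)] += 1
        self.worst_ns = max(self.worst_ns, duration_ns)
        if duration_ns > self.tick_ns:
            self.over_tick += 1


def timed(hist: DurationHistogram, handler, *args):
    """Run handler(*args) and record how long it took."""
    start = time.perf_counter_ns()
    try:
        return handler(*args)
    finally:
        hist.record(time.perf_counter_ns() - start)


# Usage: wrap the routine under suspicion, then inspect the tail of the histogram.
hist = DurationHistogram(tick_ns=1_000_000)   # assumes HZ=1000
timed(hist, sum, range(1000))
print(hist.over_tick, hist.worst_ns)
```

The interesting number is `over_tick`: any handler run that blocks the clock interrupt for longer than one tick period is a candidate for a lost tick, which is the same "X longer than Y" condition the scope trigger would catch.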
