Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64
[I removed [EMAIL PROTECTED] from CC to avoid bounces due to not being subscribed]

On Sat, May 3 2008, John Hasler wrote:
> Please test this patch to ntp_core.c on a pristine upstream 1.23:
>
> --- ../pristine/chrony-1.23/ntp_core.c 2008-05-02 22:14:21.0 -0500

It seems that these timevals do not always end up in offset_time. A

  $ grep offset_time *.c

reveals that offset_time is set in exactly one place (all other occurrences of offset_time are either reads or modifications through UTI_* functions): line [EMAIL PROTECTED], within SST_DoNewRegression.

Inspection of the surrounding code shows that the assignment depends on the condition 'regression_ok'. There is no assignment in the else block at line 489. I confirmed with the debugger that this spot is reached before any reads/modifications take place, so this would be one place to put a fix.

I have no idea what would be a good replacement value. Looking at the places that call SST_DoNewRegression and others, it seems possible that enough samples can be dropped that regression_ok becomes false after it has been true before. In that case inst->sample_times[inst->n_samples - 1] might be better than {0, 0} if n_samples > 0.

The canonical place to initialize offset_time = {0, 0} would be SST_CreateInstance.

Best regards,
Peter Pöschl

--
To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64
On Sat, May 3 2008, John Hasler wrote:
> Please test this patch to ntp_core.c on a pristine upstream 1.23:

I'm not quite sure what you mean by 'pristine upstream'. I applied the patch to the 1.23-3 sources from Debian unstable.

> --- ../pristine/chrony-1.23/ntp_core.c 2008-05-02 22:14:21.0 -0500
> +++ ntp_core.c 2008-05-02 22:14:56.0 -0500
> @@ -320,6 +320,8 @@
>    result->local_rx.tv_sec = 0;
>    result->local_rx.tv_usec = 0;
> +  result->local_tx.tv_sec = 0;
> +  result->local_tx.tv_usec = 0;
>    return result;

The watchpoint with sources[0]->stats.offset_time.tv_sec > 0x now triggers at

  main () at main.c:304
  SCH_MainLoop () at sched.c:470
  read_from_socket () at ntp_io.c:215
  NSR_ProcessReceive () at ntp_sources.c:258
  receive_packet () at ntp_core.c:1064
  SRC_SelectSource () sources.c:695
  REF_SetReference () at reference.c:408
  LCL_AccumulateOffset () at local.c:446
  slew_sources () at sources.c:763
  SST_SlewSamples () at sourcestats.c:698
  UTI_NormaliseTimeval () at util.c:93

I had to apply this patch

  --- sources.c.orig Thu May 01 10:38:40 2008 +0200
  +++ sources.c Sun May 04 21:27:10 2008 +0200
  @@ -136,9 +136,11 @@
     max_n_sources = 0;
     selected_source_index = INVALID_SOURCE;
     initialised = 1;
  +  static volatile int dbg_is_connected = 0;
     LCL_AddParameterChangeHandler(slew_sources, NULL);
  +  while (!dbg_is_connected) ;
     return;
   }

to reproduce the bug. It disappears when I start 'chronyd -d' from within the debugger.

Best regards,
Peter Pöschl
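The instrumentation patch above parks the process until a debugger is attached and the flag is flipped by hand. A slightly gentler variant of the same trick, sleeping between polls instead of spinning, might look like this sketch. The function name wait_for_debugger and its poll interval parameter are hypothetical; only the flag name dbg_is_connected is taken from the patch.

```c
#include <assert.h>
#include <unistd.h>

/* Flag to be set from the debugger; volatile so the compiler re-reads
   it on every pass through the loop. */
volatile int dbg_is_connected = 0;

/* Block until the flag becomes non-zero, polling every poll_seconds.
   Returns the number of polls that were needed. */
unsigned int wait_for_debugger(unsigned int poll_seconds)
{
  unsigned int polls = 0;

  assert(poll_seconds > 0);

  while (!dbg_is_connected) {
    sleep(poll_seconds);
    polls++;
  }
  return polls;
}
```

With the process waiting in this loop, one would attach with 'gdb -p PID', run 'set var dbg_is_connected = 1' and then 'continue'.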
Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64
Please test this patch to ntp_core.c on a pristine upstream 1.23:

  --- ../pristine/chrony-1.23/ntp_core.c 2008-05-02 22:14:21.0 -0500
  +++ ntp_core.c 2008-05-02 22:14:56.0 -0500
  @@ -320,6 +320,8 @@
     result->local_rx.tv_sec = 0;
     result->local_rx.tv_usec = 0;
  +  result->local_tx.tv_sec = 0;
  +  result->local_tx.tv_usec = 0;
     return result;

--
John Hasler [EMAIL PROTECTED]
Elmwood, WI USA
Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64
> BTW please cc: [EMAIL PROTECTED] so that this discussion is on record.

Oh, sorry, I thought that was a bad idea after the bug was closed.

The ridiculously high values come from the line

  sources = MallocArray(struct SRC_Instance_Record *, max_n_sources);

in SRC_CreateNewInstance in sources.c. I put a watchpoint on sources[0]->stats.offset_time.tv_sec with condition '> 0x'. It never triggered when the value was, for the first time, used to calculate the result of UTI_DiffTimevalsToDouble, at the call tree

  main () at main.c:304
  SCH_MainLoop () at sched.c:470
  read_from_socket () at ntp_io.c:215
  NSR_ProcessReceive () at ntp_sources.c:258
  receive_packet () at ntp_core.c:1063
  SRC_SelectSource () sources.c:693
  REF_SetReference () at reference.c:408
  LCL_AccumulateOffset () at local.c:446
  slew_sources () at sources.c:761
      sources is defined static in this file.
  SST_SlewSamples () at sourcestats.c:698
      parameter inst = sources[1]->stats in caller
  UTI_DiffTimevalsToDouble () at util.c:161
      parameter b = inst->offset_time of caller

(line numbers apply to 1.23-2 sources plus the instrumentation patches I sent you off-list).

This missing initialization was obviously harmless in the 32-bit version, but you should ask upstream about the implications of a huge random starting value. It might be that with your divide/remainder patch the program won't loop till the sun goes out, but will nonetheless take an eternity until the system time converges to UTC.
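The underlying hazard is that memory obtained from malloc() has indeterminate contents, so a field that is never assigned may hold any value. A minimal sketch of initialising such a record at creation time follows. The struct instance_sketch and both function names are hypothetical; chrony's own record layout and its MallocArray macro are not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/time.h>

/* Hypothetical stand-in for a per-source record holding a timeval. */
struct instance_sketch {
  struct timeval offset_time;
  double estimated_frequency;
};

/* Give every field a defined starting value. */
void init_instance(struct instance_sketch *inst)
{
  assert(inst != NULL);
  inst->offset_time.tv_sec = 0;
  inst->offset_time.tv_usec = 0;
  inst->estimated_frequency = 0.0;
}

/* Allocate a record and initialise it before anyone can read it.
   Returns NULL if the allocation fails. */
struct instance_sketch *create_instance(void)
{
  struct instance_sketch *inst = malloc(sizeof *inst);

  if (inst == NULL)
    return NULL;
  init_instance(inst);
  return inst;
}
```

Explicit assignment is used rather than relying on zeroed memory so that the starting values are visible in the source and do not depend on how the allocator is wrapped.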
Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64
> The ridiculously high values come from the line
>   sources = MallocArray(struct SRC_Instance_Record *, max_n_sources);
> in SRC_CreateNewInstance in sources.c.

Thank you. I changed the limits in your patch to catch values that would be unreasonable on my 32-bit system, but I couldn't get the bug to trigger after I started using gdb (I did see it a few times while I was testing the patch).

> This missing initialization was obviously harmless in the 32-bit version,
> but you should ask upstream about the implications of a huge random starting
> value. It might be that with your divide/remainder patch the program
> won't loop till the sun goes out, but will nonetheless take an eternity until
> the system time converges to UTC.

Occasionally I see

  Residual freq : -32768.000 ppm

in the chronyc Tracking display at startup (it goes away after a few minutes). It occurred to me yesterday that it might be associated with this bug. Now I'm fairly sure it's the system time converging from such a random starting value.

--
John Hasler [EMAIL PROTECTED]
Elmwood, WI USA
Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64
Peter Pöschl writes:
> When I single-step through main() I lose the interesting process in
> LOG_GoDaemon(). Is there a way to tell the debugger that I want to trace
> the child process after the fork?

Starting chrony with the -d option will prevent the fork. Chronyd will then remain attached to the terminal and send all messages there.

BTW please cc: [EMAIL PROTECTED] so that this discussion is on record.

--
John Hasler [EMAIL PROTECTED]
Elmwood, WI USA
Bug#474294: RFH: Chrony goes into endless loop on x86_64
On Thu, Apr 24, 2008 at 02:21:39PM -0500, John Hasler wrote:
> Gabor writes:
> > That will be difficult since sometimes the bug does not hit for weeks
> > and then suddenly chrony starts to loop all the time.
>
> Are you saying that you have seen the bug?

Yes, see bug #447011. In fact, #474294 is a duplicate of #447011...

> > So I'd say go ahead and upload the new version to unstable, and if there
> > are no new occurrences of the bug for 1-2 months then you can close it.
>
> Which would probably result in Chrony being removed from Lenny.

Well, then someone should start debugging it. The gdb trace sent by Goswin is quite promising. If UTI_NormaliseTimeval() is called with x->tv_usec being a very large value (say LONG_MAX), that would clearly explain the hang, and it would also explain why i386 does not seem to be affected even if it is just as buggy as amd64: on i386, the while {} loops execute at most 2147 times, which is basically unnoticeable, while on amd64 that can be 2^32 times more.

So, IMHO turning the two while {} loops in UTI_NormaliseTimeval() into divide/remainder operations should fix the hang. However, it still needs investigation _why_ UTI_NormaliseTimeval() is being called with such a bad time value, as it may be the result of a more severe bug like memory corruption.

Maybe upstream could help here.

Gabor

--
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences, Laboratory of Parallel and Distributed Systems
Address : H-1132 Budapest Victor Hugo u. 18-22. Hungary
Phone/Fax : +36 1 329-78-64 (secretary)
W3: http://www.lpds.sztaki.hu
Bug#474294: RFH: Chrony goes into endless loop on x86_64
> Well, then someone should start debugging it.

I'm trying, but lacking 64-bit hardware I am reduced to debugging by inspection.

> If UTI_NormaliseTimeval() is called with x->tv_usec being a very large
> value (say LONG_MAX), that would clearly explain the hang, and it would
> also explain why i386 does not seem to be affected even if it is just as
> buggy as amd64: on i386, the while {} loops execute at most 2147 times,
> which is basically unnoticeable, while on amd64 that can be 2^32 times more.

Thank you: I think that's it. I looked at the loop right away, but I couldn't see how it could be getting stuck. It isn't: it's just looping until the sun goes out.

> So, IMHO turning the two while {} loops in UTI_NormaliseTimeval() into
> divide/remainder operations should fix the hang. However, it still needs
> investigation _why_ UTI_NormaliseTimeval() is being called with such a bad
> time value, as it may be the result of a more severe bug like memory
> corruption.

I will make those changes and also look some more for the source of the large value (probably yet another LP64 bug).

> Maybe upstream could help here.

I already forwarded the bug of course, but upstream is not very active.

--
John Hasler [EMAIL PROTECTED]
Elmwood, WI USA
Bug#474294: RFH: Chrony goes into endless loop on x86_64
Gabor writes:
> That will be difficult since sometimes the bug does not hit for weeks and
> then suddenly chrony starts to loop all the time.

Are you saying that you have seen the bug?

> So I'd say go ahead and upload the new version to unstable, and if there
> are no new occurrences of the bug for 1-2 months then you can close it.

Which would probably result in Chrony being removed from Lenny.

--
John Hasler [EMAIL PROTECTED]
Elmwood, WI USA
Bug#474294: RFH: Chrony goes into endless loop on x86_64
From: John Hasler [EMAIL PROTECTED]
Subject: RFH: Chrony goes into endless loop on x86_64
To: [EMAIL PROTECTED]
Date: Tue, 22 Apr 2008 13:50:43 -0500
Organization: Dancing Horse Hill

See bug #474294. If you have an x86_64 system you can help by

a) installing chrony-1.21 from Stable or Unstable and confirming the bug, or
b) installing chrony-1.23 from Experimental and determining if the new upstream release has fixed it.

You could also look at the bug report and the source and help me find the problem. I've had no response from upstream.

--
John Hasler
Bug#474294: RFH: Chrony goes into endless loop on x86_64
John Hasler wrote:
> See bug #474294. If you have an x86_64 system you can help by a) installing
> chrony-1.21 from Stable or Unstable and confirming the bug or b) installing
> chrony-1.23 from Experimental and determining if the new upstream release
> has fixed it. You could also look at the bug report and the source and
> help me find the problem. I've had no response from upstream.

I have an amd64 Debian lenny/sid system.
Have installed chrony 1.21z - bug doesn't appear. Package works.
Have installed chrony 1.23 - bug doesn't appear either. Package works.
Saw no endless loops.

--
Eugene V. Lyubimkin aka JackYF, Ukrainian C++ developer.
Bug#474294: RFH: Chrony goes into endless loop on x86_64
John Hasler wrote:
> See bug #474294. If you have an x86_64 system you can help by a) installing
> chrony-1.21 from Stable or Unstable and confirming the bug

I have been running chrony on my amd64 box without ever seeing any problems with kernels from 2.6.18 to 2.6.25. I use two local (home network) time servers as base.

Feel free to contact me if you'd like additional info.

Cheers,
FJP
Bug#474294: RFH: Chrony goes into endless loop on x86_64
> I have been running chrony on my amd64 box without ever seeing any problems
> with kernels from 2.6.18 to 2.6.25. I use two local (home network) time
> servers as base.

Thank you. That is very useful information.

--
John Hasler [EMAIL PROTECTED]
Elmwood, WI USA
Bug#474294: RFH: Chrony goes into endless loop on x86_64
Goswin von Brederlow wrote:
> Eugene V. Lyubimkin [EMAIL PROTECTED] writes:
> > John Hasler wrote:
> > > See bug #474294. If you have an x86_64 system you can help by
> > > a) installing chrony-1.21 from Stable or Unstable and confirming the
> > > bug or b) installing chrony-1.23 from Experimental and determining if
> > > the new upstream release has fixed it. You could also look at the bug
> > > report and the source and help me find the problem. I've had no
> > > response from upstream.
> >
> > I have an amd64 Debian lenny/sid system.
> > Have installed chrony 1.21z - bug doesn't appear. Package works.
> > Have installed chrony 1.23 - bug doesn't appear either. Package works.
> > Saw no endless loops.
>
> How often did you start it? It doesn't always appear. I have no idea what
> triggered it here but some days it does, other days I can't reproduce it.
>
> Regards,
> Goswin

I started/stopped it ~5 times every 2-10 minutes.

--
Eugene V. Lyubimkin aka JackYF, Ukrainian C++ developer.
Bug#474294: RFH: Chrony goes into endless loop on x86_64
Eugene V. Lyubimkin [EMAIL PROTECTED] writes:
> John Hasler wrote:
> > See bug #474294. If you have an x86_64 system you can help by
> > a) installing chrony-1.21 from Stable or Unstable and confirming the bug
> > or b) installing chrony-1.23 from Experimental and determining if the new
> > upstream release has fixed it. You could also look at the bug report and
> > the source and help me find the problem. I've had no response from
> > upstream.
>
> I have an amd64 Debian lenny/sid system.
> Have installed chrony 1.21z - bug doesn't appear. Package works.
> Have installed chrony 1.23 - bug doesn't appear either. Package works.
> Saw no endless loops.

How often did you start it? It doesn't always appear. I have no idea what triggered it here but some days it does, other days I can't reproduce it.

Regards,
Goswin
Bug#474294: RFH: Chrony goes into endless loop on x86_64
Frans Pop writes:
> I have been running chrony on my amd64 box without ever seeing any problems
> with kernels from 2.6.18 to 2.6.25.

Could you try stopping and starting it a few times? The bug manifests only intermittently and only on startup.

--
John Hasler [EMAIL PROTECTED]
Elmwood, WI USA
Bug#474294: RFH: Chrony goes into endless loop on x86_64
On Wednesday 23 April 2008, John Hasler wrote:
> Frans Pop writes:
> > I have been running chrony on my amd64 box without ever seeing any
> > problems with kernels from 2.6.18 to 2.6.25.
>
> Could you try stopping and starting it a few times? The bug manifests only
> intermittently and only on startup.

I use it on all my machines, which includes 3 Pentium boxes, a sparc64 box, a hppa box and an arm system. I have never seen this on any of them.

The amd64 system I use it on is my desktop, so it's booted daily and, when I'm doing kernel bisects, multiple times per day. I've never seen anything like it (and as I have a CPU monitor on my desktop, I would have).

I suspect that the problem may be hardware related. Maybe some interaction with the RTC, which would also explain why it only manifests on startup as that is when chrony reads the RTC. Another option could be that it's somehow related to timekeeping in the kernel (but probably still influenced by the exact hardware). Those are only fairly uneducated guesses though.

Cheers,
FJP