Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-05-18 Thread Peter Pöschl
[I removed [EMAIL PROTECTED] from CC to avoid bounces due to not being 
subscribed]

On Sat May 3 2008, John Hasler wrote:
> Please test this patch to ntp_core.c on a pristine upstream 1.23:
>
> --- ../pristine/chrony-1.23/ntp_core.c  2008-05-02 22:14:21.0 -0500

It seems that these timevals do not always end up in offset_time. A

  $ grep offset_time *.c

reveals that offset_time is set (all other occurrences of offset_time are 
either reads or modifications through UTI_* functions) in exactly one place:

  line [EMAIL PROTECTED], within SST_DoNewRegression

Inspection of the surrounding code shows that the assignment depends on 
condition 'regression_ok'.

There is no assignment in the else block at line 489. I confirmed with the 
debugger that this spot is reached before any reads/modifications take place, 
so this would be one place to put a fix. I have no idea what would be a good 
replacement value.
Looking at the places calling SST_DoNewRegression and others, it seems possible 
that enough samples can be dropped that regression_ok becomes false after it 
has been true before. In that case inst->sample_times[inst->n_samples - 1] 
might be better than {0, 0} if n_samples > 0.

The canonical place to initialize offset_time = {0, 0} would be 
SST_CreateInstance.
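
To make the two suggestions concrete, here is a minimal sketch. It is not chrony's code: the struct, the function names and the fixed array size are all made up for illustration, and only the field names offset_time, sample_times and n_samples and the flag regression_ok are taken from the discussion above.

```c
#include <sys/time.h>

#define MAX_SAMPLES_SKETCH 64   /* hypothetical capacity, not chrony's */

/* Hypothetical stand-in for the per-source statistics record. */
struct stats_sketch {
    int n_samples;
    struct timeval sample_times[MAX_SAMPLES_SKETCH];
    struct timeval offset_time;
};

/* Suggestion 1: give offset_time a defined value at creation time. */
static void stats_init_sketch(struct stats_sketch *inst)
{
    inst->n_samples = 0;
    inst->offset_time.tv_sec = 0;
    inst->offset_time.tv_usec = 0;
}

/* Suggestion 2: also assign offset_time when the regression fails. */
static void stats_set_offset_time_sketch(struct stats_sketch *inst,
                                         int regression_ok,
                                         struct timeval regression_time)
{
    if (regression_ok) {
        inst->offset_time = regression_time;
    } else if (inst->n_samples > 0) {
        /* Fall back to the time of the most recent sample. */
        inst->offset_time = inst->sample_times[inst->n_samples - 1];
    } else {
        /* No samples at all: fall back to {0, 0}. */
        inst->offset_time.tv_sec = 0;
        inst->offset_time.tv_usec = 0;
    }
}
```

Whether the last sample time is really a sensible value for the failed-regression case is a question for upstream; the sketch only shows that every path leaves offset_time defined.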


Best regards,

  Peter Pöschl





--
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-05-04 Thread Peter Pöschl
On Sat May 3 2008, John Hasler wrote:
> Please test this patch to ntp_core.c on a pristine upstream 1.23:
I'm not quite sure what you mean with 'pristine upstream'.
I applied the patch to the 1.23-3 sources from Debian unstable.


> --- ../pristine/chrony-1.23/ntp_core.c  2008-05-02 22:14:21.0 -0500
> +++ ntp_core.c  2008-05-02 22:14:56.0 -0500
> @@ -320,6 +320,8 @@
>
>    result->local_rx.tv_sec = 0;
>    result->local_rx.tv_usec = 0;
> +  result->local_tx.tv_sec = 0;
> +  result->local_tx.tv_usec = 0;
>
>    return result;

The watchpoint with sources[0]->stats.offset_time.tv_sec > 0x
now triggers at
  main () at main.c:304
  SCH_MainLoop () at sched.c:470
  read_from_socket () at ntp_io.c:215
  NSR_ProcessReceive () at ntp_sources.c:258
  receive_packet () at ntp_core.c:1064
  SRC_SelectSource () sources.c:695
  REF_SetReference () at reference.c:408
  LCL_AccumulateOffset () at local.c:446
  slew_sources () at sources.c:763
  SST_SlewSamples () at sourcestats.c:698
  UTI_NormaliseTimeval () at util.c:93


I had to apply this patch 

--- sources.c.orig Thu May 01 10:38:40 2008 +0200
+++ sources.c Sun May 04 21:27:10 2008 +0200
@@ -136,9 +136,11 @@
   max_n_sources = 0;
   selected_source_index = INVALID_SOURCE;
   initialised = 1;
+  static volatile int dbg_is_connected = 0;

   LCL_AddParameterChangeHandler(slew_sources, NULL);

+  while (!dbg_is_connected) ;
   return;
 }


to reproduce the bug. It disappears when I start 'chronyd -d' from within the 
debugger.
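
For anyone repeating this, the same wait-for-debugger idea can be sketched as a small helper. This is a hypothetical variant, not the patch above: the function name and the max_polls bound are invented here so that the loop gives up instead of spinning forever. The flag is meant to be set from an attached gdb session with "set var dbg_is_connected = 1".

```c
/* Set to 1 from the debugger once it is attached. It is volatile so
 * that the compiler re-reads it on every pass through the loop. */
static volatile int dbg_is_connected = 0;

/* Hypothetical helper: poll the flag up to max_polls times.
 * Returns 1 if the flag was seen set, 0 if the bound was reached. */
static int wait_for_debugger_sketch(unsigned long max_polls)
{
    unsigned long i;

    for (i = 0; i < max_polls; i++) {
        if (dbg_is_connected)
            return 1;
    }
    return dbg_is_connected != 0;
}
```

The unbounded form in the patch is simpler and is fine for a one-off debugging build; the bounded form only avoids leaving a daemon stuck if nobody attaches.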


Best regards,

  Peter Pöschl






Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-05-02 Thread John Hasler
Please test this patch to ntp_core.c on a pristine upstream 1.23:


--- ../pristine/chrony-1.23/ntp_core.c  2008-05-02 22:14:21.0 -0500
+++ ntp_core.c  2008-05-02 22:14:56.0 -0500
@@ -320,6 +320,8 @@
 
   result->local_rx.tv_sec = 0;
   result->local_rx.tv_usec = 0;
+  result->local_tx.tv_sec = 0;
+  result->local_tx.tv_usec = 0;
 
   return result;
 


-- 
John Hasler 
[EMAIL PROTECTED]
Elmwood, WI USA






Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-29 Thread Peter Pöschl
> BTW please cc: [EMAIL PROTECTED] so that this discussion is on record.
Oh, sorry, I thought that was a bad idea after the bug was closed.


The ridiculously high values come from the line

  sources = MallocArray(struct SRC_Instance_Record *, max_n_sources);

in SRC_CreateNewInstance in sources.c.

I put a watchpoint on
  sources[0]->stats.offset_time.tv_sec
with condition '> 0x'.
It never triggered when at the calltree
  main () at main.c:304
  SCH_MainLoop () at sched.c:470
  read_from_socket () at ntp_io.c:215
  NSR_ProcessReceive () at ntp_sources.c:258
  receive_packet () at ntp_core.c:1063
  SRC_SelectSource () sources.c:693
  REF_SetReference () at reference.c:408
  LCL_AccumulateOffset () at local.c:446
  slew_sources () at sources.c:761
   sources is defined static in this file.
  SST_SlewSamples () at sourcestats.c:698
   parameter inst = sources[1]->stats in caller
  UTI_DiffTimevalsToDouble () at util.c:161
   parameter b = inst->offset_time of caller

the value was, for the first time, used to calculate the result of 
UTI_DiffTimevalsToDouble (line numbers apply to 1.23-2 sources plus the 
instrumentation patches I sent you off-list).

This missing initialization was obviously harmless in the 32-bit version, but 
you should ask upstream for the implications of a huge random starting value.
It might be that with your divider/remainder patch the program won't loop till 
the sun goes out, but nonetheless take an eternity until the system time 
converges to UTC.










Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-29 Thread John Hasler
> The ridiculously high values come from the line
>
>   sources = MallocArray(struct SRC_Instance_Record *, max_n_sources);
>
> in SRC_CreateNewInstance in sources.c.

Thank you.  I changed the limits in your patch to catch values that would
be unreasonable on my 32 bit system but I couldn't get the bug to trigger
after I started using gdb (I did see it a few times while I was testing the
patch).

> This missing initialization was obviously harmless in the 32-bit version,
> but you should ask upstream for the implications of a huge random
> starting value.  It might be that with your divider/remainder patch the
> program won't loop till the sun goes out, but nonetheless take an eternity
> until the system time converges to UTC.

Occasionally I see Residual freq : -32768.000 ppm in the chronyc
Tracking display at startup (it goes away after a few minutes).  It
occurred to me yesterday that it might be associated with this bug.  Now
I'm fairly sure it's the system time converging from such a random starting
value.
-- 
John Hasler 
[EMAIL PROTECTED]
Elmwood, WI USA






Bug#474294: Moreinfo: Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-27 Thread John Hasler
Peter Pöschl writes:
> When I single-step through main() I lose the interesting process in
> LOG_GoDaemon(). Is there a way to tell the debugger that I want to trace
> the child process after the fork?

Starting chrony with the -d option will prevent the fork.  Chronyd will
then remain attached to the terminal and send all messages there.

BTW please cc: [EMAIL PROTECTED] so that this discussion is on record.
-- 
John Hasler 
[EMAIL PROTECTED]
Elmwood, WI USA






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-25 Thread Gabor Gombas
On Thu, Apr 24, 2008 at 02:21:39PM -0500, John Hasler wrote:
> Gabor writes:
> > That will be difficult since sometimes the bug does not hit for weeks and
> > then suddenly chrony starts to loop all the time.
>
> Are you saying that you have seen the bug?

Yes, see bug #447011. In fact, #474294 is a duplicate of #447011...

> > So I'd say go ahead and upload the new version to unstable, and if there
> > are no new occurrences of the bug for 1-2 months then you can close it.
>
> Which would probably result in Chrony being removed from Lenny.

Well, then someone should start debugging it. The gdb trace sent by
Goswin is quite promising. If UTI_NormaliseTimeval() is called with
x->tv_usec being a very large value (say LONG_MAX), that would clearly
explain the hang, and it would also explain why i386 does not seem to be
affected even if it is just as buggy as amd64: on i386, the while {}
loops execute at most 2147 times which is basically unnoticeable, while
on amd64 that can be 2^32 times more.
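
The arithmetic can be checked with a few lines of C. This is only a sketch: it assumes each pass of the loop moves one second (1000000 microseconds) out of tv_usec, which matches the 2147 figure, and the helper name is made up for illustration.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000

/* Number of passes a loop that subtracts one second per pass needs
 * before tv_usec drops below one second (assumed loop shape). */
static int64_t normalise_loop_passes(int64_t tv_usec)
{
    return tv_usec < USEC_PER_SEC ? 0 : tv_usec / USEC_PER_SEC;
}
```

Feeding it INT32_MAX gives 2147 passes and INT64_MAX gives 9223372036854 passes, which is roughly 2^32 times as many.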

So, IMHO turning the two while {} loops in UTI_NormaliseTimeval() into
divide/remainder operations should fix the hang. However, it still needs
investigation _why_ UTI_NormaliseTimeval() is being called with such a
bad time value, as it may be a result of a more severe bug like memory
corruption. Maybe upstream could help here.
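
A divide/remainder version could look roughly like the sketch below. It is an illustration under assumptions, not the patch that went into the package: the function name is hypothetical, and it relies on C99 integer division truncating toward zero, so a negative tv_usec needs one final correction step.

```c
#include <sys/time.h>

#define USEC_PER_SEC 1000000L

/* Hypothetical loop-free normalisation: afterwards
 * 0 <= tv_usec < 1000000 and the represented time is unchanged. */
static void normalise_timeval_sketch(struct timeval *x)
{
    /* Move whole seconds out of tv_usec in a single step. */
    x->tv_sec += x->tv_usec / USEC_PER_SEC;
    x->tv_usec %= USEC_PER_SEC;

    /* The remainder keeps the sign of tv_usec, so borrow one second
     * if it came out negative. */
    if (x->tv_usec < 0) {
        x->tv_usec += USEC_PER_SEC;
        x->tv_sec -= 1;
    }
}
```

This runs in constant time whatever garbage ends up in tv_usec, but it only hides the symptom; a huge bogus value still produces a bogus (if normalised) time.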

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences,
 Laboratory of Parallel and Distributed Systems
 Address   : H-1132 Budapest Victor Hugo u. 18-22. Hungary
 Phone/Fax : +36 1 329-78-64 (secretary)
 W3: http://www.lpds.sztaki.hu
 -






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-25 Thread John Hasler
> Well, then someone should start debugging it.

I'm trying, but lacking 64-bit hardware I am reduced to debugging by
inspection.

> If UTI_NormaliseTimeval() is called with x->tv_usec being a very large
> value (say LONG_MAX), that would clearly explain the hang, and it would
> also explain why i386 does not seem to be affected even if it is just as
> buggy as amd64: on i386, the while {} loops execute at most 2147 times
> which is basically unnoticeable, while on amd64 that can be 2^32 times
> more.

Thank you: I think that's it.  I looked at the loop right away, but I
couldn't see how it could be getting stuck.  It isn't: it's just looping
until the sun goes out.

> So, IMHO turning the two while {} loops in UTI_NormaliseTimeval() into
> divide/remainder operations should fix the hang. However, it still needs
> investigation _why_ UTI_NormaliseTimeval() is being called with such a
> bad time value, as it may be a result of a more severe bug like memory
> corruption.

I will make those changes and also look some more for the source of the
large value (probably yet another LP64 bug).

> Maybe upstream could help here.

I already forwarded the bug of course, but upstream is not very active.
-- 
John Hasler 
[EMAIL PROTECTED]
Elmwood, WI USA






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-24 Thread John Hasler
Gabor writes:
> That will be difficult since sometimes the bug does not hit for weeks and
> then suddenly chrony starts to loop all the time.

Are you saying that you have seen the bug?

> So I'd say go ahead and upload the new version to unstable, and if there
> are no new occurrences of the bug for 1-2 months then you can close it.

Which would probably result in Chrony being removed from Lenny.
-- 
John Hasler 
[EMAIL PROTECTED]
Elmwood, WI USA






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-22 Thread jhasler
From: John Hasler [EMAIL PROTECTED]
Subject: RFH: Chrony goes into endless loop on x86_64
To: [EMAIL PROTECTED] 
Date: Tue, 22 Apr 2008 13:50:43 -0500
Organization: Dancing Horse Hill

See bug #474294.

If you have an x86_64 system you can help by

 a) installing chrony-1.21 from Stable or Unstable and confirming the bug

or

 b) installing chrony-1.23 from Experimental and determining if the new
upstream release has fixed it.

You could also look at the bug report and the source and help me find the
problem.  I've had no response from upstream.
-- 
John Hasler






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-22 Thread Eugene V. Lyubimkin

John Hasler wrote:
> See bug #474294.
>
> If you have an x86_64 system you can help by
>
>  a) installing chrony-1.21 from Stable or Unstable and confirming the bug
>
> or
>
>  b) installing chrony-1.23 from Experimental and determining if the new
> upstream release has fixed it.
>
> You could also look at the bug report and the source and help me find the
> problem.  I've had no response from upstream.
I have an amd64 Debian lenny/sid system. I installed chrony 1.21z: the bug
doesn't appear and the package works. I installed chrony 1.23: the bug
doesn't appear either and the package works. I saw no endless loops.

--
Eugene V. Lyubimkin aka JackYF, Ukrainian C++ developer.






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-22 Thread Frans Pop
John Hasler wrote:
> See bug #474294.
>
> If you have an x86_64 system you can help by
>  a) installing chrony-1.21 from Stable or Unstable and confirming the bug

I have been running chrony on my amd64 box without ever seeing any problems 
with kernels from 2.6.18 to 2.6.25. I use two local (home network) time 
servers as base.

Feel free to contact me if you'd like additional info.

Cheers,
FJP




Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-22 Thread John Hasler
> I have been running chrony on my amd64 box without ever seeing any
> problems with kernels from 2.6.18 to 2.6.25. I use two local (home
> network) time servers as base.

Thank you.  That is very useful information.
-- 
John Hasler 
[EMAIL PROTECTED]
Elmwood, WI USA






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-22 Thread Eugene V. Lyubimkin

Goswin von Brederlow wrote:
> Eugene V. Lyubimkin [EMAIL PROTECTED] writes:
>
>> John Hasler wrote:
>>> See bug #474294.
>>>
>>> If you have an x86_64 system you can help by
>>>
>>>  a) installing chrony-1.21 from Stable or Unstable and confirming the bug
>>>
>>> or
>>>
>>>  b) installing chrony-1.23 from Experimental and determining if the new
>>> upstream release has fixed it.
>>>
>>> You could also look at the bug report and the source and help me find the
>>> problem.  I've had no response from upstream.
>> I have an amd64 Debian lenny/sid system. I installed chrony 1.21z: the bug
>> doesn't appear and the package works. I installed chrony 1.23: the bug
>> doesn't appear either and the package works. I saw no endless loops.
>
> How often did you start it? It doesn't always appear. I have no idea
> what triggered it here but some days it does, other days I can't
> reproduce it.
>
> Regards,
> Goswin
>
I started/stopped it ~5 times every 2-10 minutes.

--
Eugene V. Lyubimkin aka JackYF, Ukrainian C++ developer.






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-22 Thread Goswin von Brederlow
Eugene V. Lyubimkin [EMAIL PROTECTED] writes:

> John Hasler wrote:
>> See bug #474294.
>>
>> If you have an x86_64 system you can help by
>>
>>  a) installing chrony-1.21 from Stable or Unstable and confirming the bug
>>
>> or
>>
>>  b) installing chrony-1.23 from Experimental and determining if the new
>> upstream release has fixed it.
>>
>> You could also look at the bug report and the source and help me find the
>> problem.  I've had no response from upstream.
> I have an amd64 Debian lenny/sid system. I installed chrony 1.21z: the bug
> doesn't appear and the package works. I installed chrony 1.23: the bug
> doesn't appear either and the package works. I saw no endless loops.

How often did you start it? It doesn't always appear. I have no idea
what triggered it here but some days it does, other days I can't
reproduce it.

Regards,
Goswin






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-22 Thread John Hasler
Frans Pop writes:
> I have been running chrony on my amd64 box without ever seeing any
> problems with kernels from 2.6.18 to 2.6.25.

Could you try stopping and starting it a few times?  The bug manifests only
intermittently and only on startup.
-- 
John Hasler 
[EMAIL PROTECTED]
Elmwood, WI USA






Bug#474294: RFH: Chrony goes into endless loop on x86_64

2008-04-22 Thread Frans Pop
On Wednesday 23 April 2008, John Hasler wrote:
> Frans Pop writes:
> > I have been running chrony on my amd64 box without ever seeing any
> > problems with kernels from 2.6.18 to 2.6.25.
>
> Could you try stopping and starting it a few times?  The bug manifests
> only intermittently and only on startup.

I use it on all my machines, which includes 3 Pentium boxes, a sparc64 box, 
a hppa box and an arm system. I have never seen this on any of them.

The amd64 system I use it on is my desktop, so it's booted daily and when 
I'm doing kernel bisects, multiple times per day. I've never seen anything 
like it (and as I have a CPU monitor on my desktop, I would have).

I suspect that the problem may be hardware related. Maybe some interaction 
with the RTC, which would also explain why it only manifests on startup as 
that is when chrony reads the RTC.
Another option could be that it's somehow related to timekeeping in the 
kernel (but probably still influenced by the exact hardware).

Those are only fairly uneducated guesses though.

Cheers,
FJP


