Re: [Linux-HA] [patch] clock_t wrapped around causingfalse resourcestart failure

Dejan Muhamedagic Thu, 02 Jul 2009 02:00:32 -0700

On Wed, Jul 01, 2009 at 05:13:42PM +0200, Lars Ellenberg wrote:
> On Wed, Jul 01, 2009 at 04:59:24PM +0200, Dejan Muhamedagic wrote:
> > Hi Lars,
> > 
> > On Wed, Jul 01, 2009 at 04:44:06PM +0200, Lars Ellenberg wrote:
> > > On Wed, Jul 01, 2009 at 10:22:58AM -0400, Tavanyar, Simon wrote:
> > > > Aahh ... so it's measured after reboot. That makes sense.
> > > > Thanks, Lars.
> > > > 
> > > > Should I wait for you to log a bugzilla on this?
> > > 
> > > bugzilla and me don't go along very well, really ;-)
> > > so please feel free to open that bugzilla,
> > > and attach my proposed patch there.
> > > 
> > > just to reiterate:
> > > the problem is not the uptime wrap or resulting longclock wrap,
> > > as that is handled correctly in cl_times() and time_longclock.
> > > 
> > > but the cmp_longclock comparing unsigned values
> > > without checking for wrap.
> > 
> > The wrap is stuffed into upper bits of the longclock_t variables:
> > 
> >     return (lc_wrapcount | (longclock_t)timesval);
> > 
> > Or am I missing something? The only explanation I can think of is
> > that this is actually wrong:
> 
> absolutely.


Looking again, this is probably not a problem. The nexttime is
always in the future.

> >     if (cmp_longclock(lnow, append->nexttime) >= 0) {
> > 
> > i.e. it should be strictly greater. I guess that what happened is
> > that somehow the prepare function got called fast enough, so that
> > both timestamps ended equal.
> 
> absolutely not.
>  ;)
>
> Much worse!
> as far as I see, cmp_longclock is broken (does not care for wrap)

I still don't understand what do you mean by "does not care for
wrap". In case of 32-bit clock_t the wrapcount is stuffed into
the upper bits of longclock_t which is 64-bit. How can it then
not take it into account? In case of 64-bit clock_t, there is no
wrap. It has enough bits to serve us until the end of the world.

> even on 64bit archs, and the time window for this to happen is about
> as large as the larges action (or else) timeout you set somewhere in
> crm/lrm/whatever.
> 
> did you have a look at my explanation ealier in this thread,
> and the patch? (part of that message pasted here for reference)
> 
> 
>         if (cmp_longclock(lnow, append->nexttime) >= 0)
> assuming that nexttime was set correctly, and lnow is correct, too,
> and further assuming your clock_t is only 32 bit,
> longclock_t is defined as unsigned long long,
> and that thing becomes:
> 
>         if ((unsigned long long)(lnow) >= (unsigned long 
> long)(append->nexttime))
> 
> which exactly does _NOT_ care for wrap around :(
> 
> 
> example:
>       say, you start with a _large_ lnow (e.g. the equivalent of "-15

But you can't start with large lnow. lnow is 64-bit and you can't
get that far into the future.

>       seconds"), and add an intervall of 30 seconds, resulting
>       "nexttime" become the equivalent of +15 seconds.
>       comparing absolute _unsigned_ means that test ends up comparing
>       if (something verry large >= something very small).
>       which is always true :(
> 
>       you need instead cast both values to _signed_, and compare the
>       _difference_ against zero, as below:
> 
> 
> cmp_longclock macro and functions are broken.  conceivably some other
> longclock related macros/functions may be broken as well.
> 
> it should probably be more like
> (caution, not even compile tested)
> ((and this time incompletely copy'n'pasted,
>   so whitespace will be wrong))
> 
> diff -r 731f8f7b5450 include/clplumbing/longclock.h
> --- a/include/clplumbing/longclock.h    Tue Jun 30 12:02:16 2009 +0200
> +++ b/include/clplumbing/longclock.h    Wed Jul 01 12:27:27 2009 +0200
> @@ -127,9 +127,9 @@
>         ((longclock_t)(l) - (longclock_t)(r))
> 
>  #      define  cmp_longclock(l,r)                      \
> -       (((longclock_t)(l) < (longclock_t)(r))          \
> +       ((((long long)(l) - (long long)(r)) < 0)        \
>         ?       -1                                      \
> -       : (((longclock_t)(l) > (longclock_t)(r))        \
> +       : ((((long long)(l) - (long long)(r)) > 0)      \
>         ?       +1 : 0))

This patch may be correct, but for practical reasons it won't
make any difference.

Thanks,

Dejan

> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD? and LINBIT? are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] [patch] clock_t wrapped around causingfalse resourcestart failure

Reply via email to