RE: [PATCH 0/9] sched_clock fixes

2017-04-26 Thread Lofstedt, Marta
For bisecting the regression we ran 14 tests for 50 repetitions.

Before the bisected regression:

commit 7b09cc5a9debc86c903c2eff8f8a1fdef773c649
Author: Pavel Tatashin <pasha.tatas...@oracle.com>
Date:   Wed Mar 22 16:24:17 2017 -0400

sched/clock: Fix broken stable to unstable transfer

there were ~0 failing tests on the Core2 machine.
After the regression: ~350 failing tests.
With your patch-set: ~15 failing tests.

To be honest, I must say that these tests used to give unstable results on the 
Core2. But some time ago, the results magically stabilized at ~0 fails, thanks to 
timing-related fixes for other issues. Ville Syrjala now has a patch-set that 
we believe really solves the graphics part of the issue. However, I believe 
that your patch-set still improves the situation related to the TSC instability 
of the Core2. 

/Marta

> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Tuesday, April 25, 2017 4:45 PM
> To: Lofstedt, Marta <marta.lofst...@intel.com>
> Cc: t...@linutronix.de; mi...@kernel.org; linux-kernel@vger.kernel.org;
> ville.syrj...@linux.intel.com; daniel.lezc...@linaro.org; Wysocki, Rafael J
> <rafael.j.wyso...@intel.com>; martin.pe...@linux.intel.com;
> pasha.tatas...@oracle.com; daniel.vet...@ffwll.ch
> Subject: Re: [PATCH 0/9] sched_clock fixes
> 
> On Tue, Apr 25, 2017 at 09:31:40AM +, Lofstedt, Marta wrote:
> > Hi Peterz,
> >
> > I tested your patch-set on the same Core2 machine as where we
> discovered the regression.
> > With the tsc=unstable boot param the pass rate has improved significantly;
> 350 fails -> 15 fails.
> 
> So is that the same as before, or still worse? I don't really have a handle on
> what your benchmark is here, nor how 'good' is defined.
> 
> If it's still worse than before, I'm completely confused. Because with
> "tsc=unstable" the patch you fingered is a complete no-op (__gtod_offset
> == 0).



Re: [PATCH 0/9] sched_clock fixes

2017-04-25 Thread Peter Zijlstra
On Tue, Apr 25, 2017 at 09:31:40AM +, Lofstedt, Marta wrote:
> Hi Peterz,
> 
> I tested your patch-set on the same Core2 machine as where we discovered the 
> regression. 
> With the tsc=unstable boot param that passrate has improved significantly; 
> 350 fails -> 15 fails.

So is that the same as before, or still worse? I don't really have a
handle on what your benchmark is here, nor how 'good' is defined.

If it's still worse than before, I'm completely confused. Because with
"tsc=unstable" the patch you fingered is a complete no-op (__gtod_offset
== 0).



RE: [PATCH 0/9] sched_clock fixes

2017-04-25 Thread Lofstedt, Marta
Hi Peterz,

I tested your patch-set on the same Core2 machine as where we discovered the 
regression. 
With the tsc=unstable boot param the pass rate has improved significantly; 350 
fails -> 15 fails.

BR,
Marta

> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Friday, April 21, 2017 5:58 PM
> To: t...@linutronix.de; mi...@kernel.org
> Cc: linux-kernel@vger.kernel.org; ville.syrj...@linux.intel.com;
> daniel.lezc...@linaro.org; Wysocki, Rafael J ;
> Lofstedt, Marta ; martin.pe...@linux.intel.com;
> pasha.tatas...@oracle.com; pet...@infradead.org; daniel.vet...@ffwll.ch
> Subject: [PATCH 0/9] sched_clock fixes
> 
> Hi,
> 
> These patches were inspired by (and hopefully fix) two independent bug
> reports on Core2 machines.
> 
> I never could quite reproduce one, but my Core2 machine no longer switches
> to stable sched_clock and therefore no longer tickles the problematic
> stable -> unstable transition either.
> 
> Before I dug up my Core2 machine, I tried emulating TSC wreckage by poking
> random values into the TSC MSR from userspace. Behaviour in that case is
> improved as well.
> 
> People have to realize that if we manage to boot with TSC 'stable' (both
> sched_clock and clocksource) and we later find out we were mistaken (we
> observe a TSC wobble), the clocks that are derived from it _will_ have had an
> observable hiccup. This is fundamentally unfixable.
> 
> If you own a machine where the BIOS tries to hide SMI latencies by
> rewinding TSC (yes, this is a thing), the very best we can do is mark TSC
> unstable with a boot parameter.
> 
> For example, this is me writing a stupid value into the TSC:
> 
> [   46.745082] random: crng init done
> [18443029775.010069] clocksource: timekeeping watchdog on CPU0: Marking
> clocksource 'tsc' as unstable because the skew is too large:
> [18443029775.023141] clocksource:   'hpet' wd_now: 3ebec538 wd_last: 3e486ec9 mask: 
> [18443029775.034214] clocksource:   'tsc' cs_now: 5025acce9 cs_last: 24dc3bd21c88ee mask: 
> [18443029775.046651] tsc: Marking TSC unstable due to clocksource watchdog
> [18443029775.054211] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> [18443029775.064434] sched_clock: Marking unstable (70569005835, -17833788)<-(-3714295689546517, -2965802361)
> [   70.573700] clocksource: Switched to clocksource hpet
> 
> With some trace_printk()s (not included) I could tell that the wobble occurred
> at 69.965474. The clock now resumes where it 'should' have been.
> 
> But an unfortunate scheduling event could have resulted in one task having
> seen a runtime of ~584 years with 'obvious' effects. Similar jumps can also be
> observed from userspace GTOD usage.
> 


