RE: [PATCH 0/9] sched_clock fixes
For bisecting the regression we ran 14 tests for 50 repetitions.

Before the bisected regression:

  commit 7b09cc5a9debc86c903c2eff8f8a1fdef773c649
  Author: Pavel Tatashin <pasha.tatas...@oracle.com>
  Date:   Wed Mar 22 16:24:17 2017 -0400

      sched/clock: Fix broken stable to unstable transfer

there were ~0 failing tests on the Core2 machine. After the regression: ~350
failing tests. With your patch-set: ~15 failing tests.

To be honest, I must say that these tests used to give unstable results on
the Core2. But some time ago the results magically stabilized at ~0 fails,
thanks to timing-related fixes for other issues. Ville Syrjala now has a
patch-set that we believe really solves the graphics part of the issue.
However, I believe that your patch-set still improves the situation related
to the TSC instability of the Core2.

/Marta

> -----Original Message-----
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Tuesday, April 25, 2017 4:45 PM
> To: Lofstedt, Marta <marta.lofst...@intel.com>
> Cc: t...@linutronix.de; mi...@kernel.org; linux-kernel@vger.kernel.org;
> ville.syrj...@linux.intel.com; daniel.lezc...@linaro.org; Wysocki, Rafael J
> <rafael.j.wyso...@intel.com>; martin.pe...@linux.intel.com;
> pasha.tatas...@oracle.com; daniel.vet...@ffwll.ch
> Subject: Re: [PATCH 0/9] sched_clock fixes
>
> On Tue, Apr 25, 2017 at 09:31:40AM +, Lofstedt, Marta wrote:
> > Hi Peterz,
> >
> > I tested your patch-set on the same Core2 machine as where we
> > discovered the regression.
> > With the tsc=unstable boot param the passrate has improved
> > significantly; 350 fails -> 15 fails.
>
> So is that the same as before, or still worse? I don't really have a
> handle on what your benchmark is here, nor how 'good' is defined.
>
> If it's still worse than before, I'm completely confused. Because with
> "tsc=unstable" the patch you fingered is a complete no-op
> (__gtod_offset == 0).
Re: [PATCH 0/9] sched_clock fixes
On Tue, Apr 25, 2017 at 09:31:40AM +, Lofstedt, Marta wrote:
> Hi Peterz,
>
> I tested your patch-set on the same Core2 machine as where we
> discovered the regression.
> With the tsc=unstable boot param the passrate has improved
> significantly; 350 fails -> 15 fails.

So is that the same as before, or still worse? I don't really have a
handle on what your benchmark is here, nor how 'good' is defined.

If it's still worse than before, I'm completely confused. Because with
"tsc=unstable" the patch you fingered is a complete no-op
(__gtod_offset == 0).
RE: [PATCH 0/9] sched_clock fixes
Hi Peterz,

I tested your patch-set on the same Core2 machine as where we discovered
the regression.
With the tsc=unstable boot param the passrate has improved significantly;
350 fails -> 15 fails.

BR,
Marta

> -----Original Message-----
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Friday, April 21, 2017 5:58 PM
> To: t...@linutronix.de; mi...@kernel.org
> Cc: linux-kernel@vger.kernel.org; ville.syrj...@linux.intel.com;
> daniel.lezc...@linaro.org; Wysocki, Rafael J;
> Lofstedt, Marta; martin.pe...@linux.intel.com;
> pasha.tatas...@oracle.com; pet...@infradead.org; daniel.vet...@ffwll.ch
> Subject: [PATCH 0/9] sched_clock fixes
>
> Hi,
>
> These patches were inspired by (and hopefully fix) two independent bug
> reports on Core2 machines.
>
> I never could quite reproduce one, but my Core2 machine no longer
> switches to stable sched_clock and therefore no longer tickles the
> problematic stable -> unstable transition either.
>
> Before I dug up my Core2 machine, I tried emulating TSC wreckage by
> poking random values into the TSC MSR from userspace. Behaviour in that
> case is improved as well.
>
> People have to realize that if we manage to boot with TSC 'stable'
> (both sched_clock and clocksource) and we later find out we were
> mistaken (we observe a TSC wobble), the clocks that are derived from it
> _will_ have had an observable hiccup. This is fundamentally unfixable.
>
> If you own a machine where the BIOS tries to hide SMI latencies by
> rewinding TSC (yes, this is a thing), the very best we can do is mark
> TSC unstable with a boot parameter.
> For example, this is me writing a stupid value into the TSC:
>
> [   46.745082] random: crng init done
> [18443029775.010069] clocksource: timekeeping watchdog on CPU0: Marking
>                      clocksource 'tsc' as unstable because the skew is
>                      too large:
> [18443029775.023141] clocksource: 'hpet' wd_now: 3ebec538
>                      wd_last: 3e486ec9 mask:
> [18443029775.034214] clocksource: 'tsc' cs_now: 5025acce9
>                      cs_last: 24dc3bd21c88ee mask:
> [18443029775.046651] tsc: Marking TSC unstable due to clocksource watchdog
> [18443029775.054211] TSC found unstable after boot, most likely due to
>                      broken BIOS. Use 'tsc=unstable'.
> [18443029775.064434] sched_clock: Marking unstable
>                      (70569005835, -17833788)<-(-3714295689546517, -2965802361)
> [   70.573700] clocksource: Switched to clocksource hpet
>
> With some trace_printk()s (not included) I could tell that the wobble
> occurred at 69.965474. The clock now resumes where it 'should' have
> been.
>
> But an unfortunate scheduling event could have resulted in one task
> having seen a runtime of ~584 years with 'obvious' effects. Similar
> jumps can also be observed from userspace GTOD usage.