Re: Use fences for kernel tsc reads.

2012-07-29 Thread Konstantin Belousov
On Sun, Jul 29, 2012 at 02:58:36PM +1000, Bruce Evans wrote:
 On Sat, 28 Jul 2012, Konstantin Belousov wrote:
 ...
 Handling of usermode will follow later.
 
 I hesitate to mention that this doesn't pessimize all uses of rdtsc:
 - x86/isa/clock.c uses raw rdtsc32() for DELAY()
There, fence is not needed because we do not compare counters on
different cores. Note that delay_tc() explicitely performs pin when
tsc is going to be used.

 - x86/x86/mca.c uses raw rdtsc() for mca_check_status()
mca use of rdtsc() seems to tbe purely informational.

 diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
 index c253a96..101cbb3 100644
 --- a/sys/x86/x86/tsc.c
 +++ b/sys/x86/x86/tsc.c
 ...
 @@ -328,15 +344,26 @@ init_TSC(void)
 
 #ifdef SMP
 
 -/* rmb is required here because rdtsc is not a serializing instruction. */
 +/*
 + * RDTSC is not a serializing instruction, so we need to drain
 + * instruction stream before executing it. It could be fixed by use of
 + * RDTSCP, except the instruction is not available everywhere.
 
 Sentence breaks are 2 spaces in KNF.  There was a recent thread about this.
Changed.

 RDTSC apparently doesn't need full serialization (else fences wouldn't
 be enough, and the necessary serialization would be slower).  The
 necessary serialization is very complex and not fully described above,
 but the above is not a good place to describe many details.  It already
 describes too much.  It is obvious that we shouldn't use large code or
 ifdefs to be portable in places where we don't care at all about
 efficiency.  Only delicate timing requirements would prevent us using
 the simple CPUID in such code.  For example, in code very much like
 this, we might need to _not_ slow things down, since we might want
 to see small differences between the TSCs on different CPUs.
This is argument to stop using do_cpuid(), since it causes relatively
large amount of unneeded memory writes. I inlined CPUID instead with
explicit register clobber list.

 
 @@ -592,23 +628,55 @@ sysctl_machdep_tsc_freq(SYSCTL_HANDLER_ARGS)
 SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq, CTLTYPE_U64 | CTLFLAG_RW,
 0, 0, sysctl_machdep_tsc_freq, QU, Time Stamp Counter frequency);
 
 -static u_int
 +static inline u_int
 tsc_get_timecount(struct timecounter *tc __unused)
 {
 
  return (rdtsc32());
 }
 
 Adding `inline' gives up to 5 (!) style bugs at different levels:
 - `inline' is usually spelled `__inline'.  The former is now standard,
   but sys/cdefs.h never caught up with this and has better magic for
   the latter.  And `inline' doesn't match most existing spellings.
There is no reasons to use __inline in the kernel code anymore. We are in
C99 mode always when compiling kernel.

 - the function is still forward-declared without `inline'.  The differenc
   is confusing.
 - forward declaration of inline functions is nonsense.  It sort of asks
   for early uses to be non-inline and late uses to be inline.  But
   gcc now doesn't care much about the order, since -funit-at-a-time is
   the default.
 - this function is never actually inline.  You added calls to the others
   but not this.
 - if this function were inline, then this might happen automatically due
   to -finline-functions-called-once.  See below.
 
 
 -static u_int
 +static inline u_int
 tsc_get_timecount_low(struct timecounter *tc)
 {
  uint32_t rv;
 
  __asm __volatile(rdtsc; shrd %%cl, %%edx, %0
 -: =a (rv) : c ((int)(intptr_t)tc-tc_priv) : edx);
 +: =a (rv) : c ((int)(intptr_t)tc-tc_priv) : edx);
  return (rv);
 }
...
Added inline to declaration of tsc_get_timecounter_low. And removed
'inline' from other functions. Compiler still generates exactly '*fence;
rdtsc' sequences. In truth, for _low variabnts, the load of tc_priv is
done between fence and rdtsc, but I kept it as is for now.

diff --git a/sys/amd64/include/cpufunc.h b/sys/amd64/include/cpufunc.h
index 94d4133..881fcd2 100644
--- a/sys/amd64/include/cpufunc.h
+++ b/sys/amd64/include/cpufunc.h
@@ -290,6 +290,13 @@ popcntq(u_long mask)
 }
 
 static __inline void
+lfence(void)
+{
+
+   __asm __volatile(lfence : : : memory);
+}
+
+static __inline void
 mfence(void)
 {
 
diff --git a/sys/i386/include/cpufunc.h b/sys/i386/include/cpufunc.h
index 62d268d..7cd3663 100644
--- a/sys/i386/include/cpufunc.h
+++ b/sys/i386/include/cpufunc.h
@@ -155,6 +155,13 @@ cpu_mwait(u_long extensions, u_int hints)
 }
 
 static __inline void
+lfence(void)
+{
+
+   __asm __volatile(lfence : : : memory);
+}
+
+static __inline void
 mfence(void)
 {
 
diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
index c253a96..3d8bd30 100644
--- a/sys/x86/x86/tsc.c
+++ b/sys/x86/x86/tsc.c
@@ -82,7 +82,11 @@ static void tsc_freq_changed(void *arg, const struct 
cf_level *level,
 static void tsc_freq_changing(void *arg, const struct cf_level *level,
 int *status);
 static unsigned tsc_get_timecount(struct timecounter *tc);
-static unsigned tsc_get_timecount_low(struct timecounter *tc);
+static inline unsigned 

Re: Use fences for kernel tsc reads.

2012-07-28 Thread Konstantin Belousov
On Sat, Jul 28, 2012 at 09:36:43AM -0700, Jim Harris wrote:
 On Saturday, July 28, 2012, Konstantin Belousov wrote:
 
  Hi,
 
  This was discussed on somewhat unwieldly thread on svn-src@ as a followup
  to the commit r238755 which uncovered the problem in the first place.
 
  Below is the commit candidate. It changes the fix in r238755 to use CPUID
  instead of rmb(), since rmb() translates to locked operation on i386,
  and experimentation shown that Nehalem's RDTSC can pass LOCK.
 
  Also, I explicitely use LFENCE and MFENCE instead using rmb() and mb(),
  for the same reason that on i386 implementation of *mb() is a locked
  operation instead of fence.
 
  Jim, could you please, recheck that on your machine there is no regression
  in the TSC synchronization test ?
 
 Ok. I will do this first thing on Monday. Should I still MFC my patch to
 stable-9?

Definitely, please merge. I do not think that SMP  !SSE2 is important
population for us, and since SMP machines of that kind were definitely
multi-socket, I doubt that TSC was synchronized for them anyway. Users
will need to disable tsc manually.



pgpXpISs1bceo.pgp
Description: PGP signature