Re: Fwd: 5-STABLE kernel build with icc broken

2005-04-02 Thread Bruce Evans
On Fri, 1 Apr 2005, Matthew Dillon wrote:
:The use of the XMM registers is a cpu optimization.  Modern CPUs,
:especially AMD Athlon and Opterons, are more efficient with 128 bit
:moves than with 64 bit moves.   I experimented with all sorts of
:configurations, including the use of special data caching instructions,
:but they had so many special cases and degenerate conditions that
:I found that simply using straight XMM instructions, reading as big
:a glob as possible, then writing the glob, was by far the best solution.
:
:Are you sure about that?  The amd64 optimization manual says (essentially)
This is in 25112.PDF section 5.16 (Interleave Loads and Stores, with
128 bits of loads followed by 128 bits of stores).
:that big globs are bad, and my benchmarks confirm this.  The best glob size
:is 128 bits according to my benchmarks.  This can be obtained using 2
:...
:
:Unfortunately (since I want to avoid using both MMX and XMM), I haven't
:managed to make copying through 64-bit integer registers work as well.
:Copying 128 bits at a time using 2 pairs of movq's through integer
:registers gives only 7.9GB/sec.  movq through MMX is never that slow.
:However, movdqu through xmm is even slower (7.4GB/sec).
I forgot many of my earlier conclusions when I wrote the above.  The
speeds between 7.4GB/sec and 12.9GB/sec for the fully (L1) cached case
are almost irrelevant.  They basically just tell how well we have
used the instruction bandwidth.  Plain movsq uses it better and gets
15.9GB/sec.  I believe 15.9GB/sec is from saturating the L1 cache.
The CPU is an Athlon64 and its clock frequency is 1994 MHz, and I think
the max L1 cache copy bandwidth is with an 8-byte load and an 8-byte store
per cycle; 8*1994*10^6 is 15.95GB/sec (disk manufacturers' GBs).
Plain movsq is best here for many other cases too...
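A quick back-of-the-envelope check of that figure (my arithmetic, assuming
the L1 sustains one 8-byte load plus one 8-byte store per cycle, i.e. 8
bytes copied per cycle):

#include <stdio.h>

int
main(void)
{
    double mhz = 1994.0;            /* clock frequency from the dmesg below */
    double bytes_per_cycle = 8.0;   /* one 8-byte load + one 8-byte store */
    double gbs = bytes_per_cycle * mhz * 1e6 / 1e9;  /* "disk" GB = 10^9 bytes */

    printf("theoretical L1 copy rate: %.2f GB/sec\n", gbs);  /* ~15.95 */
    return (0);
}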
:
:The fully cached case is too unrepresentative of normal use, and normal
:(partially cached) use is hard to benchmark, so I normally benchmark
:the fully uncached case.  For that, movnt* is best for benchmarks but
:not for general use, and it hardly matters which registers are used.
   Yah, I'm pretty sure.  I tested the fully cached (L1), partially
   cached (L2), and the fully uncached cases.   I don't have a logic
By the partially cached case, I meant the case where some of the source
and/or target addresses are in the L1 or L2 cache, but you don't really
know the chance that they are there (or should be there after the copy),
so you can only guess the best strategy.
   analyzer but what I think is happening is that the cpu's write buffer
   is messing around with the reads and causing extra RAS cycles to occur.
   I also tested using various combinations of movdqa, movntdq, and
   prefetcha.
Somehow I'm only seeing small variations from different strategies now,
with all tests done in userland on an Athlon64 system (and on AthlonXP
systems for reference).  Using XMM or MMX can be twice as fast on
the AthlonXPs, but movsq is absolutely the fastest in many cases on
the Athlon64, and is no more than about 5% slower than the fastest in all
cases (except for the fully uncached case, since it can't do nontemporal
stores), so it is the best general method.
...
   I also think there might be some odd instruction pipeline effects
   that skew the results when only one or two instructions are between
   the load into an %xmm register and the store from the same register.
   I tried using 2, 4, and 8 XMM registers.  8 XMM registers seemed to
   work the best.
I'm getting only small variations from different load/store patterns.
   Of course, I primarily tested on an Athlon 64 3200+, so YMMV.  (One
   of the first Athlon 64's, so it has a 1MB L2 cache).
My test system is very similar:
%%%
CPU: AMD Athlon(tm) 64 Processor 3400+ (1994.33-MHz K8-class CPU)
  Origin = AuthenticAMD  Id = 0xf48  Stepping = 8
  
  Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
  AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow+,3DNow>
L1 2MB data TLB: 8 entries, fully associative
L1 2MB instruction TLB: 8 entries, fully associative
L1 4KB data TLB: 32 entries, fully associative
L1 4KB instruction TLB: 32 entries, fully associative
L1 data cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L1 instruction cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L2 2MB unified TLB: 0 entries, disabled/not present
L2 4KB data TLB: 512 entries, 4-way associative
L2 4KB instruction TLB: 512 entries, 4-way associative
L2 unified cache: 1024 kbytes, 64 bytes/line, 1 lines/tag, 16-way associative
%%%
   The prefetchnta I have commented out seemed to improve performance,
   but it requires 3dNOW and I didn't want to NOT have an MMX copy mode
   for CPUs with MMX but without 3dNOW.  Prefetching less than 128 bytes
   did not help, and prefetching greater than 128 bytes (e.g. 256(%esi))
   seemed to cause extra RAS cycles.  It was unbelievably finicky, not at
   all what I expected.
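For concreteness, the loop shape described above (a streaming XMM copy with
a prefetch 128 bytes ahead of the current position) would look roughly like
the following userland sketch.  It is an illustration only, not code from
bcopy.s; it assumes SSE2, 16-byte-aligned buffers and a length that is a
multiple of 64 (build with -msse2 on i386):

#include <stddef.h>

/*
 * Load four XMM registers before storing any of them, and prefetch 128
 * bytes ahead.  Prefetches past the end of the buffer are harmless.
 */
void
copy_xmm_prefetch(char *d, const char *s, size_t len)
{
    size_t off;

    for (off = 0; off < len; off += 64) {
        __asm__ __volatile__(
            "prefetchnta 128(%1)\n\t"
            "movdqa   (%1),%%xmm0\n\t"
            "movdqa 16(%1),%%xmm1\n\t"
            "movdqa 32(%1),%%xmm2\n\t"
            "movdqa 48(%1),%%xmm3\n\t"
            "movdqa %%xmm0,  (%0)\n\t"
            "movdqa %%xmm1,16(%0)\n\t"
            "movdqa %%xmm2,32(%0)\n\t"
            "movdqa %%xmm3,48(%0)\n\t"
            : : "r"(d + off), "r"(s + off)
            : "xmm0", "xmm1", "xmm2", "xmm3", "memory");
    }
}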

Re: Fwd: 5-STABLE kernel build with icc broken

2005-04-01 Thread Bruce Evans
On Thu, 31 Mar 2005, Matthew Dillon wrote:
I didn't mean to get into the kernel's use of the FPU, but...
   All I really did was implement a comment that DG had made many years
   ago in the PCB structure about making the FPU save area a pointer rather
   than hardwiring it into the PCB.
ISTR writing something like that.  dg committed most of my early work
since I didn't have commit access at the time.
...
   The use of the XMM registers is a cpu optimization.  Modern CPUs,
   especially AMD Athlon and Opterons, are more efficient with 128 bit
   moves than with 64 bit moves.   I experimented with all sorts of
   configurations, including the use of special data caching instructions,
   but they had so many special cases and degenerate conditions that
   I found that simply using straight XMM instructions, reading as big
   a glob as possible, then writing the glob, was by far the best solution.
Are you sure about that?  The amd64 optimization manual says (essentially)
that big globs are bad, and my benchmarks confirm this.  The best glob size
is 128 bits according to my benchmarks.  This can be obtained using 2
64-bit reads of 64-bit registers followed by 2 64-bit writes of these
registers, or by read-write of a single 128-bit register.  The 64-bit
registers can be either MMX or integer registers on 64-bit systems, but
the 128-bit registers must be XMM on all systems.  I get identical speeds
of 12.9GB/sec (+-0.1GB/sec) on a fairly old and slow Athlon64 system
for copying 16K (fully cached) through MMX and XMM 128 bits at a time
using the following instructions:
# MMX:                          # XMM:
movq    (%0),%mm0               movdqa  (%0),%xmm0
movq    8(%0),%mm1              movdqa  %xmm0,(%1)
movq    %mm0,(%1)               ...     # unroll same amount
movq    %mm1,8(%1)
...     # unroll to copy 64 bytes per iteration
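For reference, a self-contained userland version of the XMM variant is
sketched below (not the actual benchmark harness; it assumes SSE2 and
16-byte-aligned buffers, and copies 64 bytes per unrolled pass through a
single XMM register):

#include <stdio.h>
#include <stddef.h>

static char src[16384] __attribute__((aligned(16)));
static char dst[16384] __attribute__((aligned(16)));

/* 128 bits at a time through one XMM register. */
static void
copy_xmm(char *d, const char *s, size_t len)
{
    size_t off;

    for (off = 0; off < len; off += 64) {
        __asm__ __volatile__(
            "movdqa   (%1),%%xmm0\n\t"
            "movdqa %%xmm0,  (%0)\n\t"
            "movdqa 16(%1),%%xmm0\n\t"
            "movdqa %%xmm0,16(%0)\n\t"
            "movdqa 32(%1),%%xmm0\n\t"
            "movdqa %%xmm0,32(%0)\n\t"
            "movdqa 48(%1),%%xmm0\n\t"
            "movdqa %%xmm0,48(%0)\n\t"
            : : "r"(d + off), "r"(s + off) : "xmm0", "memory");
    }
}

int
main(void)
{
    size_t i;

    for (i = 0; i < sizeof(src); i++)
        src[i] = (char)i;
    copy_xmm(dst, src, sizeof(dst));    /* the 16K fully cached case */
    printf("dst[4097] = %d\n", dst[4097]);
    return (0);
}

(movdqu would handle unaligned buffers, at the lower speed noted below.)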
Unfortunately (since I want to avoid using both MMX and XMM), I haven't
managed to make copying through 64-bit integer registers work as well.
Copying 128 bits at a time using 2 pairs of movq's through integer
registers gives only 7.9GB/sec.  movq through MMX is never that slow.
However, movdqu through xmm is even slower (7.4GB/sec).
The fully cached case is too unrepresentative of normal use, and normal
(partially cached) use is hard to benchmark, so I normally benchmark
the fully uncached case.  For that, movnt* is best for benchmarks but
not for general use, and it hardly matters which registers are used.
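A minimal non-temporal sketch, with the same alignment assumptions as the
previous one: the movntdq stores bypass the caches, and an sfence drains
the write-combining buffers before the data is relied upon.

#include <stddef.h>

void
copy_xmm_nt(char *d, const char *s, size_t len)
{
    size_t off;

    for (off = 0; off < len; off += 16) {
        __asm__ __volatile__(
            "movdqa  (%1),%%xmm0\n\t"
            "movntdq %%xmm0,(%0)\n\t"
            : : "r"(d + off), "r"(s + off) : "xmm0", "memory");
    }
    __asm__ __volatile__("sfence" : : : "memory");
}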
   The key for fast block copying is to not issue any memory writes other
   than those related directly to the data being copied.  This avoids
   unnecessary RAS cycles which would otherwise kill copying performance.
   In tests I found that copying multi-page blocks in a single loop was
   far more efficient than copying data page-by-page precisely because
   page-by-page copying was too complex to be able to avoid extraneous
   writes to memory unrelated to the target buffer in between each page copy.
By page-by-page, do you mean prefetch a page at a time into the L1
cache?
I've noticed strange loss (apparently) from extraneous reads or writes
more for benchmarks that do just (very large) writes.  On at least old
Celerons and AthlonXPs, the writes go straight to the L1/L2 caches
(unless you use movntq on AthlonXP's).  The caches are flushed to main
memory some time later, apparently not very well since some pages take
more than twice as long to write as others (as seen by the writer
filling the caches), and the slow case happens enough to affect the
average write speed by up to 50%.  This problem can be reduced by
putting memory bank bits in the page colors.  This is hard to get right
even for the simple unrepresentative case of large writes.
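To illustrate what putting memory bank bits in the page colors means, here
is a hypothetical sketch; the bank-bit positions are made up, since they
depend entirely on the memory controller, and are not from any real system:

#include <stdint.h>

#define PAGE_SHIFT        12
#define CACHE_COLOR_BITS  4     /* e.g. 16 colors derived from the L2 geometry */
#define BANK_SHIFT        21    /* placeholder: first bank-select address bit */
#define BANK_BITS         2     /* placeholder: number of bank-select bits */

unsigned
page_color(uint64_t pa)
{
    unsigned cache_color = (pa >> PAGE_SHIFT) & ((1u << CACHE_COLOR_BITS) - 1);
    unsigned bank = (pa >> BANK_SHIFT) & ((1u << BANK_BITS) - 1);

    /* Fold the bank bits into the color so that pages allocated with
     * consecutive colors also rotate through the DRAM banks. */
    return ((bank << CACHE_COLOR_BITS) | cache_color);
}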
Bruce


Re: Fwd: 5-STABLE kernel build with icc broken

2005-04-01 Thread Matthew Dillon

:The use of the XMM registers is a cpu optimization.  Modern CPUs,
:especially AMD Athlon and Opterons, are more efficient with 128 bit
:moves than with 64 bit moves.   I experimented with all sorts of
:configurations, including the use of special data caching instructions,
:but they had so many special cases and degenerate conditions that
:I found that simply using straight XMM instructions, reading as big
:a glob as possible, then writing the glob, was by far the best solution.
:
:Are you sure about that?  The amd64 optimization manual says (essentially)
:that big globs are bad, and my benchmarks confirm this.  The best glob size
:is 128 bits according to my benchmarks.  This can be obtained using 2
:...
:
:Unfortunately (since I want to avoid using both MMX and XMM), I haven't
:managed to make copying through 64-bit integer registers work as well.
:Copying 128 bits at a time using 2 pairs of movq's through integer
:registers gives only 7.9GB/sec.  movq through MMX is never that slow.
:However, movdqu through xmm is even slower (7.4GB/sec).
:
:The fully cached case is too unrepresentative of normal use, and normal
:(partially cached) use is hard to benchmark, so I normally benchmark
:the fully uncached case.  For that, movnt* is best for benchmarks but
:not for general use, and it hardly matters which registers are used.

Yah, I'm pretty sure.  I tested the fully cached (L1), partially
cached (L2), and the fully uncached cases.   I don't have a logic 
analyzer but what I think is happening is that the cpu's write buffer
is messing around with the reads and causing extra RAS cycles to occur.
I also tested using various combinations of movdqa, movntdq, and
prefetcha.  Carefully arranged non-temporal and/or prefetch instructions
were much faster for the uncached case, but much, MUCH slower for
the partially cached (L2) or fully (L1) cached case, making them 
unsuitable for a generic copy.  I am rather miffed that AMD screwed up
the non-temporal instructions so badly.

I also think there might be some odd instruction pipeline effects
that skew the results when only one or two instructions are between
the load into an %xmm register and the store from the same register.
I tried using 2, 4, and 8 XMM registers.  8 XMM registers seemed to
work the best.
  
Of course, I primarily tested on an Athlon 64 3200+, so YMMV.  (One
of the first Athlon 64's, so it has a 1MB L2 cache).

:The key for fast block copying is to not issue any memory writes other
:than those related directly to the data being copied.  This avoids
:unnecessary RAS cycles which would otherwise kill copying performance.
:In tests I found that copying multi-page blocks in a single loop was
:far more efficient than copying data page-by-page precisely because
:page-by-page copying was too complex to be able to avoid extraneous
:writes to memory unrelated to the target buffer in between each page copy.
:
:By page-by-page, do you mean prefetch a page at a time into the L1
:cache?

No, I meant that taking, e.g., a vm_page_t array and copying it
page-by-page, mapping and copying in 4K chunks, seems to be a lot slower
than doing a linear mapping of the entire vm_page_t array and doing
one big copy.  Literally the same code, just rearranged a bit.  Just
writing to the stack in between each page was enough to throw it off.
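Schematically (illustrative C only, not the DragonFly code; map_page() and
unmap_page() are hypothetical stand-ins for the per-page bookkeeping whose
stores are exactly the extraneous writes being discussed):

#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page_ctx {
    void *kva;                  /* kernel mapping of this page */
    /* ... other per-page state written between copies ... */
};

void map_page(struct page_ctx *p);      /* hypothetical */
void unmap_page(struct page_ctx *p);    /* hypothetical */

void
copy_page_by_page(struct page_ctx *pages, char *dst, int npages)
{
    int i;

    for (i = 0; i < npages; i++) {
        map_page(&pages[i]);            /* writes unrelated to dst */
        memcpy(dst + (size_t)i * PAGE_SIZE, pages[i].kva, PAGE_SIZE);
        unmap_page(&pages[i]);          /* more unrelated writes */
    }
}

void
copy_linear(char *dst, const char *src_kva, size_t npages)
{
    /* One big copy over a linear mapping: nothing interleaves with the
     * streaming stores to dst. */
    memcpy(dst, src_kva, npages * PAGE_SIZE);
}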

:I've noticed strange loss (apparently) from extraneous reads or writes
:more for benchmarks that do just (very large) writes.  On at least old
:Celerons and AthlonXPs, the writes go straight to the L1/L2 caches
:(unless you use movntq on AthlonXP's).  The caches are flushed to main
:memory some time later, apparently not very well since some pages take
:more than twice as long to write as others (as seen by the writer
:filling the caches), and the slow case happens enough to affect the
:average write speed by up to 50%.  This problem can be reduced by
:putting memory bank bits in the page colors.  This is hard to get right
:even for the simple unrepresentative case of large writes.
:
:Bruce

I've seen the same effects and come to the same conclusion.  The
copy code I eventually settled on was this (taken from my i386/bcopy.s).
It isn't as fast as using movntdq for the fully uncached case, but it
seems to perform best in the system because in real life the data tends
to have been accessed by someone and to be in the cache already (e.g.
source data tends to be in the cache even if the device driver doesn't
touch the target data).

I wish AMD had made movntdq work the same as movdqa for the case where
the data was already in the cache, then movntdq would have been the
clear winner.

The prefetchnta I have commented out seemed to improve performance,
but it requires 3dNOW and I didn't want to NOT have an MMX copy mode
for CPUs with MMX but without 3dNOW.  Prefetching 

Re: Fwd: 5-STABLE kernel build with icc broken

2005-04-01 Thread Matthew Dillon
Here is the core of the FPU setup and restoration code for the kernel
bcopy in DragonFly, from i386/bcopy.s.

DragonFly uses the TD_SAVEFPU-is-a-pointer method that was outlined in
the original comment in the FreeBSD code.  I further enhanced the
algorithm to guarantee that the FPU is in a sane state (it does not
require any further initialization other than a clts) if userland has
NOT used it.  However, there are definitely some race cases that
must be considered (see the comments).

The on-fault handling in DragonFly is stackable (which further simplifies
the whole mess of on-fault vs non-on-fault copying code) and the DFly
bcopy just sets up the frame for it whether or not the onfault handling 
is actually needed.

This could be further optimized, but I had already spent at least a month
on it and had to move on to other things.  In particular, the setting
of CR0_TS and the restoration of TD_SAVEFPU could be moved to the
syscall-return code, so multiple in-kernel bcopy operations could be
issued without any further FPU setup or teardown.

-Matt

/*
 * RACES/ALGORITHM:
 *
 *  If gd_npxthread is not NULL we must save the application's
 *  current FP state to the current save area and then NULL
 *  out gd_npxthread to interlock against new interruptions
 *  changing the FP state further.
 *
 *  If gd_npxthread is NULL the FP unit is in a known 'safe'
 *  state and may be used once the new save area is installed.
 *
 *  race(1): If an interrupt occurs just prior to calling fxsave
 *  all that happens is that fxsave gets a npxdna trap, restores
 *  the app's environment, and immediately traps, restores,
 *  and saves it again.
 *
 *  race(2): No interrupt can safely occur after we NULL-out
 *  npxthread until we fninit, because the kernel assumes that
 *  the FP unit is in a safe state when npxthread is NULL.  It's
 *  more convenient to use a cli sequence here (it is not
 *  considered to be in the critical path), but a critical
 *  section would also work.
 *
 *  race(3): The FP unit is in a known state (because npxthread
 *  was either previously NULL or we saved and init'd and made
 *  it NULL).  This is true even if we are preempted and the
 *  preempting thread uses the FP unit, because it will be
 *  fninit'd again on return.  ANY STATE WE SAVE TO THE FPU MAY
 *  BE DESTROYED BY PREEMPTION WHILE NPXTHREAD IS NULL!  However,
 *  an interrupt occurring between clts and the setting of
 *  gd_npxthread may set the TS bit again and cause the next
 *  npxdna() to panic when it sees a non-NULL gd_npxthread.
 *
 *  We can safely set TD_SAVEFPU to point to a new uninitialized
 *  save area and then set GD_NPXTHREAD to non-NULL.  If an
 *  interrupt occurs after we set GD_NPXTHREAD, all that happens
 *  is that the safe FP state gets saved and restored.  We do not
 *  need to fninit again.
 *
 *  We can safely clts after setting up the new save-area, before
 *  installing gd_npxthread, even if we get preempted just after
 *  calling clts.  This is because the FP unit will be in a safe
 *  state while gd_npxthread is NULL.  Setting gd_npxthread will
 *  simply lock-in that safe-state.  Calling clts saves
 *  unnecessary trap overhead since we are about to use the FP
 *  unit anyway and don't need to 'restore' any state prior to
 *  that first use.
 */

#define MMX_SAVE_BLOCK(missfunc)                                        \
        cmpl    $2048,%ecx ;                                            \
        jb      missfunc ;                                              \
        movl    MYCPU,%eax ;            /* EAX = MYCPU */               \
        btsl    $1,GD_FPU_LOCK(%eax) ;                                  \
        jc      missfunc ;                                              \
        pushl   %ebx ;                                                  \
        pushl   %ecx ;                                                  \
        movl    GD_CURTHREAD(%eax),%edx ; /* EDX = CURTHREAD */         \
        movl    TD_SAVEFPU(%edx),%ebx ; /* save app save area */        \
        addl    $TDPRI_CRIT,TD_PRI(%edx) ;                              \
        cmpl    $0,GD_NPXTHREAD(%eax) ;                                 \
        je      100f ;                                                  \
        fxsave  0(%ebx) ;               /* race(1) */                   \
   

Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-31 Thread Peter Jeremy
On Thu, 2005-Mar-31 17:17:58 +1000, Bruce Evans wrote:
On the i386 (and probably most other CPUs), you can place the FPU into
an unavailable state.  This means that any attempt to use it will
trigger a trap.  The kernel will then restore FPU state and return.
On a normal system call, if the FPU hasn't been used, the kernel will
see that it's still in an unavailable state and can avoid saving the
state.  (On an i386, unavailable state is achieved by either setting
CR0_TS or CR0_EM).  This means you avoid having to always restore FPU
state at the expense of an additional trap if the process actually
uses the FPU.

I remember that you (Peter) did extensive benchmarks of this.

That was a long time ago and I don't recall them being that extensive.
I suspect the results are in my archives at work - I can't quickly
find them here.  From memory the tests were on 2.2 and just counted
the number of context switches, FP saves and restores.

  I still
think fully lazy switching (c2) is the best general method.

I think it depends on the FP workload.  It's a definite win if there
is exactly one FP thread - in this case the FPU state never needs to
be saved (and you could even optimise away the DNA trap by clearing
the TS and EM bits if the switched-to curthread is fputhread).

The worst case is two (or more) FP-intensive threads - in this case,
lazy switching is of no benefit.  The DNA trap overheads mean that
the performance is worse than just saving/restoring the FP state
during a context switch.

My guess is that the current generation workstation is closer to the
second case - current generation graphical bloatware uses a lot of
FP for rendering, not to mention that the idle task has a reasonable
chance of being an FP-intensive distributed computing task (setiathome
or similar).  It's probably time to do some more measuring (I'm not
offering just now, I have lots of other things on my TODO list).

SMP adds a whole new can of worms.  (I originally suspected that lazy
switching had been lost during the SMP transition).  Given CPU (FPU)
affinity, you can just apply the above per CPU, but I'm not sure
that changes my conclusion.

  Maybe FP state should be loaded in advance based on FPU affinity.

Pre-loading the FPU state is an advantage for FP-intensive threads -
if the thread will definitely use the FPU before the next context
switch, you save the cost of a DNA trap by pre-loading the FPU state.

  It might be
good for CPU affinity to depend on FPU use (prefer not to switch
threads away from a CPU if they own that CPU via its FPU).

FPU affinity is only an advantage if full lazy switching is implemented.
(And I thought we didn't even have CPU affinity working well).  The
first step is probably gathering some data on whether lazy switching
is any benefit.

BTW, David and I recently found a bug in the context switching in the
fxsr case, at least on Athlon-XP's and AMD64's.

I gather this is not noticeable unless the application is doing its
own FPU save/restore.  Is there a solution or work-around?

-- 
Peter Jeremy


Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-31 Thread Bruce Evans
On Wed, 30 Mar 2005, David Schultz wrote:
On Wed, Mar 30, 2005, Peter Jeremy wrote:
On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote:
Later in that thread they discuss skipping the restore state to make
things faster.  The minimum buffer size they say this will be good for
is between 2-4k.  Does this make sense, or am I showing my ignorance?
http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html
Yes.  There are a variety of options for saving/restoring FPU state:
a) save FPU state on kernel entry
b) save FPU state on a context switch (or if the kernel wants the FPU)
c) only save FPU state if a different process (or the kernel) wants the FPU
1) restore FPU on kernel exit
2) restore FPU state if a process wants the FPU.
a and 1 are the most obvious - that's the way the integer registers are
handled.
I thought FreeBSD used to be c2 but it seems it has been changed to b2
since I looked last.
No, it always used b2.  I never got around to implementing c2.
Linux used to implement c2 on i386's, but I think it switched (to b2?) to
optimize (or at least simplify) the SMP case.
Based on the mail above, it looks like Dfly was changed from 1 to 2
(I'm not sure if it's 'a' or 'c' on save).
'a' seems to be too inefficient to ever use.  '1' makes sense if it
rarely happens and/or the kernel can often use the FPU more than once
per entry (which it probably shouldn't), but it gives complications
like the ones for SMP, especially in FreeBSD where the kernel can be
preempted.
Saving FP state as needed is simplest but can be slow.  My Athlon-with-
SSE-extensions pagecopy and pagezero routines use the FPU (XMM) but
their FP state save isn't slow because only 1 or 2 XMM registers need
to be saved.  E.g., the saving part of sse_pagezero_for_some_athlons() is:
        pushfl                  # Also have to save %eflags.
        cli                     # Switch %eflags as needed to safely use FPU.
        movl    %cr0,%eax       # Also have to save %cr0.
        clts                    # Switch %cr0 as needed to use FPU.
        subl    $16,%esp        # Space to save some FP state.
        movups  %xmm0,(%esp)    # Save some FP state.  Only this much needed.
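The restoring part just mirrors it; roughly (reconstructed here for
completeness, not copied from the committed code):

        movups  (%esp),%xmm0    # Restore the one saved XMM register.
        addl    $16,%esp        # Release the save space.
        movl    %eax,%cr0       # Restore %cr0 (CR0_TS as it was).
        popfl                   # Restore %eflags (interrupts back on if they were).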
On the i386 (and probably most other CPUs), you can place the FPU into
an unavailable state.  This means that any attempt to use it will
trigger a trap.  The kernel will then restore FPU state and return.
On a normal system call, if the FPU hasn't been used, the kernel will
see that it's still in an unavailable state and can avoid saving the
state.  (On an i386, unavailable state is achieved by either setting
CR0_TS or CR0_EM).  This means you avoid having to always restore FPU
state at the expense of an additional trap if the process actually
uses the FPU.
I remember that you (Peter) did extensive benchmarks of this.  I still
think fully lazy switching (c2) is the best general method.  Maybe FP
state should be loaded in advance based on FPU affinity.  It might be
good for CPU affinity to depend on FPU use (prefer not to switch
threads away from a CPU if they own that CPU via its FPU).
This is basically what FreeBSD does on i386 and amd64.  (As a
disclaimer, I haven't read the code very carefully, so I might be
missing some of the details.)  Upon taking a trap for a process
that has never used the FPU before, we save the FPU state for the
last process to use the FPU, then load a fresh FPU state.  On
We don't save the FPU state for the last thread then (c2 behaviour)
since we have already saved it when we switched away from it.
npxdna() panics if we haven't done that.  Except rev.1.131 added bogus
code (apparently to debug or hide bugs in the other changes in rev.1.131)
that breaks the panic in the fpcurthread == curthread case.
subsequent context switches, the FPU state for processes that have
already used the FPU gets loaded before entering user mode, I
think.  I haven't studied the code in enough detail to know what
No, that doesn't happen.  Instead, cpu_switch() has called npxsave()
on the context switch away from the thread.  npxsave() arranges for
a trap on the next use of the FPU, and we don't do anything more with
the FPU context of the thread until the thread tries to use the FPU
(in userland).  Then we take the trap and load the saved context in
npxdna().
happens for SMP, where a process could be scheduled on a different
processor before its FPU state is saved on the first processor.
There is no difference for SMP, but there would be large complicated
differences if we did fully lazy saving.  npxdna() would have to do
something like sending an IPI to the thread that owns the FPU if
that thread could be different from curthread.  This would be slow,
but might be worth doing if it didn't happen much and if fully lazy
context switching were a significant advantage.  I think it
could be arranged to not happen much, but the advantage is insignificant.
BTW, David and I recently found a bug in the context switching in the
fxsr case, at least on 

Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-31 Thread Bruce Evans
On Thu, 31 Mar 2005, Peter Jeremy wrote:
On Thu, 2005-Mar-31 17:17:58 +1000, Bruce Evans wrote:
 I still
think fully lazy switching (c2) is the best general method.
I think it depends on the FP workload.  It's a definite win if there
is exactly one FP thread - in this case the FPU state never needs to
be saved (and you could even optimise away the DNA trap by clearing
the TS and EM bits if the switched-to curthread is fputhread).
I think stopping the trap would be the usual method (not sure what
Linux did), but to collect statistics for determining affinity you
would want to take the trap anyway.
The worst case is two (or more) FP-intensive threads - in this case,
lazy switching is of no benefit.  The DNA trap overheads mean that
the performance is worse than just saving/restoring the FP state
during a context switch.
My guess is that the current generation workstation is closer to the
second case - current generation graphical bloatware uses a lot of
FP for rendering, not to mention that the idle task has a reasonable
chance of being an FP-intensive distributed computing task (setiathome
or similar).  It's probably time to do some more measuring (I'm not
offering just now, I have lots of other things on my TODO list).
Bloatware might be so hoggish that it rarely makes context switches :-).
Context switches for interrupts increase the problem though, as would
using FP more in the kernel.
BTW, David and I recently found a bug in the context switching in the
fxsr case, at least on Athlon-XP's and AMD64's.
I gather this is not noticeable unless the application is doing its
own FPU save/restore.  Is there a solution or work-around?
It's most noticeable for debugging, and if you worry about leaking
thread context.  Fortunately, the last-instruction pointers won't
have real user data in them unless the application encodes it there
intentionally.  I can't see any efficient solution or workaround.
The kernel should do a full save/restore for processes being debugged.
For applications, the bug seems to be larger.  Even if they know about
the AMD behaviour and do a full save/restore because they need it, it
won't work because the kernel doesn't preserve the state across
context switches.  Applications like vmware might care more than most.
I forgot to mention that we couldn't find anything in Intel manuals
about this behaviour, so it might be completely AMD-specific.  Also,
the instruction pointers are fundamentally broken for 64-bit CPUs,
since although they are 64 bits, they have the segment selector encoded
in their top 32 bits, so they are not really different from the 32:32
selector:pointer format for the non-fxsr case.  Their format is specified
by SSE2 so 64-bit extensions would have to be elsewhere, but amd64 doesn't
seem to extend them.
Bruce


Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-31 Thread Matthew Dillon
All I really did was implement a comment that DG had made many years
ago in the PCB structure about making the FPU save area a pointer rather
than hardwiring it into the PCB.

This greatly reduces the complexity of work required to allow
the kernel to 'borrow' the FPU.   It basically allows the kernel
to 'stack' save contexts rather than swap-out save contexts.  The
result is that the cross-over point for the copy size where the FPU
becomes economical is a much lower value (~2K rather than ~4-8K).  The
FPU overhead differences between DFly and FreeBSD for bcopy only matter
for buffers between 2K and 16K in size.  After that the copy itself
overshadows the FPU setup overhead.
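As a sketch of that dispatch in C (illustrative only, not the actual bcopy;
the 2048 threshold matches the "cmpl $2048,%ecx ; jb missfunc" check in the
MMX_SAVE_BLOCK macro posted elsewhere in this thread):

#include <stddef.h>
#include <string.h>

#define FPU_COPY_MIN    2048    /* below this the FPU setup cost dominates */

void
kbcopy_sketch(const void *src, void *dst, size_t len)
{
    if (len < FPU_COPY_MIN) {
        memcpy(dst, src, len);          /* stand-in for the integer path */
        return;
    }
    /* Borrow the FPU: save user state if needed, point TD_SAVEFPU at a
     * scratch area, clts. */
    memcpy(dst, src, len);              /* stand-in for the XMM/MMX loop */
    /* Release the FPU: restore TD_SAVEFPU, set CR0_TS so userland faults
     * its state back in lazily. */
}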

In DFly the kernel must still check to see whether userland has used
the FPU and save the state before it reuses the FPU in the kernel.
We don't bother to restore the state, we simply allow userland to take
another fault (the idea being that if userland is making several I/O
calls into the kernel in a batch, the FPU state is only saved once).

Once the kernel has done this and adjusted the FPU save area it can
use the FPU at a whim, even through blocking conditions, and then just
throw away the FPU context when it is done.  We could theoretically 
stack multiple kernel FPU contexts through this mechanism but I don't
see much advantage to it so I don't... I have a lockout bit so if the
kernel is already using the FPU and takes e.g. a preemptive interrupt,
it doesn't go and use the FPU within that preemption.
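Schematically, in C (illustrative only; the real code is the assembly in
i386/bcopy.s and the npx code, and the structure fields here are simplified
versions of TD_SAVEFPU/GD_NPXTHREAD/GD_CURTHREAD):

struct savefpu {
    unsigned char fx[512];              /* fxsave image */
};

struct thread {
    struct savefpu *td_savefpu;         /* where this thread's FP state goes */
    struct savefpu  td_userfpu;         /* normal (userland) save area */
};

struct globaldata {
    struct thread  *gd_curthread;
    struct thread  *gd_npxthread;       /* owner of the live FPU state, or NULL */
};

extern struct globaldata *mycpu;

void
kernel_fpu_borrow(struct savefpu *scratch)
{
    struct thread *td = mycpu->gd_curthread;

    if (mycpu->gd_npxthread != 0) {
        /* fxsave(mycpu->gd_npxthread->td_savefpu); -- save user state once */
        mycpu->gd_npxthread = 0;
    }
    td->td_savefpu = scratch;           /* stack a throw-away save area */
    /* clts(); fninit(); -- FPU now usable by the kernel */
}

void
kernel_fpu_release(void)
{
    struct thread *td = mycpu->gd_curthread;

    td->td_savefpu = &td->td_userfpu;   /* unstack: back to the user area */
    /* stts(); -- userland faults its own state back in on next FP use */
}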

The use of the XMM registers is a cpu optimization.  Modern CPUs,
especially AMD Athlon and Opterons, are more efficient with 128 bit 
moves than with 64 bit moves.   I experimented with all sorts of
configurations, including the use of special data caching instructions,
but they had so many special cases and degenerate conditions that
I found that simply using straight XMM instructions, reading as big
a glob as possible, then writing the glob, was by far the best solution.

The key for fast block copying is to not issue any memory writes other
than those related directly to the data being copied.  This avoids
unnecessary RAS cycles which would otherwise kill copying performance.
In tests I found that copying multi-page blocks in a single loop was
far more efficient than copying data page-by-page precisely because
page-by-page copying was too complex to be able to avoid extraneous
writes to memory unrelated to the target buffer in between each page copy.

-Matt



Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-30 Thread Peter Jeremy
On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote:
Later in that thread they discuss skipping the restore state to make 
things faster.  The minimum buffer size they say this will be good for 
is between 2-4k.  Does this make sense, or am I showing my ignorance?

http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html

Yes.  There are a variety of options for saving/restoring FPU state:
a) save FPU state on kernel entry
b) save FPU state on a context switch (or if the kernel wants the FPU)
c) only save FPU state if a different process (or the kernel) wants the FPU
1) restore FPU on kernel exit
2) restore FPU state if a process wants the FPU.

a and 1 are the most obvious - that's the way the integer registers are
handled.

I thought FreeBSD used to be c2 but it seems it has been changed to b2
since I looked last.

Based on the mail above, it looks like Dfly was changed from 1 to 2
(I'm not sure if it's 'a' or 'c' on save).

On the i386 (and probably most other CPUs), you can place the FPU into
an unavailable state.  This means that any attempt to use it will
trigger a trap.  The kernel will then restore FPU state and return.
On a normal system call, if the FPU hasn't been used, the kernel will
see that it's still in an unavailable state and can avoid saving the
state.  (On an i386, unavailable state is achieved by either setting
CR0_TS or CR0_EM).  This means you avoid having to always restore FPU
state at the expense of an additional trap if the process actually
uses the FPU.
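In rough pseudo-kernel C the mechanism looks something like this
(illustrative only, with made-up names; the real i386 code, npxdna() and
npxsave(), differs in detail):

#include <stddef.h>

struct fpstate { unsigned char image[512]; };

struct thread {
    struct fpstate  td_fpstate;
    int             td_used_fpu;
};

static struct thread *fpu_owner;        /* thread whose state is live in the FPU */

/*
 * Device-not-available (#NM) trap: the first FP instruction executed
 * while CR0_TS is set lands here, and only then is state restored.
 */
void
dna_trap(struct thread *td)
{
    /* clts(); -- make the FPU usable again */
    /* fxrstor(&td->td_fpstate); -- load this thread's saved state */
    fpu_owner = td;
    td->td_used_fpu = 1;
}

/*
 * Context switch away from td: if it owns the FPU, save its state now
 * and arrange a trap for the next FP use.  Kernel entry/exit touches
 * nothing unless the FPU was actually used, which is the point.
 */
void
switch_out(struct thread *td)
{
    if (fpu_owner == td) {
        /* fxsave(&td->td_fpstate); */
        /* stts(); -- set CR0_TS so the next FP use traps */
        fpu_owner = NULL;
    }
}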

-- 
Peter Jeremy


Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-30 Thread David Schultz
On Wed, Mar 30, 2005, Peter Jeremy wrote:
 On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote:
 Later in that thread they discuss skipping the restore state to make 
 things faster.  The minimum buffer size they say this will be good for 
 is between 2-4k.  Does this make sense, or am I showing my ignorance?
 
 http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html
 
 Yes.  There are a variety of options for saving/restoring FPU state:
 a) save FPU state on kernel entry
 b) save FPU state on a context switch (or if the kernel wants the FPU)
 c) only save FPU state if a different process (or the kernel) wants the FPU
 1) restore FPU on kernel exit
 2) restore FPU state if a process wants the FPU.
 
 a and 1 are the most obvious - that's the way the integer registers are
 handled.
 
 I thought FreeBSD used to be c2 but it seems it has been changed to b2
 since I looked last.
 
 Based on the mail above, it looks like Dfly was changed from 1 to 2
 (I'm not sure if it's 'a' or 'c' on save).
 
 On the i386 (and probably most other CPUs), you can place the FPU into
 an unavailable state.  This means that any attempt to use it will
 trigger a trap.  The kernel will then restore FPU state and return.
 On a normal system call, if the FPU hasn't been used, the kernel will
 see that it's still in an unavailable state and can avoid saving the
 state.  (On an i386, unavailable state is achieved by either setting
 CR0_TS or CR0_EM).  This means you avoid having to always restore FPU
 state at the expense of an additional trap if the process actually
 uses the FPU.

This is basically what FreeBSD does on i386 and amd64.  (As a
disclaimer, I haven't read the code very carefully, so I might be
missing some of the details.)  Upon taking a trap for a process
that has never used the FPU before, we save the FPU state for the
last process to use the FPU, then load a fresh FPU state.  On
subsequent context switches, the FPU state for processes that have
already used the FPU gets loaded before entering user mode, I
think.  I haven't studied the code in enough detail to know what
happens for SMP, where a process could be scheduled on a different
processor before its FPU state is saved on the first processor.


Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-29 Thread Peter Jeremy
On Mon, 2005-Mar-28 23:23:19 -0800, David Leimbach wrote:
meant to send this to the list too... sorry
 Are you implying DragonFly uses FPU/SIMD?  For that matter does any kernel?

I believe it does use SIMD for some of its fast memcopy stuff for
its messaging system
actually.  I remember Matt saying he was working on it.

http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00262.html

That's almost a year ago and specifically for the amd64.  Does anyone
know what the results were?

If you can manage the alignment issues it can be a huge win.

For message passing within the kernel, you should be able to mandate
alignment as part of the API.

I see the bigger issue being the need to save/restore the SIMD
engine's state during a system call.  Currently, this is only saved
if a different process wants to use the SIMD engine.  For MMX, the
SIMD state is the FPU state - which is non-trivial.  The little
reading I've done suggests that SSE and SSE2 are even larger.

Saving the SIMD state would be more expensive than using integer
registers for small (and probably medium-sized) copies.

-- 
Peter Jeremy


Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-29 Thread David Malone
On Tue, Mar 29, 2005 at 09:11:07PM +1000, Peter Jeremy wrote:
 That's almost a year ago and specifically for the amd64.  Does anyone
 know what the results were?

I had a quick dig around on cvsweb this morning:


http://grappa.unix-ag.uni-kl.de/cgi-bin/cvsweb/src/sys/i386/i386/bcopy.s?cvsroot=dragonfly


I dunno if Matt has benchmarks for before and after.

David.


Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-29 Thread Devon H. O'Dell
On Tue, Mar 29, 2005 at 02:12:53PM +0100, David Malone wrote:
 On Tue, Mar 29, 2005 at 09:11:07PM +1000, Peter Jeremy wrote:
  That's almost a year ago and specifically for the amd64.  Does anyone
  know what the results were?
 
 I had a quick dig around on cvsweb this morning:
 
   
 http://grappa.unix-ag.uni-kl.de/cgi-bin/cvsweb/src/sys/i386/i386/bcopy.s?cvsroot=dragonfly
 
 
 I dunno if Matt has benchmarks for before and after.
 
   David.

I believe it was asserted on the list that the modification
doubled the performance. I have not tested this.

--Devon




Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-29 Thread David Leimbach
On Tue, 29 Mar 2005 21:11:07 +1000, Peter Jeremy
[EMAIL PROTECTED] wrote:
 On Mon, 2005-Mar-28 23:23:19 -0800, David Leimbach wrote:
 meant to send this to the list too... sorry
  Are you implying DragonFly uses FPU/SIMD?  For that matter does any kernel?
 
 I believe it does use SIMD for some of its fast memcopy stuff for
 its messaging system
 actually.  I remember Matt saying he was working on it.
 
 http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00262.html
 
 That's almost a year ago and specifically for the amd64.  Does anyone
 know what the results were?

Actually I don't remember precisely what came of it, but I do remember that
we had some interesting stability issues while Matt worked out some bugs around
that time; I think they were related to the SIMD stuff.

 
 If you can manage the alignment issues it can be a huge win.
 
 For message passing within the kernel, you should be able to mandate
 alignment as part of the API.
 
 I see the bigger issue being the need to save/restore the SIMD
 engine's state during a system call.  Currently, this is only saved
 if a different process wants to use the SIMD engine.  For MMX, the
 SIMD state is the FPU state - which is non-trivial.  The little
 reading I've done suggests that SSE and SSE2 are even larger.
 
 Saving the SIMD state would be more expensive than using integer
 registers for small (and probably medium-sized) copies.
 

Yes, you'd have to have a fairly smart copy to know when to avoid the
setup overhead.  Apple's bcopy stuff does a lot of checking if I recall.  
It's been a while since I've looked at that either.  [the stuff that's mapped
into the COMM_PAGE of Mac OS X 10.3.x processes]

Dave

 --
 Peter Jeremy



Re: Fwd: 5-STABLE kernel build with icc broken

2005-03-29 Thread jason henson
Peter Jeremy wrote:
On Mon, 2005-Mar-28 23:23:19 -0800, David Leimbach wrote:
meant to send this to the list too... sorry
Are you implying DragonFly uses FPU/SIMD?  For that matter does any kernel?

I believe it does use SIMD for some of its fast memcopy stuff for
its messaging system
actually.  I remember Matt saying he was working on it.
http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00262.html

That's almost a year ago and specifically for the amd64.  Does anyone
know what the results were?

If you can manage the alignment issues it can be a huge win.

For message passing within the kernel, you should be able to mandate
alignment as part of the API.
I see the bigger issue being the need to save/restore the SIMD
engine's state during a system call.  Currently, this is only saved
if a different process wants to use the SIMD engine.  For MMX, the
SIMD state is the FPU state - which is non-trivial.  The little
reading I've done suggests that SSE and SSE2 are even larger.
Saving the SIMD state would be more expensive than using integer
registers for small (and probably medium-sized) copies.

Later in that thread they discuss skipping the restore state to make 
things faster.  The minimum buffer size they say this will be good for 
is between 2-4k.  Does this make sense, or am I showing my ignorance?

http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html