Re: Fwd: 5-STABLE kernel build with icc broken
On Fri, 1 Apr 2005, Matthew Dillon wrote:

:>The use of the XMM registers is a cpu optimization.  Modern CPUs,
:>especially AMD Athlons and Opterons, are more efficient with 128-bit
:>moves than with 64-bit moves.  I experimented with all sorts of
:>configurations, including the use of special data caching instructions,
:>but they had so many special cases and degenerate conditions that
:>I found that simply using straight XMM instructions, reading as big
:>a glob as possible, then writing the glob, was by far the best solution.
:
:Are you sure about that?  The amd64 optimization manual says (essentially)

This is in 25112.PDF section 5.16 ("Interleave Loads and Stores", with
128 bits of loads followed by 128 bits of stores).

:that big globs are bad, and my benchmarks confirm this.  The best glob size
:is 128 bits according to my benchmarks.  This can be obtained using 2
:...
:
:Unfortunately (since I want to avoid using both MMX and XMM), I haven't
:managed to make copying through 64-bit integer registers work as well.
:Copying 128 bits at a time using 2 pairs of movq's through integer
:registers gives only 7.9GB/sec.  movq through MMX is never that slow.
:However, movdqu through xmm is even slower (7.4GB/sec).

I forgot many of my earlier conclusions when I wrote the above.  The
speeds between 7.4GB/sec and 12.9GB/sec for the fully (L1) cached case
are almost irrelevant; they basically just tell how well we have used
the instruction bandwidth.  Plain movsq uses it better and gets
15.9GB/sec.  I believe 15.9GB/sec comes from saturating the L1 cache.
The CPU is an Athlon64 clocked at 1994 MHz, and I think the max L1 copy
bandwidth is 8 bytes copied per cycle (an 8-byte load plus an 8-byte
store, which is what movsq issues): 8*1994*10^6 is 15.95GB/sec (disk
manufacturers' GB's).  Plain movsq is best here for many other cases
too...

:The fully cached case is too unrepresentative of normal use, and normal
:(partially cached) use is hard to benchmark, so I normally benchmark
:the fully uncached case.
:For that, movnt* is best for benchmarks but
:not for general use, and it hardly matters which registers are used.

    Yah, I'm pretty sure.  I tested the fully cached (L1), partially
    cached (L2), and the fully uncached cases.

By the partially cached case, I meant the case where some of the source
and/or target addresses are in the L1 or L2 cache, but you don't really
know the chance that they are there (or should be there after the copy),
so you can only guess the best strategy.

    I don't have a logic analyzer, but what I think is happening is that
    the cpu's write buffer is messing around with the reads and causing
    extra RAS cycles to occur.  I also tested using various combinations
    of movdqa, movntdq, and prefetchnta.

Somehow I'm only seeing small variations from different strategies now,
with all tests done in userland on an Athlon64 system (and on AthlonXP
systems for reference).  Using XMM or MMX can be twice as fast on the
AthlonXPs, but movsq is absolutely the fastest in many cases on the
Athlon64, and is < 5% slower than the fastest in all cases (except for
the fully uncached case, since it can't do nontemporal stores), so it is
the best general method.

    ...

    I also think there might be some odd instruction pipeline effects
    that skew the results when only one or two instructions are between
    the load into an %xmm register and the store from the same register.
    I tried using 2, 4, and 8 XMM registers.  8 XMM registers seemed to
    work the best.

I'm getting only small variations from different load/store patterns.

    Of course, I primarily tested on an Athlon 64 3200+, so YMMV.  (One
    of the first Athlon 64's, so it has a 1MB L2 cache.)
My test system is very similar:

%%%
CPU: AMD Athlon(tm) 64 Processor 3400+ (1994.33-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0xf48  Stepping = 8
  Features=0x78bfbff
  AMD Features=0xe0500800
L1 2MB data TLB: 8 entries, fully associative
L1 2MB instruction TLB: 8 entries, fully associative
L1 4KB data TLB: 32 entries, fully associative
L1 4KB instruction TLB: 32 entries, fully associative
L1 data cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L1 instruction cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L2 2MB unified TLB: 0 entries, disabled/not present
L2 4KB data TLB: 512 entries, 4-way associative
L2 4KB instruction TLB: 512 entries, 4-way associative
L2 unified cache: 1024 kbytes, 64 bytes/line, 1 lines/tag, 16-way associative
%%%

    The prefetchnta I have commented out seemed to improve performance,
    but it requires 3dNOW and I didn't want to NOT have an MMX copy mode
    for cpu's with MMX but without 3dNOW.  Prefetching less than 128
    bytes did not help, and prefetching greater than 128 bytes (e.g.
    256(%esi)) seemed to cause extra RAS cycles.  It was unbelievably
    finicky, not at all what I expected.

Prefetching is showing some very good effects here, but there are MD
complications:
- the Athlon[32] optimization manual
Re: Fwd: 5-STABLE kernel build with icc broken
Here is the core of the FPU setup and restoration code for the kernel
bcopy in DragonFly, from i386/bcopy.s.  DragonFly uses the
TD_SAVEFPU-is-a-pointer method that was outlined in the original comment
in the FreeBSD code.  I further enhanced the algorithm to guarantee that
the FPU is in a sane state (it does not require any further
initialization other than a clts) if userland has NOT used it.  However,
there are definitely some race cases that must be considered (see the
comments).

The on-fault handling in DragonFly is stackable (which further
simplifies the whole mess of on-fault vs non-on-fault copying code) and
the DFly bcopy just sets up the frame for it whether or not the onfault
handling is actually needed.

This could be further optimized, but I had already spent at least a
month on it and had to move on to other things.  In particular, the
setting of CR0_TS and the restoration of TD_SAVEFPU could be moved to
the syscall-return code, so multiple in-kernel bcopy operations could be
issued without any further FPU setup or teardown.

-Matt

/*
 * RACES/ALGORITHM:
 *
 *	If gd_npxthread is not NULL we must save the application's
 *	current FP state to the current save area and then NULL
 *	out gd_npxthread to interlock against new interruptions
 *	changing the FP state further.
 *
 *	If gd_npxthread is NULL the FP unit is in a known 'safe'
 *	state and may be used once the new save area is installed.
 *
 *	race(1): If an interrupt occurs just prior to calling fxsave
 *	all that happens is that fxsave gets a npxdna trap, restores
 *	the app's environment, and immediately traps, restores,
 *	and saves it again.
 *
 *	race(2): No interrupt can safely occur after we NULL-out
 *	npxthread until we fninit, because the kernel assumes that
 *	the FP unit is in a safe state when npxthread is NULL.  It's
 *	more convenient to use a cli sequence here (it is not
 *	considered to be in the critical path), but a critical
 *	section would also work.
 *
 *	race(3): The FP unit is in a known state (because npxthread
 *	was either previously NULL or we saved and init'd and made
 *	it NULL).  This is true even if we are preempted and the
 *	preempting thread uses the FP unit, because it will be
 *	fninit'd again on return.  ANY STATE WE SAVE TO THE FPU MAY
 *	BE DESTROYED BY PREEMPTION WHILE NPXTHREAD IS NULL!  However,
 *	an interrupt occurring in between clts and the setting of
 *	gd_npxthread may set the TS bit again and cause the next
 *	npxdna() to panic when it sees a non-NULL gd_npxthread.
 *
 *	We can safely set TD_SAVEFPU to point to a new uninitialized
 *	save area and then set GD_NPXTHREAD to non-NULL.  If an
 *	interrupt occurs after we set GD_NPXTHREAD, all that happens
 *	is that the safe FP state gets saved and restored.  We do not
 *	need to fninit again.
 *
 *	We can safely clts after setting up the new save-area, before
 *	installing gd_npxthread, even if we get preempted just after
 *	calling clts.  This is because the FP unit will be in a safe
 *	state while gd_npxthread is NULL.  Setting gd_npxthread will
 *	simply lock-in that safe-state.  Calling clts saves
 *	unnecessary trap overhead since we are about to use the FP
 *	unit anyway and don't need to 'restore' any state prior to
 *	that first use.
 */
#define MMX_SAVE_BLOCK(missfunc)					\
	cmpl	$2048,%ecx ;						\
	jb	missfunc ;						\
	movl	MYCPU,%eax ;			/* EAX = MYCPU */	\
	btsl	$1,GD_FPU_LOCK(%eax) ;					\
	jc	missfunc ;						\
	pushl	%ebx ;							\
	pushl	%ecx ;							\
	movl	GD_CURTHREAD(%eax),%edx ;	/* EDX = CURTHREAD */	\
	movl	TD_SAVEFPU(%edx),%ebx ;		/* save app save area */\
	addl	$TDPRI_CRIT,TD_PRI(%edx) ;				\
	cmpl	$0,GD_NPXTHREAD(%eax) ;					\
	je	100f ;							\
	fxsave	0(%ebx) ;			/* race(1) */		\
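The save-area bookkeeping described in the comment can be modeled as a toy state machine. This is a Python illustration only — the names mirror gd_npxthread and TD_SAVEFPU from the asm above, but nothing here is kernel code, and interrupts/preemption (the actual races) are not modeled:

```python
# Toy model of the TD_SAVEFPU bookkeeping described in the comment above.
# Purely illustrative; interrupts and preemption are not modeled.

class FPU:
    def __init__(self):
        self.regs = None         # live register contents, None = fninit'd
        self.npxthread = None    # gd_npxthread: owner of the live state

class Thread:
    def __init__(self):
        self.savefpu = []        # TD_SAVEFPU: points at a save area

def kernel_borrow_fpu(fpu, curthread):
    """Borrow the FPU for a kernel bcopy: spill the app's live state
    (if any), install a fresh save area, and take ownership."""
    app_area = curthread.savefpu          # remembered like %ebx above
    if fpu.npxthread is not None:
        app_area.append(fpu.regs)         # fxsave into the app's area
        fpu.npxthread = None              # interlock: FPU now 'safe'
        fpu.regs = None                   # fninit: known-clean state
    curthread.savefpu = []                # new uninitialized save area
    fpu.npxthread = curthread             # lock in the safe state
    return app_area                       # kept to restore TD_SAVEFPU later

# Usage: an app owns the FPU, then the kernel borrows it in that
# thread's context.
fpu, app = FPU(), Thread()
fpu.regs, fpu.npxthread = "app state", app
saved = kernel_borrow_fpu(fpu, app)
assert saved == ["app state"]     # the app's state was spilled first
assert fpu.npxthread is app       # kernel (in app's context) owns the FPU
```

The point of the pointer-swap is visible in `curthread.savefpu = []`: the app's area is left intact and merely set aside, which is what makes the contexts stackable rather than swapped out.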
Re: Fwd: 5-STABLE kernel build with icc broken
:>The use of the XMM registers is a cpu optimization.  Modern CPUs,
:>especially AMD Athlons and Opterons, are more efficient with 128-bit
:>moves than with 64-bit moves.  I experimented with all sorts of
:>configurations, including the use of special data caching instructions,
:>but they had so many special cases and degenerate conditions that
:>I found that simply using straight XMM instructions, reading as big
:>a glob as possible, then writing the glob, was by far the best solution.
:
:Are you sure about that?  The amd64 optimization manual says (essentially)
:that big globs are bad, and my benchmarks confirm this.  The best glob size
:is 128 bits according to my benchmarks.  This can be obtained using 2
:...
:
:Unfortunately (since I want to avoid using both MMX and XMM), I haven't
:managed to make copying through 64-bit integer registers work as well.
:Copying 128 bits at a time using 2 pairs of movq's through integer
:registers gives only 7.9GB/sec.  movq through MMX is never that slow.
:However, movdqu through xmm is even slower (7.4GB/sec).
:
:The fully cached case is too unrepresentative of normal use, and normal
:(partially cached) use is hard to benchmark, so I normally benchmark
:the fully uncached case.  For that, movnt* is best for benchmarks but
:not for general use, and it hardly matters which registers are used.

Yah, I'm pretty sure.  I tested the fully cached (L1), partially cached
(L2), and the fully uncached cases.  I don't have a logic analyzer, but
what I think is happening is that the cpu's write buffer is messing
around with the reads and causing extra RAS cycles to occur.

I also tested using various combinations of movdqa, movntdq, and
prefetchnta.  Carefully arranged non-temporal and/or prefetch
instructions were much faster for the uncached case, but much, MUCH
slower for the partially cached (L2) or fully (L1) cached case, making
them unsuitable for a generic copy.  I am rather miffed that AMD screwed
up the non-temporal instructions so badly.
I also think there might be some odd instruction pipeline effects that
skew the results when only one or two instructions are between the load
into an %xmm register and the store from the same register.  I tried
using 2, 4, and 8 XMM registers.  8 XMM registers seemed to work the
best.  Of course, I primarily tested on an Athlon 64 3200+, so YMMV.
(One of the first Athlon 64's, so it has a 1MB L2 cache.)

:>The key for fast block copying is to not issue any memory writes other
:>than those related directly to the data being copied.  This avoids
:>unnecessary RAS cycles which would otherwise kill copying performance.
:>In tests I found that copying multi-page blocks in a single loop was
:>far more efficient than copying data page-by-page precisely because
:>page-by-page copying was too complex to be able to avoid extraneous
:>writes to memory unrelated to the target buffer in between each page copy.
:
:By page-by-page, do you mean prefetch a page at a time into the L1
:cache?

No.  I meant that taking, e.g., a vm_page_t array and doing page-by-page
mapping and copying in 4K chunks seems to be a lot slower than doing a
linear mapping of the entire vm_page_t array and doing one big copy.
Literally the same code, just rearranged a bit.  Just writing to the
stack in between each page was enough to throw it off.

:I've noticed strange loss (apparently) from extraneous reads or writes
:more for benchmarks that do just (very large) writes.  On at least old
:Celerons and AthlonXPs, the writes go straight to the L1/L2 caches
:(unless you use movntq on AthlonXP's).  The caches are flushed to main
:memory some time later, apparently not very well since some pages take
:more than twice as long to write as others (as seen by the writer
:filling the caches), and the slow case happens enough to affect the
:average write speed by up to 50%.  This problem can be reduced by
:putting memory bank bits in the page colors.
:This is hard to get right
:even for the simple unrepresentative case of large writes.
:
:Bruce

I've seen the same effects and come to the same conclusion.  The copy
code I eventually settled on was this (taken from my i386/bcopy.s).  It
isn't as fast as using movntdq for the fully uncached case, but it seems
to perform the best in the system because in real life the data tends to
be accessed and in the cache by someone (e.g. source data tends to be in
the cache even if the device driver doesn't touch the target data).  I
wish AMD had made movntdq work the same as movdqa for the case where the
data was already in the cache; then movntdq would have been the clear
winner.

The prefetchnta I have commented out seemed to improve performance, but
it requires 3dNOW and I didn't want to NOT have an MMX copy mode for
cpu's with MMX but without 3dNOW.
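Matt's earlier single-loop vs page-by-page point is about loop shape, not instruction choice. A minimal sketch of the two shapes (Python, not a performance model — the bookkeeping list here merely stands in for the stack writes that were enough to disturb the write buffer):

```python
PAGE = 4096

def copy_linear(src, dst):
    # One big loop over the whole linearly mapped range: no unrelated
    # memory writes land between pages.
    dst[:] = src

def copy_page_by_page(src, dst):
    # Per-page loop: the bookkeeping writes between chunks stand in
    # for the extraneous stack traffic described in the thread.
    bookkeeping = []
    for off in range(0, len(src), PAGE):
        dst[off:off + PAGE] = src[off:off + PAGE]
        bookkeeping.append(off)
    return bookkeeping

# Both produce identical results; only the write pattern differs.
src = bytearray(range(256)) * 64          # 16K source
dst1 = bytearray(len(src))
dst2 = bytearray(len(src))
copy_linear(src, dst1)
offsets = copy_page_by_page(src, dst2)
assert dst1 == src and dst2 == src
assert offsets == [0, 4096, 8192, 12288]
```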
Re: Fwd: 5-STABLE kernel build with icc broken
On Thu, 31 Mar 2005, Matthew Dillon wrote:

I didn't mean to get into the kernel's use of the FPU, but...

> All I really did was implement a comment that DG had made many years
> ago in the PCB structure about making the FPU save area a pointer
> rather than hardwiring it into the PCB.

ISTR writing something like that.  dg committed most of my early work
since I didn't have commit access at the time.

...

> The use of the XMM registers is a cpu optimization.  Modern CPUs,
> especially AMD Athlons and Opterons, are more efficient with 128-bit
> moves than with 64-bit moves.  I experimented with all sorts of
> configurations, including the use of special data caching
> instructions, but they had so many special cases and degenerate
> conditions that I found that simply using straight XMM instructions,
> reading as big a glob as possible, then writing the glob, was by far
> the best solution.

Are you sure about that?  The amd64 optimization manual says
(essentially) that big globs are bad, and my benchmarks confirm this.
The best glob size is 128 bits according to my benchmarks.  This can be
obtained using 2 64-bit reads of 64-bit registers followed by 2 64-bit
writes of these registers, or by a read-write of a single 128-bit
register.  The 64-bit registers can be either MMX or integer registers
on 64-bit systems, but the 128-bit registers must be XMM on all systems.

I get identical speeds of 12.9GB/sec (+-0.1GB/sec) on a fairly old and
slow Athlon64 system for copying 16K (fully cached) through MMX and XMM
128 bits at a time using the following instructions:

	# MMX:				# XMM:
	movq	(%0),%mm0		movdqa	(%0),%xmm0
	movq	8(%0),%mm1		movdqa	%xmm0,(%1)
	movq	%mm0,(%1)		...	# unroll same amount
	movq	%mm1,8(%1)
	...	# unroll to copy 64 bytes per iteration

Unfortunately (since I want to avoid using both MMX and XMM), I haven't
managed to make copying through 64-bit integer registers work as well.
Copying 128 bits at a time using 2 pairs of movq's through integer
registers gives only 7.9GB/sec.  movq through MMX is never that slow.
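Bruce's best glob — two 64-bit loads followed by two 64-bit stores, unrolled to 64 bytes per iteration — has the following shape. This Python sketch over 8-byte words only illustrates the load/store interleaving of the MMX column; the performance effect of course exists only with real registers:

```python
import struct

def copy_globs(src: bytes) -> bytes:
    """Copy 64 bytes per iteration as four 128-bit 'globs'; each glob
    is two 8-byte loads, then two 8-byte stores (the movq pattern)."""
    assert len(src) % 64 == 0
    out = bytearray(len(src))
    for base in range(0, len(src), 64):
        for glob in range(0, 64, 16):
            off = base + glob
            r0 = struct.unpack_from("<Q", src, off)[0]      # movq (%0),%mm0
            r1 = struct.unpack_from("<Q", src, off + 8)[0]  # movq 8(%0),%mm1
            struct.pack_into("<Q", out, off, r0)            # movq %mm0,(%1)
            struct.pack_into("<Q", out, off + 8, r1)        # movq %mm1,8(%1)
    return bytes(out)

data = bytes(range(256)) * 64   # 16K, like the benchmark size
assert copy_globs(data) == data
```

The essential property is that exactly 128 bits are in flight between the load pair and the store pair, which is the interleaving the amd64 optimization manual recommends.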
However, movdqu through xmm is even slower (7.4GB/sec).

The fully cached case is too unrepresentative of normal use, and normal
(partially cached) use is hard to benchmark, so I normally benchmark the
fully uncached case.  For that, movnt* is best for benchmarks but not
for general use, and it hardly matters which registers are used.

> The key for fast block copying is to not issue any memory writes other
> than those related directly to the data being copied.  This avoids
> unnecessary RAS cycles which would otherwise kill copying performance.
> In tests I found that copying multi-page blocks in a single loop was
> far more efficient than copying data page-by-page precisely because
> page-by-page copying was too complex to be able to avoid extraneous
> writes to memory unrelated to the target buffer in between each page
> copy.

By page-by-page, do you mean prefetch a page at a time into the L1
cache?

I've noticed strange loss (apparently) from extraneous reads or writes
more for benchmarks that do just (very large) writes.  On at least old
Celerons and AthlonXPs, the writes go straight to the L1/L2 caches
(unless you use movntq on AthlonXP's).  The caches are flushed to main
memory some time later, apparently not very well since some pages take
more than twice as long to write as others (as seen by the writer
filling the caches), and the slow case happens enough to affect the
average write speed by up to 50%.  This problem can be reduced by
putting memory bank bits in the page colors.  This is hard to get right
even for the simple unrepresentative case of large writes.

Bruce
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Fwd: 5-STABLE kernel build with icc broken
All I really did was implement a comment that DG had made many years ago
in the PCB structure about making the FPU save area a pointer rather
than hardwiring it into the PCB.  This greatly reduces the complexity of
the work required to allow the kernel to 'borrow' the FPU.  It basically
allows the kernel to 'stack' save contexts rather than swap-out save
contexts.  The result is that the cross-over point for the copy size
where the FPU becomes economical is a much lower value (~2K rather than
~4-8K).  The FPU overhead difference between DFly and FreeBSD for bcopy
only matters for buffers between 2K and 16K in size; after that the copy
itself overshadows the FPU setup overhead.

In DFly the kernel must still check whether userland has used the FPU
and save the state before it reuses the FPU in the kernel.  We don't
bother to restore the state; we simply allow userland to take another
fault (the idea being that if userland is making several I/O calls into
the kernel in a batch, the FPU state is only saved once).  Once the
kernel has done this and adjusted the FPU save area it can use the FPU
at a whim, even through blocking conditions, and then just throw away
the FPU context when it is done.  We could theoretically stack multiple
kernel FPU contexts through this mechanism but I don't see much
advantage to it so I don't...  I have a lockout bit so if the kernel is
already using the FPU and takes e.g. a preemptive interrupt, it doesn't
go and use the FPU within that preemption.

The use of the XMM registers is a cpu optimization.  Modern CPUs,
especially AMD Athlons and Opterons, are more efficient with 128-bit
moves than with 64-bit moves.  I experimented with all sorts of
configurations, including the use of special data caching instructions,
but they had so many special cases and degenerate conditions that I
found that simply using straight XMM instructions, reading as big a glob
as possible, then writing the glob, was by far the best solution.
The key for fast block copying is to not issue any memory writes other
than those related directly to the data being copied.  This avoids
unnecessary RAS cycles which would otherwise kill copying performance.
In tests I found that copying multi-page blocks in a single loop was far
more efficient than copying data page-by-page precisely because
page-by-page copying was too complex to be able to avoid extraneous
writes to memory unrelated to the target buffer in between each page
copy.

-Matt
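The ~2K crossover described above (it also appears as the `cmpl $2048,%ecx` guard in the DragonFly macro elsewhere in this thread) amounts to a size dispatch in bcopy. A hedged Python sketch — the threshold comes from the text, the path names are invented for illustration:

```python
FPU_SETUP_THRESHOLD = 2048   # below this, FPU setup overhead dominates

def bcopy(src, dst, used_paths):
    """Dispatch small copies to a plain integer path and large copies
    to the FPU/XMM path, recording which path ran (for illustration)."""
    if len(src) < FPU_SETUP_THRESHOLD:
        used_paths.append("integer")   # plain integer-register copy
    else:
        used_paths.append("xmm")       # borrow the FPU, bulk XMM copy
    dst[:len(src)] = src               # both paths copy identically

# Usage: a 100-byte copy stays on the cheap path, a 5000-byte copy
# pays the FPU setup because the bulk copy amortizes it.
paths = []
small, big = bytes(100), bytes(5000)
d1, d2 = bytearray(100), bytearray(5000)
bcopy(small, d1, paths)
bcopy(big, d2, paths)
assert paths == ["integer", "xmm"]
assert d1 == small and d2 == big
```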
Re: Fwd: 5-STABLE kernel build with icc broken
On Thu, 31 Mar 2005, Peter Jeremy wrote:

> On Thu, 2005-Mar-31 17:17:58 +1000, Bruce Evans wrote:
>> I still think fully lazy switching (c2) is the best general method.
>
> I think it depends on the FP workload.  It's a definite win if there
> is exactly one FP thread - in this case the FPU state never needs to
> be saved (and you could even optimise away the DNA trap by clearing
> the TS and EM bits if the switched-to curthread is fputhread).

I think stopping the trap would be the usual method (not sure what Linux
did), but to collect statistics for determining affinity you would want
to take the trap anyway.

> The worst case is two (or more) FP-intensive threads - in this case,
> lazy switching is of no benefit.  The DNA trap overheads mean that the
> performance is worse than just saving/restoring the FP state during a
> context switch.  My guess is that the current generation workstation
> is closer to the second case - current generation graphical bloatware
> uses a lot of FP for rendering, not to mention that the idle task has
> a reasonable chance of being an FP-intensive distributed computing
> task (setiathome or similar).  It's probably time to do some more
> measuring (I'm not offering just now, I have lots of other things on
> my TODO list).

Bloatware might be so hoggish that it rarely makes context switches :-).
Context switches for interrupts increase the problem though, as would
using FP more in the kernel.

>> BTW, David and I recently found a bug in the context switching in the
>> fxsr case, at least on Athlon-XP's and AMD64's.
>
> I gather this is not noticeable unless the application is doing its
> own FPU save/restore.  Is there a solution or work-around?

It's most noticeable for debugging, and if you worry about leaking
thread context.  Fortunately, the last-instruction pointers won't have
real user data in them unless the application encodes it there
intentionally.  I can't see any efficient solution or workaround.  The
kernel should do a full save/restore for processes being debugged.
For applications, the bug seems to be larger.  Even if they know about
the AMD behaviour and do a full save/restore because they need it, it
won't work because the kernel doesn't preserve the state across context
switches.  Applications like vmware might care more than most.

I forgot to mention that we couldn't find anything in Intel manuals
about this behaviour, so it might be completely AMD-specific.

Also, the instruction pointers are fundamentally broken for 64-bit CPUs:
although they are 64 bits, they have the segment selector encoded in
their top 32 bits, so they are not really different from the 32:32
selector:pointer format for the non-fxsr case.  Their format is
specified by SSE2, so 64-bit extensions would have to be elsewhere, but
amd64 doesn't seem to extend them.

Bruce
Re: Fwd: 5-STABLE kernel build with icc broken
On Wed, 30 Mar 2005, David Schultz wrote:

> On Wed, Mar 30, 2005, Peter Jeremy wrote:
>> On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote:
>>> Later in that thread they discuss skipping the restore state to make
>>> things faster.  The minimum buffer size they say this will be good
>>> for is between 2-4k.  Does this make sense, or am I showing my
>>> ignorance?
>>>
>>> http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html
>>
>> Yes.  There are a variety of options for saving/restoring FPU state:
>> a) save FPU state on kernel entry
>> b) save FPU state on a context switch (or if the kernel wants the FPU)
>> c) only save FPU state if a different process (or the kernel) wants
>>    the FPU
>> 1) restore FPU on kernel exit
>> 2) restore FPU state if a process wants the FPU.
>>
>> a and 1 are the most obvious - that's the way the integer registers
>> are handled.
>>
>> I thought FreeBSD used to be c2 but it seems it has been changed to
>> b2 since I looked last.

No, it always used b2.  I never got around to implementing c2.  Linux
used to implement c2 on i386's, but I think it switched (to b2?) to
optimize (or at least simplify) the SMP case.

>> Based on the mail above, it looks like Dfly was changed from 1 to 2
>> (I'm not sure if it's 'a' or 'c' on save).

'a' seems to be too inefficient to ever use.  '1' makes sense if it
rarely happens and/or the kernel can often use the FPU more than once
per entry (which it probably shouldn't), but it gives complications like
the ones for SMP, especially in FreeBSD where the kernel can be
preempted.

Saving FP state as needed is simplest but can be slow.  My Athlon-with-
SSE-extensions pagecopy and pagezero routines use the FPU (XMM) but
their FP state save isn't slow because only 1 or 2 XMM registers need to
be saved.  E.g., the saving part of sse_pagezero_for_some_athlons() is:

	pushfl			# Also have to save %eflags.
	cli			# Switch %eflags as needed to safely use FPU.
	movl	%cr0,%eax	# Also have to save %cr0.
	clts			# Switch %cr0 as needed to use FPU.
	subl	$16,%esp	# Space to save some FP state.
	movups	%xmm0,(%esp)	# Save some FP state.  Only this much needed.

>> On the i386 (and probably most other CPUs), you can place the FPU
>> into an "unavailable" state.  This means that any attempt to use it
>> will trigger a trap.  The kernel will then restore FPU state and
>> return.  On a normal system call, if the FPU hasn't been used, the
>> kernel will see that it's still in an "unavailable" state and can
>> avoid saving the state.  (On an i386, "unavailable" state is achieved
>> by either setting CR0_TS or CR0_EM.)  This means you avoid having to
>> always restore FPU state at the expense of an additional trap if the
>> process actually uses the FPU.

I remember that you (Peter) did extensive benchmarks of this.  I still
think fully lazy switching (c2) is the best general method.  Maybe FP
state should be loaded in advance based on FPU affinity.  It might be
good for CPU affinity to depend on FPU use (prefer not to switch threads
away from a CPU if they own that CPU via its FPU).

> This is basically what FreeBSD does on i386 and amd64.  (As a
> disclaimer, I haven't read the code very carefully, so I might be
> missing some of the details.)  Upon taking a trap for a process that
> has never used the FPU before, we save the FPU state for the last
> process to use the FPU, then load a fresh FPU state.

We don't save the FPU state for the last thread then (c2 behaviour)
since we have already saved it when we switched away from it.  npxdna()
panics if we haven't done that.  Except rev.1.131 added bogus code
(apparently to debug or hide bugs in the other changes in rev.1.131)
that breaks the panic in the fpcurthread == curthread case.

> On subsequent context switches, the FPU state for processes that have
> already used the FPU gets loaded before entering user mode, I think.
> I haven't studied the code in enough detail to know what

No, that doesn't happen.  Instead, cpu_switch() has called npxsave() on
the context switch away from the thread.
npxsave() arranges for a trap on the next use of the FPU, and we don't
do anything more with the FPU context of the thread until the thread
tries to use the FPU (in userland).  Then we take the trap and load the
saved context in npxdna().

> happens for SMP, where a process could be scheduled on a different
> processor before its FPU state is saved on the first processor.

There is no difference for SMP, but there would be large complicated
differences if we did fully lazy saving.  npxdna() would have to do
something like sending an IPI to the thread that owns the FPU if that
thread could be different from curthread.  This would be slow, but might
be worth doing if it didn't happen much and if fully lazy context
switching were a significant advantage.  I think it could be arranged to
not happen much, but the advantage is insignificant.

BTW, David and I recently found a bug in the context switching in the
fxsr case, at least on Athlon-XP's and AMD64's.
Re: Fwd: 5-STABLE kernel build with icc broken
On Thu, 2005-Mar-31 17:17:58 +1000, Bruce Evans wrote:
>>> On the i386 (and probably most other CPUs), you can place the FPU
>>> into an "unavailable" state.  This means that any attempt to use it
>>> will trigger a trap.  The kernel will then restore FPU state and
>>> return.  On a normal system call, if the FPU hasn't been used, the
>>> kernel will see that it's still in an "unavailable" state and can
>>> avoid saving the state.  (On an i386, "unavailable" state is
>>> achieved by either setting CR0_TS or CR0_EM.)  This means you avoid
>>> having to always restore FPU state at the expense of an additional
>>> trap if the process actually uses the FPU.
>
> I remember that you (Peter) did extensive benchmarks of this.

That was a long time ago and I don't recall them being that extensive.
I suspect the results are in my archives at work - I can't quickly find
them here.  From memory the tests were on 2.2 and just counted the
number of context switches, FP saves and restores.

> I still think fully lazy switching (c2) is the best general method.

I think it depends on the FP workload.  It's a definite win if there is
exactly one FP thread - in this case the FPU state never needs to be
saved (and you could even optimise away the DNA trap by clearing the TS
and EM bits if the switched-to curthread is fputhread).  The worst case
is two (or more) FP-intensive threads - in this case, lazy switching is
of no benefit.  The DNA trap overheads mean that the performance is
worse than just saving/restoring the FP state during a context switch.

My guess is that the current generation workstation is closer to the
second case - current generation graphical bloatware uses a lot of FP
for rendering, not to mention that the idle task has a reasonable chance
of being an FP-intensive distributed computing task (setiathome or
similar).  It's probably time to do some more measuring (I'm not
offering just now, I have lots of other things on my TODO list).

SMP adds a whole new can of worms.
(I originally suspected that lazy switching had been lost during the SMP
transition.)  Given CPU (FPU) affinity, you can just add "per CPU" to
the above but I'm not sure that changes my conclusion.

> Maybe FP state should be loaded in advance based on FPU affinity.

Pre-loading the FPU state is an advantage for FP-intensive threads - if
the thread will definitely use the FPU before the next context switch,
you save the cost of a DNA trap by pre-loading the FPU state.

> It might be good for CPU affinity to depend on FPU use (prefer not to
> switch threads away from a CPU if they own that CPU via its FPU).

FPU affinity is only an advantage if full lazy switching is implemented.
(And I thought we didn't even have CPU affinity working well.)  The
first step is probably gathering some data on whether lazy switching is
any benefit.

> BTW, David and I recently found a bug in the context switching in the
> fxsr case, at least on Athlon-XP's and AMD64's.

I gather this is not noticeable unless the application is doing its own
FPU save/restore.  Is there a solution or work-around?

-- 
Peter Jeremy
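Peter's two workloads (exactly one FP thread vs two FP-intensive threads) can be counted with a toy scheduler comparing policy b (save on every switch away from an FPU user) and policy c (lazy: save only when another FP user shows up). A Python illustration only — the DNA-trap cost itself is not modeled, just the number of state saves:

```python
def run(schedule, fp_users, policy):
    """schedule: thread IDs in run order; each FP user touches the FPU
    every time it runs.  Returns the number of FPU state saves."""
    owner, saves = None, 0
    for t in schedule:
        if policy == "b" and owner is not None:
            saves += 1            # b: save on every switch away
            owner = None
        if t in fp_users:
            if policy == "c" and owner not in (None, t):
                saves += 1        # c: save only when a new FP user appears
            owner = t             # DNA trap loads t's state; t owns the FPU
    return saves

sched = ["fp1", "x", "fp1", "x", "fp1"]       # one FP thread + a non-FP one
assert run(sched, {"fp1"}, "c") == 0          # lazy: state never saved
assert run(sched, {"fp1"}, "b") == 2          # saved on each switch away

sched2 = ["fp1", "fp2", "fp1", "fp2"]         # two FP-intensive threads
assert run(sched2, {"fp1", "fp2"}, "c") == 3  # lazy loses its benefit...
assert run(sched2, {"fp1", "fp2"}, "b") == 3  # ...matching b, plus trap cost
```

With two FP-intensive threads both policies save equally often, so the lazy scheme only adds DNA-trap overhead — which is exactly the worst case described above.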
Re: Fwd: 5-STABLE kernel build with icc broken
On Wed, Mar 30, 2005, Peter Jeremy wrote:
> On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote:
> >Later in that thread they discuss skipping the restore state to make
> >things faster.  The minimum buffer size they say this will be good
> >for is between 2-4k.  Does this make sense, or am I showing my
> >ignorance?
> >
> >http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html
>
> Yes.  There are a variety of options for saving/restoring FPU state:
> a) save FPU state on kernel entry
> b) save FPU state on a context switch (or if the kernel wants the FPU)
> c) only save FPU state if a different process (or the kernel) wants
>    the FPU
> 1) restore FPU on kernel exit
> 2) restore FPU state if a process wants the FPU.
>
> a and 1 are the most obvious - that's the way the integer registers
> are handled.
>
> I thought FreeBSD used to be c2 but it seems it has been changed to b2
> since I looked last.
>
> Based on the mail above, it looks like Dfly was changed from 1 to 2
> (I'm not sure if it's 'a' or 'c' on save).
>
> On the i386 (and probably most other CPUs), you can place the FPU into
> an "unavailable" state.  This means that any attempt to use it will
> trigger a trap.  The kernel will then restore FPU state and return.
> On a normal system call, if the FPU hasn't been used, the kernel will
> see that it's still in an "unavailable" state and can avoid saving the
> state.  (On an i386, "unavailable" state is achieved by either setting
> CR0_TS or CR0_EM.)  This means you avoid having to always restore FPU
> state at the expense of an additional trap if the process actually
> uses the FPU.

This is basically what FreeBSD does on i386 and amd64.  (As a
disclaimer, I haven't read the code very carefully, so I might be
missing some of the details.)  Upon taking a trap for a process that has
never used the FPU before, we save the FPU state for the last process to
use the FPU, then load a fresh FPU state.
On subsequent context switches, the FPU state for processes that have
already used the FPU gets loaded before entering user mode, I think. I
haven't studied the code in enough detail to know what happens for SMP,
where a process could be scheduled on a different processor before its
FPU state is saved on the first processor.
Re: Fwd: 5-STABLE kernel build with icc broken
On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote:
>Later in that thread they discuss skipping the restore state to make
>things faster. The minimum buffer size they say this will be good for
>is between 2-4k. Does this make sense, or am I showing my ignorance?
>
>http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html

Yes. There are a variety of options for saving/restoring FPU state:
a) save FPU state on kernel entry
b) save FPU state on a context switch (or if the kernel wants the FPU)
c) only save FPU state if a different process (or the kernel) wants the FPU
1) restore FPU state on kernel exit
2) restore FPU state if a process wants the FPU.

a and 1 are the most obvious - that's the way the integer registers are
handled.

I thought FreeBSD used to be c2 but it seems it has been changed to b2
since I looked last.

Based on the mail above, it looks like Dfly was changed from 1 to 2
(I'm not sure if it's 'a' or 'c' on save).

On the i386 (and probably most other CPUs), you can place the FPU into
an "unavailable" state. This means that any attempt to use it will
trigger a trap. The kernel will then restore FPU state and return.
On a normal system call, if the FPU hasn't been used, the kernel will
see that it's still in an "unavailable" state and can avoid saving the
state. (On an i386, "unavailable" state is achieved by either setting
CR0_TS or CR0_EM.) This means you avoid having to always restore FPU
state at the expense of an additional trap if the process actually
uses the FPU.

-- 
Peter Jeremy
Re: Fwd: 5-STABLE kernel build with icc broken
Peter Jeremy wrote:
> On Mon, 2005-Mar-28 23:23:19 -0800, David Leimbach wrote:
> >meant to send this to the list too... sorry
> >> Are you implying DragonFly uses FPU/SIMD? For that matter does any kernel?
> >
> >I believe it does use SIMD for some of its fast memcopy stuff for
> >its messaging system actually. I remember Matt saying he was working
> >on it.
> >
> >http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00262.html
>
> That's almost a year ago and specifically for the amd64. Does anyone
> know what the results were?
>
> >If you can manage the alignment issues it can be a huge win.
>
> For message passing within the kernel, you should be able to mandate
> alignment as part of the API.
>
> I see the bigger issue being the need to save/restore the SIMD
> engine's state during a system call. Currently, this is only saved
> if a different process wants to use the SIMD engine. For MMX, the
> SIMD state is the FPU state - which is non-trivial. The little
> reading I've done suggests that SSE and SSE2 are even larger.
>
> Saving the SIMD state would be more expensive than using integer
> registers for small (and probably medium-sized) copies.

Later in that thread they discuss skipping the restore state to make
things faster. The minimum buffer size they say this will be good for
is between 2-4k. Does this make sense, or am I showing my ignorance?

http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html
Re: Fwd: 5-STABLE kernel build with icc broken
On Tue, 29 Mar 2005 21:11:07 +1000, Peter Jeremy <[EMAIL PROTECTED]> wrote:
> On Mon, 2005-Mar-28 23:23:19 -0800, David Leimbach wrote:
> >meant to send this to the list too... sorry
> >> Are you implying DragonFly uses FPU/SIMD? For that matter does any kernel?
> >
> >I believe it does use SIMD for some of its fast memcopy stuff for
> >its messaging system actually. I remember Matt saying he was working
> >on it.
> >
> >http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00262.html
>
> That's almost a year ago and specifically for the amd64. Does anyone
> know what the results were?

Actually I don't remember precisely what came of it, but I do remember
that we had some interesting stability issues while Matt worked out some
bugs around that time; I think they were related to the SIMD stuff.

> >If you can manage the alignment issues it can be a huge win.
>
> For message passing within the kernel, you should be able to mandate
> alignment as part of the API.
>
> I see the bigger issue being the need to save/restore the SIMD
> engine's state during a system call. Currently, this is only saved
> if a different process wants to use the SIMD engine. For MMX, the
> SIMD state is the FPU state - which is non-trivial. The little
> reading I've done suggests that SSE and SSE2 are even larger.
>
> Saving the SIMD state would be more expensive than using integer
> registers for small (and probably medium-sized) copies.

Yes, you'd have to have a fairly smart copy to know when to avoid the
setup overhead. Apple's bcopy stuff does a lot of checking if I recall.
It's been a while since I've looked at that either. [the stuff that's
mapped into the COMM_PAGE of Mac OS X 10.3.x processes]

Dave

> --
> Peter Jeremy
Re: Fwd: 5-STABLE kernel build with icc broken
On Tue, Mar 29, 2005 at 02:12:53PM +0100, David Malone wrote:
> On Tue, Mar 29, 2005 at 09:11:07PM +1000, Peter Jeremy wrote:
> > That's almost a year ago and specifically for the amd64. Does anyone
> > know what the results were?
>
> I had a quick dig around on cvsweb this morning:
>
> http://grappa.unix-ag.uni-kl.de/cgi-bin/cvsweb/src/sys/i386/i386/bcopy.s?cvsroot=dragonfly
>
> I dunno if Matt has benchmarks for before and after.
>
> David.

I believe it was asserted on the list that the modification doubled the
performance. I have not tested this.

--Devon
Re: Fwd: 5-STABLE kernel build with icc broken
On Tue, Mar 29, 2005 at 09:11:07PM +1000, Peter Jeremy wrote:
> That's almost a year ago and specifically for the amd64. Does anyone
> know what the results were?

I had a quick dig around on cvsweb this morning:

http://grappa.unix-ag.uni-kl.de/cgi-bin/cvsweb/src/sys/i386/i386/bcopy.s?cvsroot=dragonfly

I dunno if Matt has benchmarks for before and after.

David.
Re: 5-STABLE kernel build with icc broken
jason henson <[EMAIL PROTECTED]> wrote:
>> Various:
>>  - auto-vectorizer (no benefit for the kernel, since we can't use
>>    FPU/SIMD instructions at any time... yet (interested hackers can
>>    have a look how DragonFly handles it, I can provide the necessary
>>    commit logs))
>
> Are you implying DragonFly uses FPU/SIMD? For that matter does any kernel?

AFAIK DragonFly _allows_ code to use the FPU/SIMD in the kernel. And
AFAIR they use SIMD in b{copy,zero} (we do this too, but we do this in a
"controlled environment" whereas DFly just allows the use of FPU/SIMD in
a "use as you like" manner everywhere).

Bye,
Alexander.

-- 
http://www.Leidinger.net  Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org    netchild @ FreeBSD.org   : PGP ID = 72077137
Closet extrovert.
Re: Fwd: 5-STABLE kernel build with icc broken
On Mon, 2005-Mar-28 23:23:19 -0800, David Leimbach wrote:
>meant to send this to the list too... sorry
>> Are you implying DragonFly uses FPU/SIMD? For that matter does any kernel?
>
>I believe it does use SIMD for some of its fast memcopy stuff for
>its messaging system actually. I remember Matt saying he was working
>on it.
>
>http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00262.html

That's almost a year ago and specifically for the amd64. Does anyone
know what the results were?

>If you can manage the alignment issues it can be a huge win.

For message passing within the kernel, you should be able to mandate
alignment as part of the API.

I see the bigger issue being the need to save/restore the SIMD engine's
state during a system call. Currently, this is only saved if a
different process wants to use the SIMD engine. For MMX, the SIMD state
is the FPU state - which is non-trivial. The little reading I've done
suggests that SSE and SSE2 are even larger.

Saving the SIMD state would be more expensive than using integer
registers for small (and probably medium-sized) copies.

-- 
Peter Jeremy
Fwd: 5-STABLE kernel build with icc broken
meant to send this to the list too... sorry

> Are you implying DragonFly uses FPU/SIMD? For that matter does any kernel?

I believe it does use SIMD for some of its fast memcopy stuff for its
messaging system actually. I remember Matt saying he was working on it.

http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00262.html

If you can manage the alignment issues it can be a huge win.

Dave

> > Thanks,
> > jason
>
> >  - optimizations for Intel CPUs direct from the manufacturer of the CPU
> >    (they have a lot of interest to produce very fast code)
> >  - a different set of compiler warnings
> >  - better code quality (if it is compilable by more than one compiler it
> >    may be more portable)
> >
> > Icc already pointed out some bad code (asm code in the IP checksumming
> > code... DragonFly changed it already), and the panic as noticed above
> > may also be an indication that we have some code in the tree which
> > smells bad.
> >
> > Bye,
> > Alexander.
Re: 5-STABLE kernel build with icc broken
Alexander Leidinger wrote:
> On Sun, 27 Mar 2005 05:40:44 -0800 Avleen Vig <[EMAIL PROTECTED]> wrote:
> > On Sun, Mar 27, 2005 at 01:30:59PM +0200, Alexander Leidinger wrote:
> > > > It seems to me that building kernel with icc is currently broken, at
> > > > least in 5-STABLE. Could somebody investigate this?
> > >
> > > I don't have a problem to compile it with a recent -current and a
> > > recent icc (-stable not tested), but the resulting kernel immediately
> > > panics (page fault in _mtx_...()).
> >
> > Without intending to start any compiler holy wars, what benefits does
> > ICC provide over GCC for the end user?
>
> Various:
>  - auto-vectorizer (no benefit for the kernel, since we can't use
>    FPU/SIMD instructions at any time... yet (interested hackers can
>    have a look how DragonFly handles it, I can provide the necessary
>    commit logs))

Are you implying DragonFly uses FPU/SIMD? For that matter does any kernel?

Thanks,
jason

>  - optimizations for Intel CPUs direct from the manufacturer of the CPU
>    (they have a lot of interest to produce very fast code)
>  - a different set of compiler warnings
>  - better code quality (if it is compilable by more than one compiler it
>    may be more portable)
>
> Icc already pointed out some bad code (asm code in the IP checksumming
> code... DragonFly changed it already), and the panic as noticed above
> may also be an indication that we have some code in the tree which
> smells bad.
>
> Bye,
> Alexander.
Re: 5-STABLE kernel build with icc broken
On Sun, 27 Mar 2005 05:40:44 -0800 Avleen Vig <[EMAIL PROTECTED]> wrote:
> On Sun, Mar 27, 2005 at 01:30:59PM +0200, Alexander Leidinger wrote:
> > > It seems to me that building kernel with icc is currently broken, at
> > > least in 5-STABLE. Could somebody investigate this?
> >
> > I don't have a problem to compile it with a recent -current and a
> > recent icc (-stable not tested), but the resulting kernel immediately
> > panics (page fault in _mtx_...()).
>
> Without intending to start any compiler holy wars, what benefits does
> ICC provide over GCC for the end user?

Various:
 - auto-vectorizer (no benefit for the kernel, since we can't use
   FPU/SIMD instructions at any time... yet (interested hackers can
   have a look how DragonFly handles it, I can provide the necessary
   commit logs))
 - optimizations for Intel CPUs direct from the manufacturer of the CPU
   (they have a lot of interest to produce very fast code)
 - a different set of compiler warnings
 - better code quality (if it is compilable by more than one compiler it
   may be more portable)

Icc already pointed out some bad code (asm code in the IP checksumming
code... DragonFly changed it already), and the panic as noticed above
may also be an indication that we have some code in the tree which
smells bad.

Bye,
Alexander.

-- 
The dark ages were caused by the Y1K problem.

http://www.Leidinger.net                Alexander @ Leidinger.net
GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7
Re: 5-STABLE kernel build with icc broken
On Sun, Mar 27, 2005, c0ldbyte wrote:
> On Sun, 27 Mar 2005 [EMAIL PROTECTED] wrote:
> >
> >> Without intending to start any compiler holy wars, what benefits does
> >> ICC provide over GCC for the end user?
> >
> > ICC would provide better low level code (remember: Intel C Compiler.
> > It would mean better performance).
> >
> > rookie
>
> If any, it still doesn't produce all that much of a difference in code
> compared to the newer gcc34, and as much performance difference as
> you're going to get isn't going to be noticeable in the long run.
> You're just setting yourself up for failure with something that isn't
> really going to give you the desired effects.

For some applications, particularly in scientific computing, icc is
significantly better. The FreeBSD kernel is not in this category,
however. Operating system kernels tend to spend most of their time
chasing pointers and copying data, and compilers can't really optimize
these operations.
Re: 5-STABLE kernel build with icc broken
c0ldbyte wrote:
> PS: There are coders from Intel that do work on some of the code for gcc34.

Wow. As far as I know, there are some coders from Nominum who do (or
did) work on bind9. And? Bind9 is at least 10 times slower on FreeBSD
than Nominum's CNS. :( I didn't get your point.

-- 
Attila Nagy                                e-mail: [EMAIL PROTECTED]
Free Software Network (FSN.HU)         phone @work: +361 371 3536
ISOs: http://www.fsn.hu/?f=download          cell.: +3630 306 6758
Re: 5-STABLE kernel build with icc broken
c0ldbyte wrote:
> If any, it still doesn't produce all that much of a difference in code
> compared to the newer gcc34, and as much performance difference as
> you're going to get isn't going to be noticeable in the long run.
> You're just setting yourself up for failure with something that isn't
> really going to give you the desired effects.

You don't have to use it, but it is good if you *can*. I guess besides
having a cleaner code base which compiles with more than exactly one
compiler, it is always good to have the ability to try something else
out.

BTW, on my humble Pentium II server I noticed significant speedups
compared to the system compiler. But that's purely empirical.

-- 
Attila Nagy                                e-mail: [EMAIL PROTECTED]
Free Software Network (FSN.HU)         phone @work: +361 371 3536
ISOs: http://www.fsn.hu/?f=download          cell.: +3630 306 6758
Re: 5-STABLE kernel build with icc broken
On Sun, 27 Mar 2005, c0ldbyte wrote:
> On Sun, 27 Mar 2005 [EMAIL PROTECTED] wrote:
> > > Without intending to start any compiler holy wars, what benefits does
> > > ICC provide over GCC for the end user?
> >
> > ICC would provide better low level code (remember: Intel C Compiler.
> > It would mean better performance).
> >
> > rookie
>
> If any, it still doesn't produce all that much of a difference in code
> compared to the newer gcc34, and as much performance difference as
> you're going to get isn't going to be noticeable in the long run.
> You're just setting yourself up for failure with something that isn't
> really going to give you the desired effects.

-- 
Best regards,
--c0ldbyte

PS: There are coders from Intel that do work on some of the code for gcc34.
Re: 5-STABLE kernel build with icc broken
On Sun, 27 Mar 2005 [EMAIL PROTECTED] wrote:
> > Without intending to start any compiler holy wars, what benefits does
> > ICC provide over GCC for the end user?
>
> ICC would provide better low level code (remember: Intel C Compiler.
> It would mean better performance).
>
> rookie

If any, it still doesn't produce all that much of a difference in code
compared to the newer gcc34, and as much performance difference as
you're going to get isn't going to be noticeable in the long run.
You're just setting yourself up for failure with something that isn't
really going to give you the desired effects.

-- 
Best regards,
--c0ldbyte
Re: 5-STABLE kernel build with icc broken
> Without intending to start any compiler holy wars, what benefits does
> ICC provide over GCC for the end user?

ICC would provide better low level code (remember: Intel C Compiler.
It would mean better performance).

rookie
Re: 5-STABLE kernel build with icc broken
On Sun, Mar 27, 2005 at 01:30:59PM +0200, Alexander Leidinger wrote:
> > It seems to me that building kernel with icc is currently broken, at
> > least in 5-STABLE. Could somebody investigate this?
>
> I don't have a problem to compile it with a recent -current and a
> recent icc (-stable not tested), but the resulting kernel immediately
> panics (page fault in _mtx_...()).

Without intending to start any compiler holy wars, what benefits does
ICC provide over GCC for the end user?
Re: 5-STABLE kernel build with icc broken
On Sat, 19 Mar 2005 13:06:29 +0100 Attila Nagy <[EMAIL PROTECTED]> wrote:
> It seems to me that building kernel with icc is currently broken, at
> least in 5-STABLE. Could somebody investigate this?

I don't have a problem to compile it with a recent -current and a recent
icc (-stable not tested), but the resulting kernel immediately panics
(page fault in _mtx_...()).

Bye,
Alexander.

-- 
It's not a bug, it's tradition!

http://www.Leidinger.net                Alexander @ Leidinger.net
GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7
5-STABLE kernel build with icc broken
Hello,

It seems to me that building kernel with icc is currently broken, at
least in 5-STABLE. Could somebody investigate this?

grep ^C /etc/make.conf
CC=icc
CXX=icpc

icc -V
Intel(R) C Compiler for 32-bit applications, Version 8.1 Build 20041118Z
Package ID: l_cc_pc_8.1.026
Copyright (C) 1985-2004 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY

make buildkernel KERNCONF=GENERIC
[...]
--------------------------------------------------------------
>>> stage 3.2: building everything
--------------------------------------------------------------
cd /usr/obj/usr/src/sys/GENERIC;  MAKEOBJDIRPREFIX=/usr/obj  MACHINE_ARCH=i386  MACHINE=i386  CPUTYPE=  GROFF_BIN_PATH=/usr/obj/usr/src/i386/legacy/usr/bin  GROFF_FONT_PATH=/usr/obj/usr/src/i386/legacy/usr/share/groff_font  GROFF_TMAC_PATH=/usr/obj/usr/src/i386/legacy/usr/share/tmac  _SHLIBDIRPREFIX=/usr/obj/usr/src/i386  INSTALL="sh /usr/src/tools/install.sh"  PATH=/usr/obj/usr/src/i386/legacy/usr/sbin:/usr/obj/usr/src/i386/legacy/usr/bin:/usr/obj/usr/src/i386/legacy/usr/games:/usr/obj/usr/src/i386/usr/sbin:/usr/obj/usr/src/i386/usr/bin:/usr/obj/usr/src/i386/usr/games:/sbin:/bin:/usr/sbin:/usr/bin  make KERNEL=kernel all -DNO_MODULES_OBJ
icc -c -x assembler-with-cpp -DLOCORE -O -X -I- -I. -I/usr/src/sys -I/usr/src/sys/contrib/dev/acpica -I/usr/src/sys/contrib/altq -I/usr/src/sys/contrib/ipfilter -I/usr/src/sys/contrib/pf -I/usr/src/sys/contrib/dev/ath -I/usr/src/sys/contrib/dev/ath/freebsd -I/usr/src/sys/contrib/ngatm -D_KERNEL -include opt_global.h -nolib_inline -restrict /usr/src/sys/i386/i386/locore.s
:4:1: warning: "__SIZE_TYPE__" redefined
:6:1: warning: this is the location of the previous definition
:5:1: warning: "__WCHAR_TYPE__" redefined
:8:1: warning: this is the location of the previous definition
:10:1: warning: "__GNUC__" redefined
:3:1: warning: this is the location of the previous definition
:11:1: warning: "__GNUC_MINOR__" redefined
:4:1: warning: this is the location of the previous definition
:12:1: warning: "__GNUC_PATCHLEVEL__" redefined
:5:1: warning: this is the location of the previous definition
:15:1: warning: "__GXX_ABI_VERSION" redefined
:10:1: warning: this is the location of the previous definition
/tmp/iccbinuMzQeKs: Assembler messages:
/tmp/iccbinuMzQeKs:491: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:491: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:499: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:500: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:500: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:528: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:529: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:529: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:532: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:532: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:537: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:537: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:542: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:542: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:547: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:547: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:553: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:559: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:563: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:574: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:581: Error: suffix or operands invalid for `shr'
/tmp/iccbinuMzQeKs:583: Error: suffix or operands invalid for `shl'
/tmp/iccbinuMzQeKs:596: Error: suffix or operands invalid for `shl'
*** Error code 1

Stop in /usr/obj/usr/src/sys/GENERIC.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.

Thank you!
-- 
Attila Nagy                                e-mail: [EMAIL PROTECTED]
Adopt a directory on our free software  phone @work: +361 371 3536
server! http://www.fsn.hu/?f=brick            cell.: +3630 306 6758