Re: LOCK overheads (was Re: objtrm problem probably found)
A bit late, but some more data points.

90MHz Pentium, FreeBSD 2.2.7:

mode 0     60.80 ns/loop  nproc=1  lcks=EMPTY
mode 1     91.13 ns/loop  nproc=1  lcks=no
mode 2     91.11 ns/loop  nproc=2  lcks=no
mode 3    242.59 ns/loop  nproc=1  lcks=yes
mode 4    242.69 ns/loop  nproc=2  lcks=yes
mode 5    586.27 ns/loop  nproc=1  lcks=no
mode 6    586.91 ns/loop  nproc=2  lcks=no
mode 7    749.28 ns/loop  nproc=1  lcks=yes
mode 8    746.70 ns/loop  nproc=2  lcks=yes
mode 9    181.96 ns/loop  nproc=1  lcks=EMPTY
mode 10   242.56 ns/loop  nproc=1  lcks=no
mode 11   242.69 ns/loop  nproc=2  lcks=no
mode 12   343.80 ns/loop  nproc=1  lcks=yes
mode 13   343.77 ns/loop  nproc=2  lcks=yes
mode 14   727.79 ns/loop  nproc=1  lcks=no
mode 15   729.95 ns/loop  nproc=2  lcks=no
mode 16   850.10 ns/loop  nproc=1  lcks=yes
mode 17   848.02 ns/loop  nproc=2  lcks=yes

200MHz Pentium Pro, -current, same binary as above:

mode 0     42.76 ns/loop  nproc=1  lcks=EMPTY
mode 1     32.01 ns/loop  nproc=1  lcks=no
mode 2     33.30 ns/loop  nproc=2  lcks=no
mode 3    191.30 ns/loop  nproc=1  lcks=yes
mode 4    191.62 ns/loop  nproc=2  lcks=yes
mode 5     93.12 ns/loop  nproc=1  lcks=no
mode 6     94.54 ns/loop  nproc=2  lcks=no
mode 7    195.16 ns/loop  nproc=1  lcks=yes
mode 8    200.91 ns/loop  nproc=2  lcks=yes
mode 9     65.83 ns/loop  nproc=1  lcks=EMPTY
mode 10    90.32 ns/loop  nproc=1  lcks=no
mode 11    90.33 ns/loop  nproc=2  lcks=no
mode 12   236.61 ns/loop  nproc=1  lcks=yes
mode 13   236.70 ns/loop  nproc=2  lcks=yes
mode 14   120.83 ns/loop  nproc=1  lcks=no
mode 15   122.12 ns/loop  nproc=2  lcks=no
mode 16   276.92 ns/loop  nproc=1  lcks=yes
mode 17   277.19 ns/loop  nproc=2  lcks=yes

200MHz Pentium Pro, -current, compiled with -current compiler:

mode 0     35.30 ns/loop  nproc=1  lcks=EMPTY
mode 1     22.13 ns/loop  nproc=1  lcks=no
mode 2     22.31 ns/loop  nproc=2  lcks=no
mode 3    186.26 ns/loop  nproc=1  lcks=yes
mode 4    186.39 ns/loop  nproc=2  lcks=yes
mode 5     75.61 ns/loop  nproc=1  lcks=no
mode 6     78.52 ns/loop  nproc=2  lcks=no
mode 7    191.46 ns/loop  nproc=1  lcks=yes
mode 8    191.65 ns/loop  nproc=2  lcks=yes
mode 9     69.34 ns/loop  nproc=1  lcks=EMPTY
mode 10    86.68 ns/loop  nproc=1  lcks=no
mode 11    86.49 ns/loop  nproc=2  lcks=no
mode 12   237.49 ns/loop  nproc=1  lcks=yes
mode 13   236.67 ns/loop  nproc=2  lcks=yes
mode 14   134.96 ns/loop  nproc=1  lcks=no
mode 15   134.99 ns/loop  nproc=2  lcks=no
mode 16   276.90 ns/loop  nproc=1  lcks=yes
mode 17   277.33 ns/loop  nproc=2  lcks=yes

Not exactly sure what all this means, but whatever mode 17 is, it can sure be expensive... of course this is a UP machine...

julian

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
Re: objtrm problem probably found (was Re: Stuck in objtrm)
Before this thread on "cache coherence" and "memory consistency" goes any further, I'd like to suggest a time-out to read something like http://www-ece.rice.edu/~sarita/Publications/models_tutorial.ps. A lot of what I'm reading has a grain of truth but isn't quite right. This paper appeared as a tutorial in IEEE Computer a couple years ago, and is fairly accessible to a general audience.

Alan
Re: objtrm problem probably found (was Re: Stuck in objtrm)
On Mon, Jul 12, 1999 at 10:38:03PM -0700, Mike Smith wrote:

  I said:

    than indirect function calls on some architectures: inline branched
    code. So you still have a global variable selecting locked/non-locked,
    but it's a boolean, rather than a pointer. Your atomic macros are then
    { if (atomic_lock) asm("lock;foo"); else asm ("foo"); }

  This requires you to have all the methods present at compile time,
  which defeats the entire purpose of dynamic method loading.

  Pardon? I didn't see a discussion of dynamic loading anywhere here.
  We were referring to tiny inlined assembly language routines. The
  existing implementation is #defines in a C header file.

Uh, no, I was asking about the overhead involved in indirect function calls specifically because there are instances where fast indirect methods would allow us to greatly improve the generality of parts of the FreeBSD kernel and modules. This includes code where we are currently using compile-time defines that require us to trade performance for generality.

-- 
\\ The mind's the standard  \\ Mike Smith
\\ of the man.              \\ [EMAIL PROTECTED]
\\   -- Joseph Merrick      \\ [EMAIL PROTECTED]
Re: LOCK overheads (was Re: objtrm problem probably found)
According to Matthew Dillon:

  Wow, now that *is* expensive! The K6 must be implementing it in
  microcode for it to be that bad.

K6-200:

244 [21:57] roberto@keltia:src/C ./locktest 0
...
empty           26.84 ns/loop
1proc           22.62 ns/loop
2proc           22.64 ns/loop
empty w/locks   17.58 ns/loop
1proc w/locks  288.28 ns/loop
2proc w/locks  288.16 ns/loop

It hurts :(
-- 
Ollivier ROBERT -=- FreeBSD: The Power to Serve! -=- [EMAIL PROTECTED]
FreeBSD keltia.freenix.fr 4.0-CURRENT #72: Mon Jul 12 08:26:43 CEST 1999
Re: LOCK overheads (was Re: objtrm problem probably found)
Matthew Dillon [EMAIL PROTECTED] wrote:

  :mode 1 17.99 ns/loop nproc=1 lcks=no
  :mode 3 166.33 ns/loop nproc=1 lcks=yes
  ...
  :This is a K6-2 350. Locks are pretty expensive on them.

  Wow, now that *is* expensive! The K6 must be implementing it in
  microcode for it to be that bad.

I wouldn't be surprised if lock prefixes did result in microcode execution. As I stated yesterday, I don't believe locked instructions are executed frequently enough to warrant special handling, and are therefore likely to be implemented in whichever way needs the least chip area. Since you need to be able to track and mark the memory references associated with the instruction, the cheapest implementation (in terms of dedicated chip area) is likely to be something like:

  1) wait until all currently executing instructions are complete;
  2) wait until all pending memory writes are complete (at least to the
     L1 cache);
  3) assert the lock pin and execute the RMW instruction without
     allowing any other instructions to commence;
  4) deassert the lock pin.

This is (of course) probably the worst case as far as execution time as seen by that CPU - though it's not far off optimal as seen by other CPUs. (`Assert lock pin' should also be mapped into a `begin locked memory reference' using whatever cache coherency protocol is being used.)

I'm not saying that you can't implement a locked RMW sequence a lot better, but until the chip architects decide that the performance is an issue, they aren't likely to spend any silicon on it. The big IA-32 market is UP systems running games - and locked RMW instructions don't affect this market. Intel see the high end of the market (where SMP, and hence locked RMW, is more of an issue) moving to Merced. This suggests that it's unlikely that the IA-32 will ever acquire a decent lock capability (though at least the PIII is no worse than the PII).

That said, the above timings make a lock prefix cost over 50 core clocks (or 15 bus clocks) - even microcode couldn't be that bad.
My other timings (core/bus cycles) were: 486DX2: 20/10, Pentium: 28/7, P-II: 34/8.5, P-III: 34/7.5. I suspect that these timings are a combination of inefficient on-chip implementation of the lock prefix (see above for my reasoning behind this), together with poor off-chip handling of locked cycles.

Peter
Re: objtrm problem probably found (was Re: Stuck in objtrm)
Doug Rabson wrote:

  On Mon, 12 Jul 1999, Peter Jeremy wrote:

    Mike Haertel [EMAIL PROTECTED] wrote:

      Um. FYI on x86, even if the compiler generates the RMW form
      "addl $1, foo", it's not atomic. If you want it to be atomic you
      have to precede the opcode with a LOCK prefix 0xF0.

    I'd noticed that point as well. The top of sys/i386/include/atomic.h
    _does_ make it clear that they aren't SMP safe:

      /*
       * Various simple arithmetic on memory which is atomic in the
       * presence of interrupts.
       *
       * Note: these versions are not SMP safe.
       */

    That said, it should be fairly simple to change Matt's new in-line
    assembler versions to insert LOCK prefixes when building an SMP
    kernel. (Although I don't know that this is necessary yet, given
    the `Big Giant Lock'.)

  We don't need the lock prefix for the current SMP implementation. A
  lock prefix would be needed in a multithreaded implementation, but it
  should not be added unless the kernel is an SMP kernel; otherwise UP
  performance would suffer.

I was under the impression that a locked instruction was essentially free at runtime, with the sole exception of being one byte larger. If there is any chance of this stuff getting exported to modules via inlines, it should support both, as we don't need any more differences between SMP and non-SMP modules than we've already got. :-(

Also, with the posted example, atomic_cmpex() is #ifndef I386_CPU - that means it won't be available on any of the default kernels, either at install time or for 99% of post-install systems built from a release.

Cheers,
-Peter
--
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Re: objtrm problem probably found
Doug Rabson [EMAIL PROTECTED] wrote:

  We don't need the lock prefix for the current SMP implementation. A
  lock prefix would be needed in a multithreaded implementation but
  should not be added unless the kernel is an SMP kernel, otherwise UP
  performance would suffer.

Modulo the issue of UP vs SMP modules, the code would seem to be simply:

#ifdef SMP
#define ATOMIC_ASM(type,op) \
	__asm __volatile ("lock; " op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))
#else
#define ATOMIC_ASM(type,op) \
	__asm __volatile (op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))
#endif

Or (maybe more clearly):

#ifdef SMP
#define SMP_LOCK	"lock; "
#else
#define SMP_LOCK
#endif

#define ATOMIC_ASM(type,op) \
	__asm __volatile (SMP_LOCK op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))

John-Mark Gurney [EMAIL PROTECTED] wrote:

  actually, I'm not so sure, it guarantees that NO other bus operation
  will succeed while this is happening... what happens if a pci bus
  mastering card makes a modification to this value?

This is a valid point, but I don't think it's directly related to this thread. I think it's up to the various PCI device driver writers to ensure that objects shared between a PCI device and driver are correctly locked. The mechanism to do this is likely to be device-specific: lock prefixes only protect objects no larger than 32 or 64 bits (depending on the processor); cards may require locked accesses to much larger structures. I believe the API to the PCI-locking routines should be distinct from the API for SMP locks - even though the underlying implementation may be common.

Oliver Fromme [EMAIL PROTECTED] wrote:

  In my list of i386 clock cycles, the lock prefix is listed with 0
  (zero) cycles.

My i486 book states 1 cycle, although that cycle can be overlapped with several other combinations that add a cycle to the basic instruction execution time. I don't know about the Pentium and beyond timings. In any case, we have real-world timings, which are more useful.
Mike Haertel [EMAIL PROTECTED] wrote:

  Although function calls are more expensive than inline code, they
  aren't necessarily a lot more so, and function calls to non-locked
  RMW operations are certainly much cheaper than inline locked RMW
  operations.

Based on my timings below, this is correct, though counter-intuitive. Given the substantial cost of indirect function calls, I don't think this would be acceptable, though. I think compiling modules separately for UP/SMP is a better choice.

In Message-id: 19990 [EMAIL PROTECTED], Matthew Dillon [EMAIL PROTECTED] provided some hard figures for a dual PIII-450. Expanding those figures for a range of machines (all are UP except the PIII-450, which are Matt's SMP figures), and adding the cost of using indirect function calls (patches to Matt's code at the end):

         i386SX-25    P-133  PII-266  PIII-450  nproc  locks
mode 0     1950.23    39.65    26.31      9.21         EMPTY
mode 1     3340.59    71.74    24.45     16.48    1    no    tight
mode 2     3237.57    71.18    25.27     23.65    2    no    tight
mode 3     3367.65   282.31   153.29     93.02    1    yes   tight
mode 4     3263.64   285.58   152.94    160.82    2    yes   tight
mode 5     9439.15   446.16    60.40     37.64    1    no    spread
mode 6    10231.96   467.39    60.16     89.28    2    no    spread
mode 7    10660.05   725.80   153.18     88.32    1    yes   spread
mode 8     9990.18   755.87   155.18    161.08    2    yes   spread
mode 9     5544.82   131.31    49.96        ?          EMPTY
mode 10    7234.97   174.20    64.81        ?     1    no    tight
mode 11    7212.14   178.72    64.87        ?     2    no    tight
mode 12    7355.46   304.74   182.75        ?     1    yes   tight
mode 13    6956.54   327.11   180.21        ?     2    yes   tight
mode 14   13603.72   582.02   100.10        ?     1    no    spread
mode 15   13443.54   543.97   101.13        ?     2    no    spread
mode 16   13731.94   717.31   207.12        ?     1    yes   spread
mode 17   13379.62   800.31   207.70        ?     2    yes   spread

Modes 9 through 17 are equivalent to modes 0-8, except that the operation is performed via a call through a pointer-to-function. (Mode 9 is a pointer to a nop.)

Apart from the noticeable improvement in overall speed from left to right, this shows that the lock prefix is _very_ expensive on Pentium and above, even in a UP configuration. It makes no difference on a 386. (I can produce the 486 figures tonight after I get home.) It also suggests that moving to an indirect function call (to allow run-time UP/SMP selection) will be quite painful.

  As you can see, the lock prefix creates a stall condition on the
  locked memory, but does *NOT* stall other memory.

This is at least CPU dependent (and may also depend on the motherboard chipset). The i486 states that `all memory is locked'.

  Therefore I believe the impact will be unnoticeable. On a dual
  450MHz P-III we are talking 37 ns vs 88 ns -
Re: objtrm problem probably found
:Or (maybe more clearly):
:
:#ifdef SMP
:#define SMP_LOCK	"lock; "
:#else
:#define SMP_LOCK
:#endif
:
:#define ATOMIC_ASM(type,op) \
:	__asm __volatile (SMP_LOCK op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))

    Yes, precisely.

:I believe the API to the PCI-locking routines should be distinct from
:the API for SMP locks - even though the underlying implementation may
:be common.

    This is more a driver issue than anything else.

:Based on the impact above, I believe the lock prefixes should not
:be inserted until they are necessary - even if it does mean we wind
:up with /modules and /modules.smp. I don't believe that moving
:to indirect function pointers is a reasonable work-around.
:
:Peter

    We can certainly remove it for the UP case, but it needs to stay in for the SMP case, because there is a 100% chance that these routines are going to have to be SMP safe when the big giant lock hits up against the VM system. It will become even more important after interrupts are moved into their own threads, if we want to be able to run interrupts in parallel with other supervisor code (e.g. network interrupt and network stack).

					-Matt
					Matthew Dillon
					[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
  Although function calls are more expensive than inline code, they
  aren't necessarily a lot more so, and function calls to non-locked
  RMW operations are certainly much cheaper than inline locked RMW
  operations.

This is a fairly key statement in context, and an opinion here would count for a lot; are function calls likely to become more or less expensive in time?

-- 
\\ The mind's the standard  \\ Mike Smith
\\ of the man.              \\ [EMAIL PROTECTED]
\\   -- Joseph Merrick      \\ [EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:
: Although function calls are more expensive than inline code,
: they aren't necessarily a lot more so, and function calls to
: non-locked RMW operations are certainly much cheaper than
: inline locked RMW operations.
:
:This is a fairly key statement in context, and an opinion here would
:count for a lot; are function calls likely to become more or less
:expensive in time?

    In terms of cycles, either less or the same. Certainly not more.

    If you think about it, a function call is nothing more than a save, a jump, and a retrieval and jump later on. On Intel the save is a push; on other CPUs the save may be to a register (which is pushed later if necessary). The change in code flow used to be the expensive piece, but not any more. You typically either see a branch prediction cache (Intel) offering a best-case of 0-cycle latency, or a single-cycle latency that is slot-fillable (MIPS).

    Since the jump portion of a subroutine call to a direct label is nothing more than a deterministic branch, the branch prediction cache actually operates in this case. You do not quite get 0-cycle latency due to the push/pop and potential arguments, but it is very fast.

					-Matt
					Matthew Dillon
					[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:
: Although function calls are more expensive than inline code,
: they aren't necessarily a lot more so, and function calls to
: non-locked RMW operations are certainly much cheaper than
: inline locked RMW operations.
:
:This is a fairly key statement in context, and an opinion here would
:count for a lot; are function calls likely to become more or less
:expensive in time?

...

  The change in code flow used to be the expensive piece, but not any
  more. You typically either see a branch prediction cache (Intel)
  offering a best-case of 0-cycle latency, or a single-cycle latency
  that is slot-fillable (MIPS).

I assumed too much in asking the question; I was specifically interested in indirect function calls, since this has a direct impact on method-style implementations.

-- 
\\ The mind's the standard  \\ Mike Smith
\\ of the man.              \\ [EMAIL PROTECTED]
\\   -- Joseph Merrick      \\ [EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:I assumed too much in asking the question; I was specifically
:interested in indirect function calls, since this has a direct impact
:on method-style implementations.

    Branch prediction caches are typically PC-sensitive. An indirect method call will never be as fast as a direct call, but if the indirect address is the same, the branch prediction cache will work. If the indirect address changes at the PC where the call is being made, the branch cache may create a penalty.

    Try this code in one of the cases in that test program, and add two nop subroutines: void nop1(void) { } and void nop2(void) { }. Compile this code without any optimizations! *no* optimizations or the test will not demonstrate the problem :-)

    In this case the branch prediction succeeds because the indirect address does not change at the PC where func() is called. I get 34 ns per loop.

	{
	    void (*func)(void) = nop1;

	    for (i = 0; i < LOOPS; ++i) {
		func();
		if (i & 1)
		    func = nop1;
		else
		    func = nop1;
	    }
	}

    In this case the branch prediction fails because the indirect address is different at the PC each time func() is called. I get 61 ns.

	{
	    void (*func)(void) = nop1;

	    for (i = 0; i < LOOPS; ++i) {
		func();
		if (i & 1)
		    func = nop1;
		else
		    func = nop2;
	    }
	}

    In this case we simulate a mix: (i & 1) becomes (i & 7). I get 47 ns.

	{
	    void (*func)(void) = nop1;

	    for (i = 0; i < LOOPS; ++i) {
		func();
		if (i & 7)
		    func = nop1;
		else
		    func = nop2;
	    }
	}

    Ok, so what does this mean for method calls? If the method call is INLINED, then the branch prediction cache will tend to work because the method call will tend to call the same address at any given PC. If the method call is doubly-indirect, where a routine is called which calculates the method address and then calls it, the branch prediction cache will tend to fail because a different address will tend to be called at the PC of the call.

					-Matt
					Matthew Dillon
					[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
Mike Smith [EMAIL PROTECTED] wrote:

  Although function calls are more expensive than inline code, they
  aren't necessarily a lot more so, and function calls to non-locked
  RMW operations are certainly much cheaper than inline locked RMW
  operations.

  This is a fairly key statement in context, and an opinion here would
  count for a lot; are function calls likely to become more or less
  expensive in time?

Based on general computer architecture principles, I'd say that a lock prefix is likely to become more expensive[1], whilst a function call will become cheaper[2] over time.

I'm not sure that this is an important issue here. The sole advantage of moving to indirect function calls would be that the same object code could be used on both UP and SMP configurations, without incurring the overhead of the lock prefix in the UP configuration (at the expense of an additional function call in all configurations). We can't avoid the lock prefix overhead in the SMP case.

Based on the timings I did this morning, function calls are (unacceptably, IMHO) expensive on all the CPU's I have to hand (i386, Pentium and P-II) - the latter two presumably comprising the bulk of current FreeBSD use. Currently the UP/SMP decision is made at compile time (and has significant and widespread impact) - therefore there seems little (if any) benefit in using function calls within the main kernel.

I believe that Matt's patched i386/include/atomic.h, with the addition of code to only include the lock prefix when SMP is defined, is currently the optimal approach for the kernel - and I can't see any way a future IA-32 implementation could change that.

The only benefit could be for kernel modules - a module could possibly be compiled so the same LKM would run on either UP or SMP. Note that function calls for atomic operations may not be sufficient (by themselves) to achieve this: one of the SMP gurus may be able to confirm whether anything else prevents an SMP-compiled LKM running with a UP kernel.

If the lock prefix overhead becomes an issue for LKMs, then we could define a variant of i386/include/atomic.h (eg by using a #define which is only true for compiling LKMs) which does use indirect function calls (and add the appropriate initialisation code). This is a trivial exercise (which I'll demonstrate on request).

[1] A locked instruction implies a synchronous RMW cycle. In order to meet write-ordering guarantees (without which a locked RMW cycle would be useless as a semaphore primitive), it implies a complete write serialization, and probably some level of instruction serialisation. Since write-back pipelines will get longer and parallel execution units more numerous, the cost of a serialisation operation will get relatively higher. Also, lock instructions are relatively infrequent, therefore there is little incentive to expend valuable silicon on trying to make them more efficient (at least as seen by the executing CPU).

[2] Function calls _are_ fairly common, therefore it probably is worthwhile expending some effort in optimising them - and the stack updates associated with a leaf subroutine are fairly easy to totally hide in an on-chip write pipeline/cache.

Peter
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:
:I'm not sure there's any reason why you shouldn't. If you changed the
:semantics of a stack segment so that memory addresses below the stack
:pointer were irrelevant, you could implement a small, 0-cycle, on-chip
:stack (that overflowed into memory). I don't know whether this
:semantic change would be allowable (and whether the associated silicon
:could be justified) for the IA-32.
:
:Peter

    This would be relatively complex and also results in cache coherency problems. A solution already exists: it's called branch-and-link, but Intel CPUs do not use it because Intel CPUs do not have enough registers (makes you just want to throw up -- all that MMX junk and they couldn't add a branch-and-link register!). The key with branch-and-link is that the lowest subroutine level does not have to save/restore the register, making entry and return two or three times faster than subroutine calls that make other subroutine calls.

    The big problem with implementing complex caches is that they take up a serious amount of die space and power. Most modern CPUs revolve almost entirely around their L1 cache and their register file. The remaining caches tend to be ad hoc. Intel's branch prediction cache is like this. In order for a memory-prediction cache to be useful, it really needs to be cache-coherent, which basically kills the idea of having a separate little special case for the stack. Only the L1 cache is coherent. If you wanted, you could implement multiple L1 data caches on-chip - that might be of some benefit, but otherwise branch-and-link is the better way to do it.

					-Matt
					Matthew Dillon
					[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:
:Based on general computer architecture principles, I'd say that a lock
:prefix is likely to become more expensive[1], whilst a function call
:will become cheaper[2] over time.
:...
:
:[1] A locked instruction implies a synchronous RMW cycle. In order
:to meet write-ordering guarantees (without which, a locked RMW
:cycle would be useless as a semaphore primitive), it implies a
:complete write serialization, and probably some level of
:instruction serialisation. Since write-back pipelines will get

    A locked instruction only implies cache coherency across the instruction. It does not imply serialization. Intel blows it big time, but that's Intel for you.

:longer and parallel execution units more numerous, the cost of
:a serialisation operation will get relatively higher. Also,
:
:Peter

    It is not a large number of execution units that implies a higher cost of serialization, but instead data interdependencies. A locked instruction does not have to imply serialization. Modern cache coherency protocols do not have a problem with a large number of caches in a parallel processing subsystem.

    This is how a typical two-level cache coherency protocol works with an L1 and L2 cache:

    * The L2 cache is usually a superset of the L1 cache. That is, all
      cache lines in the L1 cache also exist in the L2 cache.

    * Both the L1 and L2 caches implement a shared and an exclusive
      bit, usually on a per-cache-line basis.

    * When a processor, A, issues a memory op that is not in its cache,
      it can request the cache line from main memory either shared
      (unlocked) or exclusive (locked).

    * All other processors snoop the main memory access.

    * If another processor B's L2 cache contains the cache line being
      requested by processor A, it can provide the cache line to
      processor A and no main memory op actually occurs. In this case,
      the shared and exclusive bits in processor B's L1 and L2 caches
      are updated appropriately.

      - if A is requesting shared (unlocked) access, both A and B will
	set the shared bit and B will clear the exclusive bit. (B will
	cause A to stall if it is in the middle of operating on the
	locked cache line).

      - if A is requesting exclusive (locked) access, B will invalidate
	its cache line and clear both the shared and exclusive bits,
	and A will set its exclusive bit. A now owns that cache line.

    * If no other processor owns the cache line, A obtains the data
      from main memory and other processors have the option to snoop
      the access in their L2 caches.

    So, using the above rules as an example, a locked instruction can cost as little as 0 extra cycles no matter how many CPUs you have running in parallel. There is no need to serialize or synchronize anything. The worst case is actually not even as bad as a complete cache-miss case: the cost of snooping another CPU's L2 cache is much less than the cost of going to main memory.

					-Matt
					Matthew Dillon
					[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
On Mon, Jul 12, 1999 at 07:09:58PM -0700, Mike Smith wrote:

  Although function calls are more expensive than inline code, they
  aren't necessarily a lot more so, and function calls to non-locked
  RMW operations are certainly much cheaper than inline locked RMW
  operations.

  This is a fairly key statement in context, and an opinion here would
  count for a lot; are function calls likely to become more or less
  expensive in time?

Others have answered this question, but I thought I'd point out that there is another alternative that is likely to be faster than indirect function calls on some architectures: inline branched code. So you still have a global variable selecting locked/non-locked, but it's a boolean, rather than a pointer. Your atomic macros are then:

	{ if (atomic_lock) asm("lock;foo"); else asm ("foo"); }

You might be interested in the paper:

  "Efficient Dynamic Dispatch without Virtual Function Tables. The
  SmallEiffel Compiler." Olivier ZENDRA, Dominique COLNET, Suzanne
  COLLIN. 12th Annual ACM SIGPLAN Conference on Object-Oriented
  Programming, Systems, Languages, and Applications (OOPSLA'97),
  Volume 32, Issue 10 - Atlanta, GA, USA, October 1997, pages 125-141.
  http://SmallEiffel.loria.fr/papers/papers.html#OOPSLA97

as a justification for that approach.

-- 
Andrew
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:...

    I would also like to add a few more notes in regards to write pipelines. Write pipelines are not used any more, at least not long ones. The reason is simply the cache coherency issue again. Until the data is actually written into the L1 cache, it is acoherent. Acoherency is no longer allowed :-). That kind of creates a problem for the old write pipeline. Thus, modern CPUs have begun removing their write pipelines in favor of a synchronous write to the L1 cache. This typically occurs in the "write-back" phase of the instruction pipeline. It does not cause the processor to stall.

    There are several advantages to this:

    * First, since the write is going directly into the L1 cache, it
      can partake of the cache coherency protocols.

    * Second, since the L1 cache is large compared to the older
      write-pipeline schema, this effectively gives us a (for all
      intents and purposes other than a memory copy) infinitely-sized
      write pipeline.

    * Third, we can throw away all the write pipeline address snooping
      junk on the chip, making the chip smaller.

    The cache coherency protocol may cause side effects when a memory write is issued. Typically the write is issued into the L1 cache and requires what is known as a write-through operation into the L2 cache in order to guarantee cache coherency. This is necessary because it is the L2 cache which is performing the bulk of the bus protocol to implement the cache coherency, and it needs to know when you've modified something.

    "Write through" *used* to mean writing through to main memory, but with the advent of modern cache coherency protocols, the L2 cache *IS* main memory for all intents and purposes. Dirty data contained in the L2 cache inherently invalidates the associated memory address in main memory simply by being present in the L2 cache. Another processor attempting to access that location in main memory will instead obtain the data directly from our processor's L2 cache. One of the interesting benefits of this is that dirty data in the L2 cache can be flushed to main memory completely asynchronously without screwing up cache coherency, serialization, or anything else.

    Final note: memory-mapped I/O spaces do not apply here. These are usually marked uncacheable and have nothing to do with the cache. But DMA is different. DMA works just like any other memory access in that it can be made to run through the L2 cache coherency protocols. So you can often safely initiate DMA without having to flush your caches. This is not true of all processors, though; I'm not sure about the Pentium's.

					-Matt
					Matthew Dillon
					[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
  Second answer: in the real world, we're nearly always hitting the
  cache on stack operations associated with calls and argument passing,
  but not less often on operations in the procedure body. So, in
          ^^^ typo

Urk. I meant to say "less often"; delete the "not".
Re: objtrm problem probably found (was Re: Stuck in objtrm)
  This is a fairly key statement in context, and an opinion here would
  count for a lot; are function calls likely to become more or less
  expensive in time?

Ambiguous question.

First answer: assume we're hitting the cache, taking no branch mispredicts, and everything is generally going at "the speed of light". The cost of a call, relative to the cost of "basic operations" like loads and adds, will stay about the same in Intel's next-generation x86 processors as it is now. We made some tradeoffs, some things go a little faster and others a little slower, but in the end the function call overhead averages out to about the same.

Second answer: in the real world, we're nearly always hitting the cache on stack operations associated with calls and argument passing, but not less often on operations in the procedure body. So, in the real world, as memory latencies get worse and worse relative to processor frequency, the overhead for procedure calling represents an increasingly smaller percentage of the total execution time.

Summary: so, correctly predicted calls are staying the same or getting cheaper. Certainly calls are not getting worse, unless perhaps you have lots of mispredicted calls. As pipelines get deeper, mispredicts get worse. But that's not a problem just for calls, it's branches in general.