Re: [PATCH?] Separate pthread patches, #2 take 2.
Corinna Vinschen wrote: On Jun 4 02:52, Dave Korn wrote: Dave Korn wrote: Dave Korn wrote: The attached patch implements ilockexch and ilockcmpexch, using the inline asm definition from __arch_compare_and_exchange_val_32_acq in glibc-2.10.1/sysdeps/i386/i486/bits/atomic.h, trivially expanded inline rather than in its original preprocessor macro form. It generates incorrect code. This much looks like it's probably a compiler bug. Let's see whether anyone else agrees: http://gcc.gnu.org/ml/gcc/2009-06/msg00053.html When you checked in this change, I'll create a 1.7.0-49 test release. This is the final version I committed. It is the glibc version, with the addition of the memory clobber, which that discussion revealed was absolutely required, and the use of a register asm var to feed the inline asm, which is in accordance with documented practice in the gcc manual. (This now leaves only one difference between the glibc version and the version I posted, which is the use of a +a write-only output constraint paired with a numeric 0 matching input constraint in glibc's version compared with a single output operand using the read-write constraing =a in my version. These should in fact be exactly identical in terms of what they indicate to reload, in any case.) I have also manually inspected the generated assembly from thread.cc and shared.cc in a cygwin DLL build and verified that it is correct and efficient, and have installed the resulting DLL and retested all Thomas Stalder's testcases and the previously intermittently failing pthread7-rope testcase from libstdc++ testsuite. Committed with this ChangeLog: winsup/cygwin/ChangeLog * winbase.h (ilockexch): Fix asm constraints. (ilockcmpexch): Likewise. Libstdc++ plan after the weekend. Cheers all! cheers, DaveK ? winsup/cygwin/cygwin-cxx.h ? winsup/cygwin/mutex Index: winsup/cygwin/winbase.h === RCS file: /cvs/src/src/winsup/cygwin/winbase.h,v retrieving revision 1.14 diff -p -u -r1.14 winbase.h --- winsup/cygwin/winbase.h 12 Jul 2008 18:09:17 - 1.14 +++ winsup/cygwin/winbase.h 5 Jun 2009 13:05:08 - @@ -38,22 +38,31 @@ ilockdecr (volatile long *m) extern __inline__ long ilockexch (volatile long *t, long v) { - register int __res; - __asm__ __volatile__ (\n\ -1: lock cmpxchgl %3,(%1)\n\ - jne 1b\n\ - : =a (__res), =q (t): 1 (t), q (v), 0 (*t): cc); - return __res; + return + ({ +register __typeof (*t) ret __asm (%eax); +__asm __volatile (\n + 1: lock cmpxchgl %2, %1\n + jne 1b\n + : =a (ret), =m (*t) + : r (v), m (*t), 0 (*t) + : memory); +ret; + }); } extern __inline__ long ilockcmpexch (volatile long *t, long v, long c) { - register int __res; - __asm__ __volatile__ (\n\ - lock cmpxchgl %3,(%1)\n\ - : =a (__res), =q (t) : 1 (t), q (v), 0 (c): cc); - return __res; + return + ({ +register __typeof (*t) ret __asm (%eax); +__asm __volatile (lock cmpxchgl %2, %1 + : =a (ret), =m (*t) + : r (v), m (*t), 0 (c) + : memory); +ret; + }); } #undef InterlockedIncrement
Re: [PATCH?] Separate pthread patches, #2 take 2.
On Jun 5 15:04, Dave Korn wrote: Corinna Vinschen wrote: On Jun 4 02:52, Dave Korn wrote: Dave Korn wrote: Dave Korn wrote: The attached patch implements ilockexch and ilockcmpexch, using the inline asm definition from __arch_compare_and_exchange_val_32_acq in glibc-2.10.1/sysdeps/i386/i486/bits/atomic.h, trivially expanded inline rather than in its original preprocessor macro form. It generates incorrect code. This much looks like it's probably a compiler bug. Let's see whether anyone else agrees: http://gcc.gnu.org/ml/gcc/2009-06/msg00053.html When you checked in this change, I'll create a 1.7.0-49 test release. This is the final version I committed. It is the glibc version, with the addition of the memory clobber, which that discussion revealed was absolutely required, and the use of a register asm var to feed the inline asm, which is in accordance with documented practice in the gcc manual. (This now leaves only one difference between the glibc version and the version I posted, which is the use of a +a write-only output constraint paired with a numeric 0 matching input constraint in glibc's version compared with a single output operand using the read-write constraing =a in my version. These should in fact be exactly identical in terms of what they indicate to reload, in any case.) I have also manually inspected the generated assembly from thread.cc and shared.cc in a cygwin DLL build and verified that it is correct and efficient, and have installed the resulting DLL and retested all Thomas Stalder's testcases and the previously intermittently failing pthread7-rope testcase from libstdc++ testsuite. Committed with this ChangeLog: Cool, thank you. Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat
Re: [PATCH?] Separate pthread patches, #2 take 2.
On Fri, Jun 05, 2009 at 03:04:59PM +0100, Dave Korn wrote: winsup/cygwin/ChangeLog * winbase.h (ilockexch): Fix asm constraints. (ilockcmpexch): Likewise. Thanks for seeing this through. It was obviously a lot of work. cgf
Re: [PATCH?] Separate pthread patches, #2 take 2.
Christopher Faylor wrote: On Fri, Jun 05, 2009 at 03:04:59PM +0100, Dave Korn wrote: winsup/cygwin/ChangeLog * winbase.h (ilockexch): Fix asm constraints. (ilockcmpexch): Likewise. Thanks for seeing this through. It was obviously a lot of work. cgf I appreciate the need to be diligent when working so deep in the bowels of the fundaments, so no problem :) cheers, DaveK
Re: [PATCH?] Separate pthread patches, #2 take 2.
On Jun 4 02:52, Dave Korn wrote: Dave Korn wrote: Dave Korn wrote: The attached patch implements ilockexch and ilockcmpexch, using the inline asm definition from __arch_compare_and_exchange_val_32_acq in glibc-2.10.1/sysdeps/i386/i486/bits/atomic.h, trivially expanded inline rather than in its original preprocessor macro form. It generates incorrect code. This much looks like it's probably a compiler bug. Let's see whether anyone else agrees: http://gcc.gnu.org/ml/gcc/2009-06/msg00053.html When you checked in this change, I'll create a 1.7.0-49 test release. Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat
Re: [PATCH?] Separate pthread patches, #2 take 2.
Corinna Vinschen wrote: On Jun 4 02:52, Dave Korn wrote: Dave Korn wrote: Dave Korn wrote: The attached patch implements ilockexch and ilockcmpexch, using the inline asm definition from __arch_compare_and_exchange_val_32_acq in glibc-2.10.1/sysdeps/i386/i486/bits/atomic.h, trivially expanded inline rather than in its original preprocessor macro form. It generates incorrect code. This much looks like it's probably a compiler bug. Let's see whether anyone else agrees: http://gcc.gnu.org/ml/gcc/2009-06/msg00053.html When you checked in this change, I'll create a 1.7.0-49 test release. Am off to a meeting now; won't get a chance until late tonight. cheers, DaveK
[PATCH?] Separate pthread patches, #2 take 2.
The attached patch implements ilockexch and ilockcmpexch, using the inline asm definition from __arch_compare_and_exchange_val_32_acq in glibc-2.10.1/sysdeps/i386/i486/bits/atomic.h, trivially expanded inline rather than in its original preprocessor macro form. It generates incorrect code. The sequence discussed before, 126 do 127 node-next = head; 128 while (InterlockedCompareExchangePointer (head, node, node-next) != node-next); is now compiled down to this loop: L186: .loc 3 127 0 movl__ZN13pthread_mutex7mutexesE+8, %edx # mutexes.head, D.28599 .loc 2 58 0 movl%edx, %eax # D.28599, tmp68 /APP # 58 /gnu/winsup/src/winsup/cygwin/winbase.h 1 lock cmpxchgl %ebx, __ZN13pthread_mutex7mutexesE+8 # this, # 0 2 /NO_APP movl%eax, -12(%ebp) # tmp68, ret .loc 2 59 0 movl-12(%ebp), %eax # ret, D.28596 .loc 3 126 0 cmpl%eax, %edx # D.28596, D.28599 jne L186 #, movl%edx, 36(%ebx) # D.28599, variable.next ... which is in fact the equivalent of 126 do 127 ; 128 while (InterlockedCompareExchangePointer (head, node, node-next) != node-next); (126) node-next = head; As it caches the values of head in %edx during the spin loop, and only stores it to node-next after having overwritten *head with node, there is a short window after the new node is linked to the front of the chain during which its chain pointer is incorrect. Not OK for head? cheers, DaveK Index: winsup/cygwin/winbase.h === RCS file: /cvs/src/src/winsup/cygwin/winbase.h,v retrieving revision 1.14 diff -p -u -r1.14 winbase.h --- winsup/cygwin/winbase.h 12 Jul 2008 18:09:17 - 1.14 +++ winsup/cygwin/winbase.h 3 Jun 2009 22:54:38 - @@ -34,27 +34,31 @@ ilockdecr (volatile long *m) : =r (__res), =m (*m): m (*m): cc); return __res; } - -extern __inline__ long -ilockexch (volatile long *t, long v) -{ - register int __res; - __asm__ __volatile__ (\n\ -1: lock cmpxchgl %3,(%1)\n\ - jne 1b\n\ - : =a (__res), =q (t): 1 (t), q (v), 0 (*t): cc); - return __res; -} - -extern __inline__ long -ilockcmpexch (volatile long *t, long v, long c) -{ - register int __res; - __asm__ __volatile__ (\n\ - lock cmpxchgl %3,(%1)\n\ - : =a (__res), =q (t) : 1 (t), q (v), 0 (c): cc); - return __res; -} + +extern __inline__ long +ilockexch (volatile long *t, long v) +{ + return ({ + __typeof (*t) ret; + __asm __volatile (1: lock cmpxchgl %2, %1\n + jne 1b\n + : =a (ret), =m (*t) + : r (v), m (*t), 0 (*t)); + ret; + }); +} + +extern __inline__ long +ilockcmpexch (volatile long *t, long v, long c) +{ + return ({ + __typeof (*t) ret; + __asm __volatile (lock cmpxchgl %2, %1 + : =a (ret), =m (*t) + : r (v), m (*t), 0 (c)); + ret; + }); +} #undef InterlockedIncrement #define InterlockedIncrement ilockincr
Re: [PATCH?] Separate pthread patches, #2 take 2.
Dave Korn wrote: The attached patch implements ilockexch and ilockcmpexch, using the inline asm definition from __arch_compare_and_exchange_val_32_acq in glibc-2.10.1/sysdeps/i386/i486/bits/atomic.h, trivially expanded inline rather than in its original preprocessor macro form. It generates incorrect code. This much looks like it's probably a compiler bug. This version, compiled with current GCC HEAD, generates the same results as in the take 3 version, without needing the explicit memory clobber added: L469: .loc 2 127 0 movl__ZN13pthread_mutex7mutexesE+8, %eax # mutexes.head, D.30413 movl%eax, 36(%ebx) # D.30413, variable.next .loc 5 58 0 /APP # 58 /gnu/winsup/src/winsup/cygwin/winbase.h 1 lock cmpxchgl %ebx, __ZN13pthread_mutex7mutexesE+8 # this, # 0 2 /NO_APP movl%eax, -12(%ebp) # tmp76, ret .loc 5 59 0 movl-12(%ebp), %eax # ret, D.30414 .loc 2 126 0 cmpl%eax, 36(%ebx) # D.30414, variable.next jne L469 #, ... right down to the unoptimised register motion. This is what we would have hoped to see: the memory clobber ought to be superfluous, and the write-output operand in *t should have told GCC not to move the store to node-next after the loop. I checked 4.3.3; it behaves the same as 4.3.2, i.e. it incorrectly lowers the store without a memory clobber present. cheers, DaveK