I would ask David S. Miller about the SPARC asm stuff - he seems to be the resident SPARC genius and Linux kernel maintainer.

On Fri, Jun 5, 2015 at 4:18 PM, James Y Knight wrote:
>
> On Jun 4, 2015, at 11:07 AM, James Y Knight wrote:
> GLibc
> =====
> After that, everything seemed to be going fine, except that programs like
> GCC would randomly segfault and give parse errors. This has been reported
> before, e.g. http://thread.gmane.org/gmane.linux.ports.sparc/16835, from
> 2 years ago. Things were stable enough to use interactively, if you're
> willing to keep retrying a build until it works, but not stable enough to
> use for any autobuild system.
>
> After getting a hint from Aurelien that disabling optimized memcpy
> routines in glibc (eglibc 2.19-1, on Wed, 04 Jun 2014 20:32:06 +0200) had
> improved, but did not fix, the problem, I started looking into that....
>
> ...And found that recompiling glibc, disabling the sparcv9 optimizations
> (that is: eliminating debian/patches/sparc/local-sparcv9-target.diff),
> *appears* to have completely fixed the stability issue!
>
> To try to verify that, I ran a loop building and rebuilding 'clang' (with
> full "ninja" parallelism) overnight, and it's had zero crashes in all 14
> builds of clang that it got through. Prior to fixing glibc, at least one of
> the ~2300 build steps (gcc/as/ld) was sure to crash unreproducibly.
>
> It'd be great if someone wants to try to figure out exactly /which/ of the
> asm routines in the various sysdeps/**/sparc32/sparcv9 are broken, to
> narrow down the problem better, too. I highly suspect there's just
> something wrong in one or more of the hand-written asm files, but it's
> certainly possible there's some wider problem that the sparcv9
> optimizations of glibc (but nothing else I've seen so far) just happen to
> expose.
>
> So, bad news and good news:
>
> Bad News: the above solution of simply disabling sparcv9 breaks some
> things (other than gcc). It breaks something about atomics or semaphores,
> likely due to a mismatch of expectations between libc and other things (the
> sparc32 routines, when *NOT* compiled in a shared library, dynamically
> choose between the v8 and v9 ways of doing things, so it's entirely
> reasonable to assume that doing it the v8 way cannot work right).
>
> Good News:
>
> My next attempt at a fix is to just disable the optimized string ops:
>     rm sysdeps/sparc/sparc32/sparcv9/*mem* sysdeps/sparc/sparc32/sparcv9/*st*
> That seems to still have fixed the random gcc crashes, AND doesn't break
> other things. :)
>
>
> Looking into what the deleted routines are doing that's "interesting":
>
> * memcpy and memset:
>
> They're using LDBLOCKF STBLOCKF "block copy" instructions, which are:
> 1) Not actually part of the Sparcv9 standard instruction set, but rather
> are processor-specific (although these processor-specific instructions
> have been implemented since the UltraSPARC I).
> "The LDBLOCKF instruction is intended to be a processor-specific
> instruction, which may or may not be implemented in future Oracle SPARC
> Architecture implementations. Therefore, it should only be used in
> platform-specific dynamically-linked libraries or in software created by a
> runtime code generator that is aware of the specific virtual processor
> implementation on which it is executing."
>
> 2) Marked deprecated.
> "The LDBLOCKF instructions are deprecated and should not be used in
> new software. A sequence of LDDF instructions should be used instead."
>
> 3) Don't follow the normal TSO memory model ordering that everything else
> does; they require explicit MEMBARs in the right places to ensure even
> *single-thread/cpu* memory ordering correctness.
> "Block operations do not generally conform to dependence order on the
> issuing virtual processor; that is, no read-after-write or write-after-read
> checking occurs between block loads and stores. Explicit MEMBARs are
> required to enforce dependence ordering between block operations that
> reference the same address."
>
> It certainly looks like the author of those routines *tried* to do the
> right thing w.r.t. inserting membar instructions in the right place, but I
> can easily imagine it's wrong somehow. And it is entirely plausible that
> the behavior would be hardware-generation specific, since it has, by
> design, weird hardware-specific memory semantics. I'm placing my bets on
> this one being the problem.
>
> * memchr, memcmp, strcmp, strcpy, etc.
>
> These are using a nonfaulting load instruction. The nonfaulting load
> doesn't actually mean the hardware doesn't fault on loading from an
> unmapped page. Actually, unmapped pages still cause a fault, but the fault
> is supposed to be handled by the OS. It's also possible to map pages as
> "for use by nonfaulting loads only" (Linux doesn't appear to do this).
>
> That's a rare instruction -- not generated by GCC I think, so I could
> imagine there being a bug in the fault handler for it. I think that's less
> likely though, since it doesn't seem like it'd be CPU-architecture specific.
>
> James
>
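
One possible way to start narrowing down which routine is at fault, as asked above, is a differential test: run the libc memcpy/memset against byte-at-a-time reference loops over random lengths and alignments and count disagreements. The following is only a minimal sketch under assumptions of mine. Every name in it (count_memcpy_mismatches, count_memset_mismatches, MAX_LEN, PAD and the helpers) is hypothetical and not taken from glibc, and the 4096-byte length cap is arbitrary; it may need raising if the block-copy path is only taken for larger copies.

```c
/* Differential test sketch: libc memcpy/memset vs. byte-wise references.
 * All names here are hypothetical; none of this comes from glibc. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PAD = 64, MAX_LEN = 4096, BUF_LEN = MAX_LEN + 2 * PAD };

static unsigned char src_buf[BUF_LEN], dst_buf[BUF_LEN], ref_buf[BUF_LEN];

/* Call through volatile function pointers so the compiler cannot
 * substitute its own inline expansion for the library routine. */
static void *(*volatile libc_memcpy)(void *, const void *, size_t) = memcpy;
static void *(*volatile libc_memset)(void *, int, size_t) = memset;

/* Small xorshift PRNG so a failing run can be repeated from its seed. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Reference loops use volatile pointers so the optimizer does not
 * turn them back into calls to the routines under test. */
static void ref_fill(volatile unsigned char *p, unsigned char v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = v;
}

static void ref_copy(volatile unsigned char *d,
                     const volatile unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

static int same_bytes(const volatile unsigned char *a,
                      const volatile unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Returns how many iterations produced a result differing from the
 * reference copy. */
unsigned long count_memcpy_mismatches(unsigned long iterations, uint32_t seed)
{
    uint32_t rng = seed ? seed : 1;  /* xorshift state must be nonzero */
    unsigned long bad = 0;

    for (unsigned long it = 0; it < iterations; it++) {
        size_t src_off = next_rand(&rng) % PAD;
        size_t dst_off = next_rand(&rng) % PAD;
        size_t len = next_rand(&rng) % (MAX_LEN + 1);
        assert(src_off + len <= BUF_LEN && dst_off + len <= BUF_LEN);

        for (size_t i = 0; i < BUF_LEN; i++)
            src_buf[i] = (unsigned char)next_rand(&rng);
        ref_fill(dst_buf, 0xAA, BUF_LEN);
        ref_fill(ref_buf, 0xAA, BUF_LEN);

        libc_memcpy(dst_buf + dst_off, src_buf + src_off, len);
        ref_copy(ref_buf + dst_off, src_buf + src_off, len);

        /* Compare whole buffers so stray writes outside the target
         * range are caught as well. */
        if (!same_bytes(dst_buf, ref_buf, BUF_LEN))
            bad++;
    }
    return bad;
}

/* Same idea for memset: random offset, length and fill byte. */
unsigned long count_memset_mismatches(unsigned long iterations, uint32_t seed)
{
    uint32_t rng = seed ? seed : 1;
    unsigned long bad = 0;

    for (unsigned long it = 0; it < iterations; it++) {
        size_t off = next_rand(&rng) % PAD;
        size_t len = next_rand(&rng) % (MAX_LEN + 1);
        unsigned char v = (unsigned char)next_rand(&rng);
        assert(off + len <= BUF_LEN);

        ref_fill(dst_buf, 0x55, BUF_LEN);
        ref_fill(ref_buf, 0x55, BUF_LEN);

        libc_memset(dst_buf + off, v, len);
        ref_fill(ref_buf + off, v, len);

        if (!same_bytes(dst_buf, ref_buf, BUF_LEN))
            bad++;
    }
    return bad;
}
```

Calling these from a small main() that prints the counts, linked once against the stock glibc and once against the rebuilt one, would show whether the two behave differently. A clean run proves little, though: the crashes described above were random and showed up under heavy parallel builds, so running several copies at once on a loaded machine is probably needed before drawing any conclusion.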

