Re: Good news on Debian Sparc port stability

Patrick Baggett Fri, 05 Jun 2015 14:40:42 -0700

I would ask David S. Miller about the sparc asm stuff - he seems to be the
resident sparc genius and Linux kernel maintainer.


On Fri, Jun 5, 2015 at 4:18 PM, James Y Knight <[email protected]> wrote:

>
> On Jun 4, 2015, at 11:07 AM, James Y Knight <[email protected]> wrote:
> GLibc
> =====
> After that, everything seemed to be going fine, except that programs like
> GCC would randomly segfault and give parse errors. This has been reported
> before, e.g. http://thread.gmane.org/gmane.linux.ports.sparc/16835, from
> 2 years ago. Things were stable enough to use interactively, if you're
> willing to keep retrying a build until it works, but not stable enough to
> use for any autobuild system.
>
> After getting a hint from Aurelien that disabling optimized memcpy
> routines in glibc (eglibc 2.19-1, on Wed, 04 Jun 2014 20:32:06 +0200) had
> improved, but not fixed, the problem, I started looking into that....
>
> ...And found that recompiling glibc, disabling the sparcv9 optimizations
> (that is: eliminating debian/patches/sparc/local-sparcv9-target.diff),
> *appears* to have completely fixed the stability issue!
>
> To try to verify that, I ran a loop building and rebuilding 'clang' (with
> full "ninja" parallelism) overnight, and it's had zero crashes in all 14
> builds of clang that it got through. Prior to fixing glibc, at least one of
> the ~2300 build steps (gcc/as/ld) was sure to crash unreproducibly.
>
> It'd be great if someone wants to try to figure out exactly /which/ of the
> asm routines in the various sysdeps/**/sparc32/sparcv9 are broken, to
> narrow down the problem better, too. I highly suspect there's just
> something wrong in one or more of the hand-written asm files, but it's
> certainly possible there's some wider problem that the sparcv9
> optimizations of glibc (but nothing else I've seen so far) just happen to
> expose.
>
> So, bad news and good news:
>
> Bad News: the above solution of simply disabling sparcv9 breaks some
> things (other than gcc). It breaks something about atomics or semaphores,
> likely due to a mismatch of expectations between libc and other things (the
> sparc32 routines, when *NOT* compiled in a shared library, dynamically
> choose between the v8 and v9 ways of doing things, so it's entirely
> reasonable to assume that doing it the v8 way cannot work right).
>
> Good News:
>
> My next attempt at a fix is to just disable the optimized string ops:
>  rm sysdeps/sparc/sparc32/sparcv9/*mem* sysdeps/sparc/sparc32/sparcv9/*st*
> That seems to still have fixed the random gcc crashes, AND doesn't break
> other things. :)
>
>
> Looking into what the deleted routines are doing that's "interesting":
>
> * memcpy and memset:
>
> They're using the LDBLOCKF and STBLOCKF "block copy" instructions, which are:
> 1) Not actually part of the Sparcv9 standard instruction set, but rather
> processor-specific (although these processor-specific instructions
> have been implemented since the UltraSPARC I).
> "The LDBLOCKF instruction is intended to be a processor-specific
> instruction, which may or may not be implemented in future Oracle SPARC
> Architecture implementations. Therefore, it should only be used in
> platform-specific dynamically-linked libraries or in software created by a
> runtime code generator that is aware of the specific virtual processor
> implementation on which it is executing."
>
> 2) Marked deprecated.
> "The LDBLOCKF instructions are deprecated and should not be used in
> new software. A sequence of LDDF instructions should be used instead."
>
> 3) Don't follow the normal TSO memory model ordering that everything else
> does; they require explicit MEMBARs in the right places to ensure even
> *single-thread/cpu* memory ordering correctness.
> "Block operations do not generally conform to dependence order on the
> issuing virtual processor; that is, no read-after-write or write-after-read
> checking occurs between block loads and stores. Explicit MEMBARs are
> required to enforce dependence ordering between block operations that
> reference the same address."
>
> It certainly looks like the author of those routines *tried* to do the
> right thing w.r.t. inserting membar instructions in the right place, but I
> can easily imagine it's wrong somehow. And it is entirely plausible that
> the behavior would be hardware-generation specific, since it has, by
> design, weird hardware-specific memory semantics. I'm placing my bets on
> this one being the problem.
>
> * memchr, memcmp, strcmp, strcpy, etc.
>
> These are using a nonfaulting load instruction. The nonfaulting load
> doesn't actually mean the hardware doesn't fault on loading from an
> unmapped page. Actually, unmapped pages still cause a fault, but the fault
> is supposed to be handled by the OS. It's also possible to map pages as
> "for use by nonfaulting loads only" (linux doesn't appear to do this).
>
> That's a rare instruction -- not generated by GCC I think, so I could
> imagine there being a bug in the fault handler for it. I think that's less
> likely though, since it doesn't seem like it'd be CPU-architecture specific.
>
> James
>
>
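
For anyone who wants to take up the request in the quoted message to narrow
down which routine is broken, here is a rough sketch of a memcpy stress test.
It is not from the original thread. It copies random lengths at random
alignments with the libc memcpy and compares the result against a plain
byte-by-byte copy. The names stress_memcpy, reference_copy, fill, differs and
next_random are made up for illustration, and the fill and compare loops are
hand-written so that the other suspect routines (memset, memcmp) stay out of
the picture. Building with gcc's -fno-builtin should help make sure the
library memcpy is really the one being called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Byte-at-a-time reference copy. The volatile pointers keep the compiler
   from turning this loop back into a call to memcpy. */
static void reference_copy(volatile unsigned char *dst,
                           const volatile unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hand-written fill, used instead of memset (also a suspect routine). */
static void fill(volatile unsigned char *p, unsigned char v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = v;
}

/* Hand-written compare, used instead of memcmp. Returns 1 on any difference. */
static int differs(const volatile unsigned char *a,
                   const volatile unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 1;
    return 0;
}

/* Small deterministic PRNG (xorshift32) so a failing run can be replayed
   with the same seed. The seed must be nonzero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: performs `iterations` copies of up to `max_len` bytes
   at random source/destination offsets and returns how many of them differ
   from the reference copy. */
static unsigned long stress_memcpy(unsigned long iterations, size_t max_len,
                                   uint32_t seed)
{
    enum { SLACK = 64 };                 /* room for random misalignment */
    assert(max_len > 0 && seed != 0);

    size_t buf_len = max_len + SLACK;
    unsigned char *src = malloc(buf_len);
    unsigned char *got = malloc(buf_len);
    unsigned char *want = malloc(buf_len);
    assert(src && got && want);

    unsigned long mismatches = 0;
    uint32_t state = seed;

    for (unsigned long it = 0; it < iterations; it++) {
        size_t src_off = next_random(&state) % SLACK;
        size_t dst_off = next_random(&state) % SLACK;
        size_t len = next_random(&state) % (max_len + 1);

        for (size_t i = 0; i < buf_len; i++)
            src[i] = (unsigned char)next_random(&state);
        fill(got, 0xAA, buf_len);
        fill(want, 0xAA, buf_len);

        memcpy(got + dst_off, src + src_off, len);          /* under test */
        reference_copy(want + dst_off, src + src_off, len); /* expected   */

        /* Compare whole buffers so stray writes outside the target count. */
        if (differs(got, want, buf_len))
            mismatches++;
    }

    free(src);
    free(got);
    free(want);
    return mismatches;
}
```

Calling something like stress_memcpy(1000000, 8192, 1) from a small main and
printing the result would be the obvious way to use it. A nonzero count would
point at memcpy. A zero count would not clear it, since the crashes described
in the quoted message only showed up under heavy parallel build load, so it
may be necessary to run several copies at once alongside other work.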
