Re: [gem5-dev] Failed SPARC test

Steve Reinhardt Sat, 29 Oct 2011 17:36:07 -0700

Bleah, this is ugly!  Reading that one bug report Gabe linked to, it sounds
like -frounding-math is supposed to make this work, but it's not correctly
implemented, and as a result there's really no straightforward way to make
this work.  I think that should be documented somewhere so that one day, if
-frounding-math does get implemented properly, we can start relying on it
and not on whatever hack we come up with.


Another idea, assuming m5_fesetround() isn't inlined, would be to have it
accept a double argument that it just passes back unmodified.  Then you
could do something like:

Frs1s = m5_fesetround(newrnd, Frs1s);
Frds = Frs1s + Frs2s;
Frds = m5_fesetround(oldrnd, Frds);

Would that work?
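
For reference, a minimal sketch of that pass-through idea. The names
setRoundImpl, setRoundPassthrough and addWithRounding are made up for
illustration and are not gem5 code. The helper takes a host <cfenv> mode
directly rather than going through m5_round_ops, and the volatile function
pointer is only one assumed way to make sure the call stays opaque to the
optimizer (a plain non-inlined function might still be analyzed):

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: sets the host rounding mode and hands 'val' back
// unchanged. fesetround() returns 0 on success.
static double setRoundImpl(int hostMode, double val)
{
    const int rc = std::fesetround(hostMode);
    assert(rc == 0);
    (void)rc;
    return val;
}

// Assumption: calling through a volatile pointer keeps the compiler from
// learning that the return value is just the argument.
static double (*volatile setRoundPassthrough)(int, double) = setRoundImpl;

double addWithRounding(int hostMode, double a, double b)
{
    const int oldMode = std::fegetround();
    a = setRoundPassthrough(hostMode, a);    // the add now depends on this call
    double sum = a + b;
    sum = setRoundPassthrough(oldMode, sum); // the restore needs the sum first
    return sum;
}
```

The point is purely data flow: the add consumes the result of the first call
and feeds the second, so it should have nowhere else to go. Something like
addWithRounding(FE_UPWARD, 1.0, 1e-20) would be a quick way to check it.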

Steve

On Sat, Oct 29, 2011 at 4:51 PM, Gabe Black <[email protected]> wrote:

> I don't think either will work because it's not the optimizations in
> those functions or the functions' order relative to each other or the
> asms, it's the position of the add relative to the asms. Since the add
> can move around wherever, it doesn't matter if the calls to fesetround
> are bounded by the asms. We could potentially mark the execute function
> with a different optimization level though. That might work. Also, I
> have that filterDoubles function in there that finds fp operands that
> are doubles and builds them from or breaks them down into single floats.
> We could possibly piggyback on that to add in asms with the right
> properties like in ARM. It's a bit gross, but like you said I don't know
> if we can avoid that.
>
> Gabe
>
> On 10/29/11 16:31, Ali Saidi wrote:
> > If we go down the path below, a slightly less hacky option might be to
> > just make gcc compile the entire fenv file without optimization, although
> > perhaps that is insufficient....
> >
> > Ali
> >
> > On Oct 29, 2011, at 6:30 PM, Ali Saidi wrote:
> >
> >> What about making m5_fesetround() and m5_fegetround() modify memory and
> >> thus prevent reordering?
> >>
> >> Something like:
> >>
> >> volatile int dummy_compiler;
> >>
> >> void m5_fesetround(int rm)
> >> {
> >>    assert(rm >= 0 && rm < 4);
> >>    dummy_compiler++;
> >>    fesetround(m5_round_ops[rm]);
> >>    dummy_compiler++;
> >> }
> >>
> >> int m5_fegetround()
> >> {
> >>    int x;
> >>    dummy_compiler++;
> >>    int rm = fegetround();
> >>    dummy_compiler++;
> >>    for(x = 0; x < 4; x++)
> >>        if (m5_round_ops[x] == rm)
> >>            return x;
> >>    abort();
> >>    return 0;
> >> }
> >>
> >> Would that just fix it? Maybe m5_round_ops and rm could be made
> >> volatile instead?
> >>
> >> Another possible solution and hack, but I think we're into hack
> >> territory no matter what since gcc seems brain damaged in this regard:
> >>
> >> #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 4) // 4.4 or newer
> >> #pragma GCC push_options
> >> #pragma GCC optimize ("O0")
> >>
> >> // m5_fe* goes here
> >>
> >> #pragma GCC pop_options
> >> #endif
> >>
> >>
> >> A third option would be something like
> >>
> >> void __attribute__((optimize("O0"))) m5_fesetround(int rm)...
> >>
> >> Ali
> >>
> >>
> >> On Oct 29, 2011, at 4:59 PM, Gabe Black wrote:
> >>
> >>> http://permalink.gmane.org/gmane.comp.gcc.help/38146
> >>>
> >>> On 10/29/11 14:21, Gabe Black wrote:
> >>>> Yes, it doesn't work either. What makes the ARM asm statements work is
> >>>> that they have input and output arguments. That ties them into the data
> >>>> flow graph having to do with those values, and they act as anchors,
> >>>> forcing values to be produced by the time you get to the asm and not to
> >>>> be consumed before it. Here we're just saying not to trust memory from
> >>>> before the asm, and since it's not *in* memory, the compiler merrily
> >>>> ignores us. I had this problem with ARM initially too until I added the
> >>>> arguments. I've tried making floating point variables volatile to ensure
> >>>> they're in memory, and that doesn't work either. I think the actual
> >>>> semantics of volatile are a little different than what most people
> >>>> assume, although I don't remember what the distinction is. One option
> >>>> might be to make the FP operation itself a virtual function. Then gcc
> >>>> won't know what it does and will be less able to break things by moving
> >>>> things around.
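
For reference, a minimal sketch of the kind of operand-carrying asm described
above. The names anchorValue and addAnchored are hypothetical, and the "+m"
constraint is an assumption picked because it is accepted by GCC and Clang on
most targets; the actual ARM code in gem5 may use different constraints. This
relies on the GCC extended asm extension and is not a definitive fix:

```cpp
#include <cfenv>

// Empty asm that claims to read and write 'val'. The compiler has to have
// the value computed before this point and cannot reuse an older copy after
// it. The "memory" clobber keeps it ordered against the fesetround() calls.
static inline void anchorValue(double &val)
{
    __asm__ __volatile__("" : "+m"(val) : : "memory");
}

double addAnchored(int hostMode, double a, double b)
{
    const int oldMode = std::fegetround();
    std::fesetround(hostMode);
    anchorValue(a);           // operands cannot be consumed before here
    anchorValue(b);
    double sum = a + b;
    anchorValue(sum);         // sum must be produced before here
    std::fesetround(oldMode);
    return sum;
}
```

Compared with a bare "memory" clobber, the difference is that the values
themselves are named as operands, so it should not matter whether they would
otherwise have lived in registers.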
> >>>>
> >>>> It seems like a pretty severe deficiency of gcc that there's no way to
> >>>> make fesetround work properly. It becomes nearly worthless because you
> >>>> can't make any assumptions about when it will actually be in effect.
> >>>> That's what we have to work with, though.
> >>>>
> >>>> Gabe
> >>>>
> >>>> On 10/29/11 13:53, Ali Saidi wrote:
> >>>>> I was just about to send a message about -frounding-math when I saw
> >>>>> yours. Interesting that the asm barriers appear to work with ARM. It
> >>>>> feels like there should be an explicit code motion barrier. Anyway,
> >>>>> have we tried compiling with the -frounding-math flag?
> >>>>>
> >>>>>
> >>>>>
> >>>>> Ali
> >>>>>
> >>>>> Sent from my ARM powered device
> >>>>>
> >>>>> On Oct 29, 2011, at 3:44 PM, Gabe Black <[email protected]> wrote:
> >>>>>
> >>>>>> Here's a discussion on the gcc mailing list of the thing I was talking
> >>>>>> about before that's supposed to fix this, I think.
> >>>>>>
> >>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678
> >>>>>>
> >>>>>> Our barriers aren't working since Frs1s, Frs2s, and Frds could all be
> >>>>>> registers.
> >>>>>>
> >>>>>> Gabe
> >>>>>>
> >>>>>> On 10/29/11 13:31, Gabe Black wrote:
> >>>>>>> Here is some suspect assembly from Fadds for the atomic simple CPU
> >>>>>>>
> >>>>>>> 0x00000000008d538e <+382>:   callq  0x4cab70 <m5_fegetround>
> >>>>>>> 0x00000000008d5393 <+387>:   mov    %eax,%r15d
> >>>>>>> 0x00000000008d5396 <+390>:   mov    %r14d,%edi
> >>>>>>> 0x00000000008d5399 <+393>:   callq  0x4cab30 <m5_fesetround>
> >>>>>>> 0x00000000008d539e <+398>:   mov    %r15d,%edi
> >>>>>>> 0x00000000008d53a1 <+401>:   callq  0x4cab30 <m5_fesetround>
> >>>>>>>
> >>>>>>>
> >>>>>>> This is, more or less, from the following code.
> >>>>>>>
> >>>>>>>
> >>>>>>>  __asm__ __volatile__ ("" ::: "memory");
> >>>>>>>  int oldrnd = m5_fegetround();
> >>>>>>>  __asm__ __volatile__ ("" ::: "memory");
> >>>>>>>  m5_fesetround(newrnd);
> >>>>>>>  __asm__ __volatile__ ("" ::: "memory");
> >>>>>>>  Frds = Frs1s + Frs2s;
> >>>>>>>  __asm__ __volatile__ ("" ::: "memory");
> >>>>>>>  m5_fesetround(oldrnd);
> >>>>>>>  __asm__ __volatile__ ("" ::: "memory");
> >>>>>>>
> >>>>>>>
> >>>>>>> Note that the addition was moved out of the middle and fesetround was
> >>>>>>> called twice back to back, once to set the new rounding mode, and once
> >>>>>>> to set it right back again.
> >>>>>>>
> >>>>>>> Gabe
> >>>>>>>
> >>>>>>> On 10/28/11 08:31, Ali Saidi wrote:
> >>>>>>>> I'm still not 100% convinced that this is it. I agree it's highly
> >>>>>>>> likely, but it could be some other code movement or a bug in the
> >>>>>>>> optimizer (we have seen them before). I wonder if you can selectively
> >>>>>>>> optimize functions. Maybe a good start is compiling everything -O3
> >>>>>>>> except the atomic execute function and making sure it still works.
> >>>>>>>>
> >>>>>>>> Ali
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, 28 Oct 2011 07:38:59 -0700, Steve Reinhardt <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>> Yes, I think there exists at least one software IEEE FP
> >>>>>>>>> implementation out there that we had talked about incorporating at
> >>>>>>>>> some point (long ago). Unfortunately, as is discussed below, that's
> >>>>>>>>> not even the issue, as we really want to model the not-quite-IEEE
> >>>>>>>>> (or in the case of x87, not-even-close) semantics of the hardware
> >>>>>>>>> alone, which would require more effort.
> >>>>>>>>>
> >>>>>>>>> If someone really cared about modeling the ISA FP support precisely
> >>>>>>>>> that would be an interesting project, and if it was done cleanly
> >>>>>>>>> (probably with the option to turn it on or off) we'd be glad to
> >>>>>>>>> incorporate it.
> >>>>>>>>>
> >>>>>>>>> Ironically I think the issue here is not that the HW FP is not good
> >>>>>>>>> enough for our purposes, it's that the software stack doesn't give
> >>>>>>>>> us clean enough access to the HW facilities (gcc in particular,
> >>>>>>>>> though C itself may share part of the blame).
> >>>>>>>>>
> >>>>>>>>> Steve
> >>>>>>>>>
> >>>>>>>>> On Thu, Oct 27, 2011 at 11:36 PM, Gabe Black <[email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I think there was talk of an FP emulation library a long time ago
> >>>>>>>>>> (before I was involved with M5) but we decided not to do something
> >>>>>>>>>> like that for some reason. Using regular built in FP support gets
> >>>>>>>>>> us most of the way with minimal hassle, but then there are
> >>>>>>>>>> situations like this where it really causes trouble. I presume the
> >>>>>>>>>> prior discussion might have been about whether getting most of the
> >>>>>>>>>> way there was good enough, and that it's simpler.
> >>>>>>>>>>
> >>>>>>>>>> Gabe
> >>>>>>>>>>
> >>>>>>>>>> On 10/27/11 07:43, Radivoje Vasiljevic wrote:
> >>>>>>>>>>> ----- Original Message ----- From: "Gabe Black" <[email protected]>
> >>>>>>>>>>> To: <[email protected]>
> >>>>>>>>>>> Sent: 25 October 2011 20:53
> >>>>>>>>>>> Subject: Re: [gem5-dev] Failed SPARC test
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On 10/25/11 07:46, Steve Reinhardt wrote:
> >>>>>>>>>>>>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <[email protected]>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>> [snip]
> >>>>>>>>>>>> Yeah, I think ISAs treat IEEE as a really good suggestion rather
> >>>>>>>>>>>> than a standard. ARM isn't strictly conformant, and neither is
> >>>>>>>>>>>> x86. The default rounding mode *is* standard, though, and I don't
> >>>>>>>>>>>> think is adjusted in SPARC as a result of execution. If it
> >>>>>>>>>>>> changed somehow (unless I'm forgetting where SPARC does that)
> >>>>>>>>>>>> it's a fairly significant problem. Whether instructions generate
> >>>>>>>>>>>> +/- 0 in various situations may depend on, for instance, what
> >>>>>>>>>>>> order gcc decides to put the operands. I'm not sure that it does,
> >>>>>>>>>>>> but there are all kinds of weird, subtle behaviors with FP, and
> >>>>>>>>>>>> you can't just fix how add works if x86 picked the wrong thing.
> >>>>>>>>>>>> Then you have to replace add, or semi-replace it by faking it out
> >>>>>>>>>>>> with other FP operations. If we're running real x87 instructions
> >>>>>>>>>>>> (we shouldn't be in 64 bit mode, but we still could) then those
> >>>>>>>>>>>> use 80 bit operands internally. Where and when rounding takes
> >>>>>>>>>>>> place depends on when those are moved in/out of the FPU, and will
> >>>>>>>>>>>> be different than true 64 bit operands. SSE based FP uses real 64
> >>>>>>>>>>>> bit doubles, so that should behave better. It should also be the
> >>>>>>>>>>>> default in 64 bit mode since the compiler can assume some basic
> >>>>>>>>>>>> SSE support is present.
> >>>>>>>>>>>>
> >>>>>>>>>>> What about FP emulation using integers and some kind of multiple
> >>>>>>>>>>> precision arithmetic? Then every detail could be modeled,
> >>>>>>>>>>> including x87 "floats" and "doubles" (in registers the exponent
> >>>>>>>>>>> field is still 15 bits, not 8/11, which makes a mess of
> >>>>>>>>>>> overflow/underflow, or the value goes to memory and becomes a
> >>>>>>>>>>> proper float/double). Gcc has some switches regarding that
> >>>>>>>>>>> behavior but that is very fragile (more like a suggestion to the
> >>>>>>>>>>> compiler than an enforcing option). Double rounding in x87 is a
> >>>>>>>>>>> special story because the double extended mantissa is not more
> >>>>>>>>>>> than twice as long as the one for double, so double rounding can
> >>>>>>>>>>> give different results compared to single rounding (this
> >>>>>>>>>>> situation can't happen with float vs double). One solution, for
> >>>>>>>>>>> example: splitting mantissas into two halves and performing the
> >>>>>>>>>>> operation, so all bits would be available and then any kind of
> >>>>>>>>>>> proper rounding could be enforced (real IEEE or "ISA style
> >>>>>>>>>>> IEEE"). Performing those operations is not very slow and it is
> >>>>>>>>>>> fairly ILP rich, so the slowdown is not as great as when the pure
> >>>>>>>>>>> number of instructions is compared (although to have robust code,
> >>>>>>>>>>> CPU and compiler independence, especially regarding "optimizing
> >>>>>>>>>>> code", some tests are needed to eradicate subnormals due to poor
> >>>>>>>>>>> support/trap emulation). Plus if instructions are mixed in the
> >>>>>>>>>>> right way both int and FPU units can be kept busy. The exponent
> >>>>>>>>>>> can be one short and the problem is solved. Only division can be
> >>>>>>>>>>> somewhat tricky (and slow), but it can be done too.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>> Even if the FP rounding error isn't the source of the problem,
> >>>>>>>>>>>>> it might be easiest to fix that and get it out of the way so we
> >>>>>>>>>>>>> can see what the actual problem is.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If you really want to know *why* the kernel is doing all this
> >>>>>>>>>>>>> FP, then yes, you probably need to look at the source code.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Steve
> >>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>> gem5-dev mailing list
> >>>>>>>>>>>>> [email protected]
> >>>>>>>>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
