Bleah, this is ugly! Reading that one bug report Gabe linked to, it sounds like -frounding-math is supposed to make this work, but it's not correctly implemented, and as a result there's really no straightforward way to make this work. I think that should be documented somewhere so that one day, if -frounding-math does get implemented properly, we can start relying on it and not on whatever hack we come up with.
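For reference, the ARM-style trick Gabe describes further down (asm statements with input and output arguments) might look roughly like the sketch below. This is untested against the SPARC code, and pinValue and addWithRounding are names made up for illustration; the real thing would wrap Frs1s, Frs2s and Frds. It assumes gcc-style extended asm, and the "+m" constraint forces each value through memory, so it is not free.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper (not from the thread): an empty asm that claims to
// read and write `value` in memory. The operand ties the asm into the data
// flow of `value`, so the value has to be produced before this point and
// later uses have to come after it. The "memory" clobber is there to keep
// the asm from moving across the fesetround() calls.
static inline void pinValue(float &value)
{
    __asm__ __volatile__("" : "+m"(value) : : "memory");
}

// Hypothetical stand-in for an Fadds-style body: add two floats under the
// requested rounding mode, then restore the previous mode.
float addWithRounding(float a, float b, int newrnd)
{
    int oldrnd = std::fegetround();
    int rc = std::fesetround(newrnd);
    assert(rc == 0);
    pinValue(a);    // operands only become available after the mode switch
    pinValue(b);
    float sum = a + b;
    pinValue(sum);  // the sum must exist before the old mode is restored
    rc = std::fesetround(oldrnd);
    assert(rc == 0);
    (void)rc;       // silence unused warnings when NDEBUG is defined
    return sum;
}
```

Under these assumptions, adding 1.0f and 2^-30 should give exactly 1.0f with FE_TONEAREST and something slightly larger with FE_UPWARD, which is a quick way to check that the add really stayed between the two fesetround calls.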
Another idea, assuming m5_fesetround() isn't inlined, would be to have it accept a double argument that it just passes back unmodified. Then you could do something like:

    Frs1s = m5_fesetround(newrnd, Frs1s);
    Frds = Frs1s + Frs2s;
    Frds = m5_fesetround(oldrnd, Frds);

Would that work?

Steve

On Sat, Oct 29, 2011 at 4:51 PM, Gabe Black <[email protected]> wrote:

> I don't think either will work because it's not the optimizations in
> those functions or the functions' order relative to each other or the
> asms, it's the position of the add relative to the asms. Since the add
> can move around wherever, it doesn't matter if the calls to fesetround
> are bounded by the asms. We could potentially mark the execute function
> with a different optimization level though. That might work. Also, I
> have that filterDoubles function in there that finds fp operands that
> are doubles and builds them from or breaks them down into single floats.
> We could possibly piggyback on that to add in asms with the right
> properties like in ARM. It's a bit gross, but like you said I don't know
> if we can avoid that.
>
> Gabe
>
> On 10/29/11 16:31, Ali Saidi wrote:
> > If we go down the path below, slightly less hacky might be just making
> > gcc compile the entire fenv file without optimization, although perhaps
> > that is insufficient....
> >
> > Ali
> >
> > On Oct 29, 2011, at 6:30 PM, Ali Saidi wrote:
> >
> >> What about making m5_fesetround and m5_fegetround() modify memory and
> >> thus prevent reordering?
> >>
> >> Something like:
> >>
> >> volatile int dummy_compiler;
> >>
> >> void m5_fesetround(int rm)
> >> {
> >>     assert(rm >= 0 && rm < 4);
> >>     dummy_compiler++;
> >>     fesetround(m5_round_ops[rm]);
> >>     dummy_compiler++;
> >> }
> >>
> >> int m5_fegetround()
> >> {
> >>     int x;
> >>     dummy_compiler++;
> >>     int rm = fegetround();
> >>     dummy_compiler++;
> >>     for(x = 0; x < 4; x++)
> >>         if (m5_round_ops[x] == rm)
> >>             return x;
> >>     abort();
> >>     return 0;
> >> }
> >>
> >> Would that just fix it? Maybe m5_round_ops and rm could be made
> >> volatile instead?
> >>
> >> Another possible solution and hack, but I think we're into hack
> >> territory no matter what since gcc seems brain damaged in this regard:
> >>
> >> #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 4) // 4.4 or newer
> >> #pragma GCC push_options
> >> #pragma GCC optimize ("O0")
> >>
> >> // m5_fe* goes here
> >>
> >> #pragma GCC pop_options
> >> #endif
> >>
> >>
> >> A third option would be something like
> >>
> >> void __attribute__((optimize("O0"))) m5_fesetround(int rm)...
> >>
> >> Ali
> >>
> >>
> >> On Oct 29, 2011, at 4:59 PM, Gabe Black wrote:
> >>
> >>> http://permalink.gmane.org/gmane.comp.gcc.help/38146
> >>>
> >>> On 10/29/11 14:21, Gabe Black wrote:
> >>>> Yes, it doesn't work either. What makes the ARM asm statements work is
> >>>> that they have input and output arguments. That ties them into the data
> >>>> flow graph having to do with those values, and they act as anchors,
> >>>> forcing values to be produced by the time you get to the asm and not to
> >>>> be consumed before it. Here we're just saying not to trust memory from
> >>>> before the asm, and since it's not *in* memory, the compiler merrily
> >>>> ignores us. I had this problem with ARM initially too until I added the
> >>>> arguments. I've tried making floating point variables volatile to ensure
> >>>> they're in memory, and that doesn't work either.
> >>>> I think the actual semantics of volatile are a little different than
> >>>> what most people assume, although I don't remember what the
> >>>> distinction is. One option might be to make the FP operation itself a
> >>>> virtual function. Then gcc won't know what it does and will be less
> >>>> able to break things by moving things around.
> >>>>
> >>>> It seems like a pretty severe deficiency of gcc that there's no way to
> >>>> make fesetround work properly. It becomes nearly worthless because you
> >>>> can't make any assumptions about when it will actually be in effect.
> >>>> That's what we have to work with, though.
> >>>>
> >>>> Gabe
> >>>>
> >>>> On 10/29/11 13:53, Ali Saidi wrote:
> >>>>> I was just about to send a message about -frounding-math when I saw
> >>>>> yours. Interesting that the asm barriers appear to work with ARM. It
> >>>>> feels like there should be an explicit code motion barrier. Anyway,
> >>>>> have we tried compiling with the -frounding-math flag?
> >>>>>
> >>>>> Ali
> >>>>>
> >>>>> Sent from my ARM powered device
> >>>>>
> >>>>> On Oct 29, 2011, at 3:44 PM, Gabe Black <[email protected]> wrote:
> >>>>>
> >>>>>> Here's a discussion on the gcc mailing list of the thing I was talking
> >>>>>> about before that's supposed to fix this, I think.
> >>>>>>
> >>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678
> >>>>>>
> >>>>>> Our barriers aren't working since Frs1s, Frs2s, and Frds could all be
> >>>>>> registers.
> >>>>>>
> >>>>>> Gabe
> >>>>>>
> >>>>>> On 10/29/11 13:31, Gabe Black wrote:
> >>>>>>> Here is some suspect assembly from Fadds for the atomic simple CPU
> >>>>>>>
> >>>>>>> 0x00000000008d538e <+382>: callq 0x4cab70 <m5_fegetround>
> >>>>>>> 0x00000000008d5393 <+387>: mov %eax,%r15d
> >>>>>>> 0x00000000008d5396 <+390>: mov %r14d,%edi
> >>>>>>> 0x00000000008d5399 <+393>: callq 0x4cab30 <m5_fesetround>
> >>>>>>> 0x00000000008d539e <+398>: mov %r15d,%edi
> >>>>>>> 0x00000000008d53a1 <+401>: callq 0x4cab30 <m5_fesetround>
> >>>>>>>
> >>>>>>>
> >>>>>>> This is, more or less, from the following code.
> >>>>>>>
> >>>>>>>
> >>>>>>> __asm__ __volatile__ ("" ::: "memory");
> >>>>>>> int oldrnd = m5_fegetround();
> >>>>>>> __asm__ __volatile__ ("" ::: "memory");
> >>>>>>> m5_fesetround(newrnd);
> >>>>>>> __asm__ __volatile__ ("" ::: "memory");
> >>>>>>> Frds = Frs1s + Frs2s;
> >>>>>>> __asm__ __volatile__ ("" ::: "memory");
> >>>>>>> m5_fesetround(oldrnd);
> >>>>>>> __asm__ __volatile__ ("" ::: "memory");
> >>>>>>>
> >>>>>>>
> >>>>>>> Note that the addition was moved out of the middle and fesetround was
> >>>>>>> called twice back to back, once to set the new rounding mode, and once
> >>>>>>> to set it right back again.
> >>>>>>>
> >>>>>>> Gabe
> >>>>>>>
> >>>>>>> On 10/28/11 08:31, Ali Saidi wrote:
> >>>>>>>> I'm still not 100% convinced that this is it. I agree it's highly
> >>>>>>>> likely, but it could be some other code movement or a bug in the
> >>>>>>>> optimizer (we have seen them before). I wonder if you can selectively
> >>>>>>>> optimize functions. Maybe a good start is compiling everything -O3
> >>>>>>>> except the atomic execute function and make sure it still works.
> >>>>>>>>
> >>>>>>>> Ali
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, 28 Oct 2011 07:38:59 -0700, Steve Reinhardt <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>> Yes, I think there exists at least one software IEEE FP
> >>>>>>>>> implementation out there that we had talked about incorporating at
> >>>>>>>>> some point (long ago). Unfortunately, as is discussed below, that's
> >>>>>>>>> not even the issue, as we really want to model the not-quite-IEEE
> >>>>>>>>> (or in the case of x87, not-even-close) semantics of the hardware
> >>>>>>>>> alone, which would require more effort.
> >>>>>>>>>
> >>>>>>>>> If someone really cared about modeling the ISA FP support precisely
> >>>>>>>>> that would be an interesting project, and if it was done cleanly
> >>>>>>>>> (probably with the option to turn it on or off) we'd be glad to
> >>>>>>>>> incorporate it.
> >>>>>>>>>
> >>>>>>>>> Ironically I think the issue here is not that the HW FP is not good
> >>>>>>>>> enough for our purposes, it's that the software stack doesn't give
> >>>>>>>>> us clean enough access to the HW facilities (gcc in particular,
> >>>>>>>>> though C itself may share part of the blame).
> >>>>>>>>>
> >>>>>>>>> Steve
> >>>>>>>>>
> >>>>>>>>> On Thu, Oct 27, 2011 at 11:36 PM, Gabe Black <[email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I think there was talk of an FP emulation library a long time ago
> >>>>>>>>>> (before I was involved with M5) but we decided not to do something
> >>>>>>>>>> like that for some reason. Using regular built in FP support gets
> >>>>>>>>>> us most of the way with minimal hassle, but then there are
> >>>>>>>>>> situations like this where it really causes trouble. I presume the
> >>>>>>>>>> prior discussion might have been about whether getting most of the
> >>>>>>>>>> way there was good enough, and that it's simpler.
> >>>>>>>>>>
> >>>>>>>>>> Gabe
> >>>>>>>>>>
> >>>>>>>>>> On 10/27/11 07:43, Radivoje Vasiljevic wrote:
> >>>>>>>>>>> ----- Original Message ----- From: "Gabe Black" <[email protected]>
> >>>>>>>>>>> To: <[email protected]>
> >>>>>>>>>>> Sent: 25 October 2011 20:53
> >>>>>>>>>>> Subject: Re: [gem5-dev] Failed SPARC test
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On 10/25/11 07:46, Steve Reinhardt wrote:
> >>>>>>>>>>>>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <[email protected]>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>> [snip]
> >>>>>>>>>>>> Yeah, I think ISAs treat IEEE as a really good suggestion rather
> >>>>>>>>>>>> than a standard. ARM isn't strictly conformant, and neither is
> >>>>>>>>>>>> x86. The default rounding mode *is* standard, though, and I don't
> >>>>>>>>>>>> think is adjusted in SPARC as a result of execution. If it
> >>>>>>>>>>>> changed somehow (unless I'm forgetting where SPARC does that)
> >>>>>>>>>>>> it's a fairly significant problem. Whether instructions generate
> >>>>>>>>>>>> +/- 0 in various situations may depend on, for instance, what
> >>>>>>>>>>>> order gcc decides to put the operands. I'm not sure that it does,
> >>>>>>>>>>>> but there are all kinds of weird, subtle behaviors with FP, and
> >>>>>>>>>>>> you can't just fix how add works if x86 picked the wrong thing.
> >>>>>>>>>>>> Then you have to replace add, or semi-replace it by faking it out
> >>>>>>>>>>>> with other FP operations. If we're running real x87 instructions
> >>>>>>>>>>>> (we shouldn't be in 64 bit mode, but we still could) then those
> >>>>>>>>>>>> use 80 bit operands internally. Where and when rounding takes
> >>>>>>>>>>>> place depends on when those are moved in/out of the FPU, and will
> >>>>>>>>>>>> be different than true 64 bit operands.
> >>>>>>>>>>>> SSE based FP uses real 64 bit doubles, so that should behave
> >>>>>>>>>>>> better. It should also be the default in 64 bit mode since the
> >>>>>>>>>>>> compiler can assume some basic SSE support is present.
> >>>>>>>>>>>>
> >>>>>>>>>>> What about FP emulation using integers and some kind of multiple
> >>>>>>>>>>> precision arithmetic? Then every detail could be modeled, including
> >>>>>>>>>>> x87 "floats" and "doubles" (in registers the exponent field is still
> >>>>>>>>>>> 15 bits, not 8/11, which makes a mess of overflow/underflow, or it
> >>>>>>>>>>> will go to memory and be a proper float/double). Gcc has some
> >>>>>>>>>>> switches regarding that behavior but that is very fragile (more
> >>>>>>>>>>> like a suggestion to the compiler than an enforcing option).
> >>>>>>>>>>> Double rounding in x87 is a special story because the double
> >>>>>>>>>>> extended mantissa is not more than twice as long as the one for
> >>>>>>>>>>> double, so double rounding can give different results compared to
> >>>>>>>>>>> single rounding (this situation can't happen with float vs
> >>>>>>>>>>> double). One solution, for example: splitting mantissas into two
> >>>>>>>>>>> halves and performing the operation; all bits would be available
> >>>>>>>>>>> and then any kind of proper rounding could be enforced (real ieee
> >>>>>>>>>>> or "isa style ieee"). Performing those operations is not very slow
> >>>>>>>>>>> and it is fairly ILP rich, so the slowdown is not as great as when
> >>>>>>>>>>> the pure number of instructions is compared (although to have
> >>>>>>>>>>> robust code, with cpu and compiler independence, especially about
> >>>>>>>>>>> "optimizing code", some tests are needed to eradicate subnormals
> >>>>>>>>>>> due to poor support/trap emulation). Plus if instructions are mixed
> >>>>>>>>>>> in the right way both int and fpu units can be kept busy.
> >>>>>>>>>>> Exponent can be one short and problem solved. Only division can be
> >>>>>>>>>>> somewhat tricky (and slow), but it can be done too.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>> Even if the FP rounding error isn't the source of the problem, it
> >>>>>>>>>>>>> might be easiest to fix that and get it out of the way so we can
> >>>>>>>>>>>>> see what the actual problem is.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If you really want to know *why* the kernel is doing all this FP,
> >>>>>>>>>>>>> then yes, you probably need to look at the source code.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Steve
> >>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>> gem5-dev mailing list
> >>>>>>>>>>>>> [email protected]
> >>>>>>>>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
