On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <[email protected]> wrote:

> Ah, ok, I was just being dumb. All the stdf-s and lddf-s are just moving
> memory around, I think. That way you can load/store 64 bits at a time
> and get it done with fewer instructions. I think those instructions
> themselves can be ignored.


If you mean that the actual problem originates in an FP operation and not in
the stdf/lddf itself, then yes, it looks like you're right.  Note that in the
detailed tracediff below, the original divergence is on the result of an
fsubd.  I think quite a few FP ops give slightly different results before any
difference shows up in the exec trace, and the reason appears to be that the
data field in the exec trace output is broken for FP ops... maybe we're only
properly reading one register of the register pair?  So I think the error
first shows up at a stdf in the exec trace only because that's the first
instruction whose trace output isn't broken.
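
To make the register-pair guess concrete, here is a rough Python sketch of
what it would look like if a tracer read only one 32-bit half of a double.
On SPARC the even-numbered register of a pair holds the high-order word.
The function names and the sample value -2.0 are made up for illustration;
this is not gem5's tracing code, and it doesn't prove that this is what the
tracer is actually doing.

```python
import struct

def double_to_words(value):
    """Split an IEEE 754 double into its (high, low) 32-bit words."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return bits >> 32, bits & 0xFFFFFFFF

def words_to_double(high, low):
    """Rebuild a double from the two 32-bit halves of a register pair."""
    return struct.unpack(">d", struct.pack(">II", high, low))[0]

def trace_data_full(high, low):
    """What a correct tracer would print: all 64 bits of the pair."""
    return "D=0x%016x" % ((high << 32) | low)

def trace_data_one_half(high, low):
    """Hypothetical bug: only one 32-bit register is read, zero-extended."""
    return "D=0x%016x" % high

# -2.0 is an arbitrary example value, not taken from the trace.
high, low = double_to_words(-2.0)
print(trace_data_full(high, low))      # D=0xc000000000000000
print(trace_data_one_half(high, low))  # D=0x00000000c0000000
```

A data field with the upper 32 bits all zero is at least consistent with a
half-read like this, though a stale or uninitialized value would look the
same.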

I created a /tmp/sparc-error directory on zizzer, moved the original
tracediff in there, and also copied two new files: pre-error-trace.out and
detailed-tracediff.out.  Hope the names are self-explanatory. Now you have
access to all the traces I generated.



> I'm also surprised that there would be much
> floating point.
>

Yeah, and it's really weird stuff too... almost like they're running tests on
the FPU or something:

931697674: system.cpu T0 : 0xff1aa4b0    :      faddd   %f21,%f20,%f20    : FloatAdd :  D=0x00000000c0000000
931697675: system.cpu T0 : 0xff1aa4b4    :      fsubd   %f17,%f16,%f28    : FloatAdd :  D=0x00000000c0000000
931697676: system.cpu T0 : 0xff1aa4b8    :      faddd   %f19,%f18,%f4     : FloatAdd :  D=0x00000000c0000000
931697677: system.cpu T0 : 0xff1aa4bc    :      fsubd   %f3,%f2,%f0       : FloatAdd :  D=0x00000000c0000000
931697678: system.cpu T0 : 0xff1aa4c0    :      faddd   %f7,%f6,%f14      : FloatAdd :  D=0x00000000c0000000
931697679: system.cpu T0 : 0xff1aa4c4    :      fsubd   %f5,%f4,%f30      : FloatAdd :  D=0x00000000c0000000
931697680: system.cpu T0 : 0xff1aa4c8    :      faddd   %f11,%f10,%f6     : FloatAdd :  D=0x00000000c0000000
931697681: system.cpu T0 : 0xff1aa4cc    :      fcmpd   %f21,%f20,%fsr    : FloatAdd :  D=0x00000000c0000000
931697682: system.cpu T0 : 0xff1aa4d0    :      faddd   %f7,%f6,%f18      : FloatAdd :  D=0x00000000c0000000

Note also how the data field in the trace output is always the same, even
though the detailed tracediff shows that these instructions aren't always
producing the same values.
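
A quick way to check this over a whole trace file is to collect the distinct
D= values per FP mnemonic. The sketch below is only that: the regex is
guessed from the handful of lines pasted above (it tolerates the record
being wrapped onto a second line), and the function name is made up.

```python
import re
from collections import defaultdict

# Mnemonic starting with "f", its operands, then the D= field, which may
# sit on a wrapped continuation line (hence DOTALL).
RECORD = re.compile(r":\s+(f[a-z]+)\s+%.*?D=(0x[0-9a-fA-F]+)", re.DOTALL)

def data_values_by_opcode(trace_text):
    """Map each FP mnemonic to the set of D= values seen for it."""
    values = defaultdict(set)
    for mnemonic, data in RECORD.findall(trace_text):
        values[mnemonic].add(data.lower())
    return dict(values)

sample = """\
931697674: system.cpu T0 : 0xff1aa4b0    :      faddd   %f21,%f20,%f20    :
FloatAdd :  D=0x00000000c0000000
931697675: system.cpu T0 : 0xff1aa4b4    :      fsubd   %f17,%f16,%f28    :
FloatAdd :  D=0x00000000c0000000
"""
print(data_values_by_opcode(sample))
```

If every mnemonic maps to a single value over thousands of instructions, the
field is almost certainly not reflecting the real results.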

>
> I'm currently building binutils for SPARC, so hopefully I can
> disassemble some things and get a better idea of what's going on. It's
> probably going to be really annoying to figure it out.


If it's really just an FP rounding error, it might not be that hard... just
take the examples from the trace where it goes wrong, work out what the
right answer is, and focus on those few instructions.  FP is pretty
thoroughly specified by IEEE 754, so if it's not an outright compiler bug,
maybe it's just some change in the default rounding settings or something.
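
For working out "the right answer", something along these lines might do as
a starting point. It assumes the host's Python float is an IEEE 754 double
with the default round-to-nearest-even mode, and that fsubd computes
rs1 - rs2. The helper names and the operand bit patterns are made up for
illustration; they are not values from the trace.

```python
import struct

def bits_to_double(bits):
    """Interpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def double_to_bits(value):
    """Return the 64-bit pattern of an IEEE 754 double."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]

def expected_fsubd(rs1_bits, rs2_bits):
    """Reference rs1 - rs2 using the host's round-to-nearest-even."""
    return double_to_bits(bits_to_double(rs1_bits) - bits_to_double(rs2_bits))

def ulp_distance(a_bits, b_bits):
    """Steps between two finite doubles of the same sign."""
    return abs(a_bits - b_bits)

# Hypothetical operands: 1.0 and 2**-54.  The exact difference is a tie
# between two doubles, so the rounding mode decides the last bit.
rs1, rs2 = 0x3FF0000000000000, 0x3C90000000000000
reference = expected_fsubd(rs1, rs2)   # rounds to 1.0 under nearest-even
simulated = 0x3FEFFFFFFFFFFFFF         # hypothetical round-toward-zero result
print(hex(reference), ulp_distance(reference, simulated))
```

A mismatch of exactly one ULP on cases like this would point at the rounding
mode; anything larger suggests a different kind of bug.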

Even if the FP rounding error isn't the source of the problem, it might be
easiest to fix that and get it out of the way so we can see what the actual
problem is.

If you really want to know *why* the kernel is doing all this FP, then yes,
you probably need to look at the source code.

Steve
