Hello again. I've had a chance to talk with an expert, and I have an idea of how to approach this. It's going to require more flexibility than the ISA parser currently has, though, specifically in how the lists of source and destination registers are managed. It would also be nice to have a more integrated notion of composite operands, i.e. ones where some bits come from one place and some from another, and in the end it builds a single uint64_t, double-precision float, vector of uint32_ts, etc.

Rather than try to shoehorn this into a system that's already suffered enough of my abuse, i.e. the ISA description language, I'm going to attempt to build a parallel facility for defining instructions usable from inside the Python in "let" blocks. Basically it would be Python classes, functions, etc. (hopefully not that many) exported into the let-block context that would allow more direct interaction with the parser's guts and more control over how things are put together.
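To make the idea concrete, here is a purely illustrative sketch of what such a let-block facility might look like. None of these names (MicroOpDef, SrcReg, DestReg, Composite) exist in isa_parser today; they're hypothetical, and the point is only that the source/dest register lists become explicit per-instance data rather than something inferred from the code template.

```python
# Hypothetical let-block API sketch -- not existing isa_parser classes.
class SrcReg:
    def __init__(self, name, bits=None):
        # bits: optional (hi, lo) slice, for composite operands built
        # from pieces of several registers.
        self.name, self.bits = name, bits

class DestReg(SrcReg):
    pass

class Composite:
    """An operand assembled from bit slices of several registers."""
    def __init__(self, ctype, *pieces):
        self.ctype, self.pieces = ctype, list(pieces)

class MicroOpDef:
    def __init__(self, mnemonic, code, srcs=(), dests=()):
        # Explicit, per-instance register lists: two variants of the
        # same microop can declare different sources and destinations,
        # instead of every variant pulling in every register its code
        # template mentions.
        self.mnemonic, self.code = mnemonic, code
        self.srcs, self.dests = list(srcs), list(dests)

# E.g. an add-with-carry that reads only the carry bit of the condition
# code register and writes only Z and C (bit positions invented here):
adc = MicroOpDef(
    "adc", "Dest = Src1 + Src2 + cin;",
    srcs=[SrcReg("Src1"), SrcReg("Src2"),
          Composite("uint64_t", SrcReg("ccr", bits=(0, 0)))],
    dests=[DestReg("Dest"), DestReg("ccr", bits=(6, 6))])
```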

In the future I'd like to see this bud into isa_parser2.py, but that's going to be a lot of work and is a somewhat orthogonal issue. Ideally this sort of thing will also make it easier to split output into smaller files.

Gabe

Quoting Gabe Black <gbl...@eecs.umich.edu>:

I'm looking at why x86 goes so much slower than Alpha on O3 (4x the
ticks), and I think one culprit is the dependencies set up by the condition
code bits of the flags register. Many instructions in x86 modify or
depend on those bits, and even though the condition codes are separated
out from the flags register (which does a lot of other stuff too),
they're being updated with a read-modify-write sort of mechanism. I
expect that's setting up long chains of serializing dependencies which
is killing parallelism and performance.
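The serialization effect can be shown with a toy dependency model (this is not gem5 code, just an illustration). Under register renaming only true (read-after-write) dependencies matter, but a read-modify-write of a monolithic condition-code register makes every flag-setting op a true dependent of the previous one:

```python
# Toy model: each op is (reads, writes), unit latency. An op starts
# when the latest value of everything it reads is available; renaming
# is assumed, so only read-after-write dependencies serialize.
def critical_path(ops):
    ready = {}  # register -> cycle its latest value becomes available
    finish = 0
    for reads, writes in ops:
        start = max((ready.get(r, 0) for r in reads), default=0)
        done = start + 1
        for w in writes:
            ready[w] = done
        finish = max(finish, done)
    return finish

# Eight flag-setting ops. With a monolithic CCR, each op must read the
# old flags to preserve the bits it doesn't change, so they serialize:
rmw = [({"ccr"}, {"ccr"})] * 8
# If each op instead writes its flags outright (no old value needed),
# renaming lets them all issue independently:
split = [(set(), {"z", "aps"})] * 8
print(critical_path(rmw), critical_path(split))  # -> 8 1
```

The gap between those two critical-path lengths is, in miniature, the parallelism being lost.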

Basically, there are 6 condition codes in x86: Z, C, A, S, P, and O, or zero,
carry, auxiliary carry, sign, parity, and overflow. In M5's
implementation (and in the patent I patterned it after) there are also
artificial "emulation" zero and carry flags that work like the regular
ones but are maintained separately. They can be updated independently
and checked separately, and are useful behind the scenes when
implementing some macroops. Instructions may update all of these flags
or only some of them. The PTLSim manual claims that there's a "ZAPS"
rule where the zero, auxiliary carry, parity and sign bits are always
updated together. That's usually true, but certain instructions change
only the zero flag. CMPXCHG8B is an example.

What I'd been thinking of doing to handle this is to further split up
the condition code bits into separate registers to be managed
independently for any register renaming. There are a couple of issues
with that, though. First, it looks like there'd have to be 6 different
registers, APS, Z, O, C, EZ, and EC. A non-trivial number of
instructions would need to update 4 or more of those, putting a perhaps
unrealistic burden on any rename mechanism. That would also make the
simple CPUs slower because they'd have to read/write all those extra
registers. Bread-and-butter x86 tends to be condition code happy, so
that could be a significant slowdown.

Also, that complicates decoding significantly. Conceptually it's easy to
imagine reading/writing the registers with the bits you need, but with
the ISA parser, the code needs to either be there or not be there. If
you have code that's never used but accesses a register, it'll still get
pulled in as a source or dest. That means there would need to be a hard
coded version of every microop that would correspond to each possible
combination of condition code bits. Since there are 6 bits, that's 2^6,
plus 2 variants for partial or complete register writes, so 2^7 or 128
versions of every microop. There are also register/immediate versions of
many microops. We would likely end up with thousands of microop classes.
We'd also need to generate selection functions that would pick which
variant to use. This is all possible, but fairly ugly and clunky.
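The combinatorics above can be checked directly. Counting one variant per subset of the 6 condition-code registers (APS, Z, O, C, EZ, EC), times 2 for partial vs. complete writes; the microop count at the end is illustrative, since the email doesn't give an exact number of flag-touching microops:

```python
# Variant count for hard-coded microop versions: one per subset of the
# 6 condition-code registers, times 2 for partial vs. complete writes.
from itertools import combinations

cc_regs = ["APS", "Z", "O", "C", "EZ", "EC"]
subsets = sum(1 for k in range(len(cc_regs) + 1)
              for _ in combinations(cc_regs, k))
variants = subsets * 2
print(subsets, variants)  # -> 64 128

# With, say, 40 flag-touching microops (an assumed figure), each in
# register and immediate forms, the class count explodes:
print(variants * 40 * 2)  # -> 10240
```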

So does anybody have any suggestions on how to unserialize these
microops? I found a paper here:
http://www.wseas.us/e-library/conferences/2006elounda1/papers/537-325.pdf
that claims IPC for x86 CPUs is significantly worse than other ISAs
specifically because of this sort of thing. Is this just a fact of life
with x86? Would fixing it be not only very annoying but also
unrealistic? Is that paper's claim actually true?

Gabe
_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev
