On 22/01/2018 23:53, Gabe Black wrote:

On Mon, Jan 22, 2018 at 2:38 AM, Andreas Sandberg
<[email protected]> wrote:
Hi Gabe,


On 21/01/2018 06:34, Gabe Black wrote:
Hi folks. ExtMachInst is a type which is defined per ISA and is *almost*
never used outside of ISA-specific code. The three uses for it that I see
right now are as a type in the decode cache when decoding instructions, the
protobuf-based instruction tracer, and the MinorCPU.

The decode cache is tightly bound to the decode function, which is already
highly ISA dependent, so I don't foresee a big problem pushing those
together and handling them at the same time.

Sounds reasonable. Except for detailed arch-specific timing models and
debugging, there should be little need to expose the encoding in the
pipeline.

A simple solution would be to just use a 64-bit ExtMachInst
representation in the StaticInst interface and just stub it for x86.
That would retain existing functionality and wouldn't be particularly
intrusive.


That would be an improvement, but I really want to avoid having features which
are fundamentally incompatible with certain ISAs like that, as opposed to
features which just might not be implemented yet but could be with a
straightforward and reasonable amount of effort.




The protobuf tracer unfortunately stores a raw 32-bit integer value to
record what the instruction was, and this field is required. I'd like to
make this field optional, and to also add an optional byte array which can
be used to store instructions which don't fit neatly into a 32-bit integer.
This shouldn't affect the efficiency of the protobuf format as it's used
today; it would just enable its use in more places. Also, the protobuf
tracer would need to call, say, a virtual function to encode the
instruction in whatever way is appropriate. Alternatively, all instructions
could be encoded as an array of bytes, which would make that interface more
generic. It would be, I think, one byte less efficient storage-wise when
storing instructions that are consistently 32 bits, but would be more
efficient for 16-bit instructions and equivalent for ones that would be
forced to use an array of bytes anyway.
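A sketch of what the amended message might look like (field names, numbers, and the message shape are illustrative, not gem5's actual Inst definition):

```protobuf
message Inst {
  // Legacy field, made optional: holds fixed 32-bit encodings as today.
  optional uint32 inst = 2;
  // New optional field: raw encoding bytes for variable-length ISAs (x86)
  // or compressed 16-bit encodings (Thumb, RVC).
  optional bytes inst_bytes = 9;
}
```

Existing traces would keep using the uint32 field, so old readers stay compatible with traces produced for fixed-width ISAs.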

Another option would be to completely remove the encoding from the Inst
message and add architecture-specific Encoding messages (possibly
using PB's extensions mechanism to extend Inst). Unfortunately, this could
potentially make the code ugly if we want to keep PB as an
optional dependency.


Yeah, I was thinking of how to avoid making this a protobuf tracer specific 
aspect of each StaticInst flavor. My best idea so far is to add a function 
which generically serializes an instruction as a sequence of bytes, and then 
use that to fill in the protobuf. If there are exactly 4 bytes returned, then 
you can use the old format which would be slightly more efficient. If there are 
more or less, then you'd use the new byte array version. That could also 
theoretically be used for other things like serializing inflight instructions 
or something, although realistically it would just be to loosen the coupling to 
the protobuf tracer in the foreseeable future.

This sounds like a good option and could be used to drive the instruction 
matcher in Minor. I'm not sure how it would affect performance compared to 
using a fixed-size ExtMachInst, but it probably isn't that bad.
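The format-selection logic Gabe describes could look like the following sketch, which stands in for the real tracer code; the function name and the little-endian packing are assumptions for illustration, not gem5's actual implementation:

```python
# Hypothetical sketch: a generic byte-serialization hook on StaticInst
# drives the choice between the legacy fixed 32-bit protobuf field and a
# new variable-length byte-array field.

def encode_for_trace(inst_bytes: bytes):
    """Return ("legacy", uint32) for exactly-4-byte encodings, else ("bytes", raw)."""
    if len(inst_bytes) == 4:
        # Old format: pack into the existing 32-bit integer field
        # (slightly more compact than a length-prefixed byte array).
        return ("legacy", int.from_bytes(inst_bytes, "little"))
    # New format: store the raw bytes, e.g. a 2-byte Thumb encoding or a
    # longer variable-length x86 encoding.
    return ("bytes", inst_bytes)
```

Because the serializer returns plain bytes, the same hook could later feed other consumers, such as checkpointing in-flight instructions.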



The MinorCPU really should not be reaching into instructions and looking at
their bytes. The idea that there's necessarily some sort of mask and
compare that can be done to determine what type of operation an instruction
is is very fragile, not only because it assumes those operations make
sense, but also that the instructions actually have such a pattern. This is
evidenced by the fact that there's already an #ifdef around this bit of
code.

We actually make extensive use of this functionality. Take the HPI core
model in configs/common/cores/arm as an example. In order to get good
correlation against existing in-order Arm cores, we use the bit matching
functionality to route instructions to FUs and assign latencies to
individual instructions. This has allowed us to achieve much tighter
correlation against existing in-order cores than we could otherwise.

Latencies and FU routing have traditionally been done using instruction
classes in gem5. However, because different cores classify instructions
in slightly different ways, this would make the ISA code timing-model
specific. This is clearly undesirable.
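Conceptually, the bit-matching timing entries in the HPI config amount to a first-match lookup over (mask, match, latency) rules; the sketch below illustrates the mechanism with invented patterns and latencies, not HPI's real values:

```python
# Illustrative first-match classifier over (mask, match, latency) rules,
# mimicking how mask/match pairs route instructions to FUs and assign
# per-instruction latencies. All values below are made up.
RULES = [
    (0x0FE00070, 0x00600010, 4),  # hypothetical multiply-like pattern
    (0x0F000000, 0x0A000000, 1),  # hypothetical branch-like pattern
]
DEFAULT_LATENCY = 2

def latency_for(encoding: int) -> int:
    """First rule whose mask/match fits the raw encoding wins."""
    for mask, match, lat in RULES:
        if (encoding & mask) == match:
            return lat
    return DEFAULT_LATENCY
```

This is exactly the part that leaks raw encodings into the timing model: the rules only make sense for one ISA's instruction formats.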

Is the MinorCPU used? The easiest thing to do would be to nuke it, but if
anyone is using it or is particularly attached to it we should try to keep
it. Alternatively, the interesting bits could be returned by a virtual
function on the StaticInst object. This still assumes that there's some
sort of magical pattern to mask and compare against, but at least it will
compile for everything and hides the ExtMachInst type within the ISA.
Really, since we're already determining whether an instruction is a branch,
integer operation, etc., we should just use those flags and not try to
redecode the instruction in the CPU. The current approach also leaks the
instruction encoding into the configuration, which makes the configuration
more fragile and ISA specialized as well. Dealing with the MinorCPU is the
part of this I most want feedback on.

The Minor CPU is used a lot. It's the model you need to simulate a small
core in a big.LITTLE configuration. Unfortunately, we simply can't get
good enough correlation against existing designs if we use just the
instruction flags, so keeping the bit-matching magic is a hard
requirement from our side.


This is a problem. This sort of thing undermines ISA independence and having 
multiple ISAs in the same gem5 binary. I don't have any immediate suggestion, 
but finding a way to do this without exposing the actual binary instruction 
encodings will be necessary to get where we all want to go.

It isn't really undermining the idea of having multiple ISAs in the same
binary. The biggest "problem" is that you won't be able to use a detailed
timing model for one ISA with another ISA. That's probably fine, though. I'm
pretty sure it would be hard to make a timing model that is realistic for
both Arm and x86, given that we most likely classify instructions in
slightly different ways.

I think structurally what you're building is a function f(i, m), where i is
the instruction and m is the given CPU model, which outputs a latency c.
Right now you have that function implemented in Python and batched into
groups of instructions classified by masking and comparison. Of course, in
that case you're laying out the values c for given instructions, not
defining the whole function f(). The base problem is classifying
instructions into groups I where f(i, m) = c for all i in I. If the groups I
are the same no matter what m is, then you could apply the classification in
the decoder and use a function f(I, m) instead. That would break the
dependence on what the actual i is and avoid having to dynamically group
instructions after the fact, effectively in the Python config.
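A minimal sketch of that refactoring, assuming (as Gabe's argument requires) that the groups are model-independent; the class names and patterns are invented:

```python
# Sketch: move classification into the decoder so the timing model only
# ever sees a symbolic group I, never the raw encoding i.

def classify(encoding: int) -> str:
    """Decoder-side classification: i -> I, done once at decode time.
    The pattern here is a stand-in for real decode logic."""
    return "IntMul" if (encoding & 0xF0) == 0x90 else "IntAlu"

# Per-model latency tables: f(I, m) instead of f(i, m). Each model m only
# maps symbolic groups to latencies, with no ISA encodings in sight.
MODEL_LATENCIES = {
    "hpi_like": {"IntAlu": 1, "IntMul": 3},
    "wide_ooo": {"IntAlu": 1, "IntMul": 2},
}

def latency(encoding: int, model: str) -> int:
    return MODEL_LATENCIES[model][classify(encoding)]
```

The configuration then carries only the latency tables, which is what keeps the timing model ISA-independent.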

The problem is that different microarchitectures would need different
classifications (i.e., different uarchs route the same instruction
differently) and some instructions don't have fixed latencies (e.g.,
division). We could grow the number of instruction classes, but that
probably won't scale, since each new microarchitecture we want to model
would need new instruction classes. Similarly, each time we implement a new
divider (or some other clever gadget), we'd need to implement a custom
timing model in C++.

Alternatively, you could make your decoder programmable, where it would take
some sort of classifier which would apply groupings in the decoder itself.
This would also make the CPU model more efficient, since the instructions
aren't going to change groups, but they currently get reclassified every
time they execute.

That's definitely a possibility, but if we make that classifier a part of the 
C++ world, we effectively encode parts of our timing models in C++. That would 
be highly undesirable and would make it a lot harder to make and distribute new 
custom timing models. This sort of mechanism could work if we make instruction 
classification programmable from Python and add the ability to define custom 
instruction classes in Python (I'm not sure how different it would be from what 
we do currently though). It wouldn't solve the issue for variable latency 
instructions though.
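One way to sketch the Python-programmable variant, with a callable latency to cover variable-latency cases like division; all names, patterns, and the latency formula are hypothetical:

```python
# Hypothetical Python-side classifier: the config, not C++, defines both
# the instruction classes and their (possibly operand-dependent) latencies.

class InstClass:
    def __init__(self, name, matches, latency):
        self.name = name
        self.matches = matches  # predicate over the raw encoding
        self.latency = latency  # int, or callable(operands) -> int

def div_latency(operands):
    # Variable latency, e.g. an early-out divider whose cycle count
    # depends on operand magnitude (made-up formula).
    return 4 + max(operands).bit_length()

CLASSES = [
    InstClass("SDiv", lambda enc: (enc & 0xFF) == 0x1A, div_latency),
    InstClass("IntAlu", lambda enc: True, 1),  # catch-all class
]

def resolve_latency(encoding, operands):
    """First matching class wins; latency may be fixed or computed."""
    for c in CLASSES:
        if c.matches(encoding):
            lat = c.latency
            return lat(operands) if callable(lat) else lat
    raise ValueError("unclassified instruction")
```

Because the predicates and latency callables live in the Python config, a new timing model could be distributed without touching C++, which is the property Andreas is after; the open question is what this costs in simulation speed.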

//Andreas
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
