On Mon, Jan 22, 2018 at 2:38 AM, Andreas Sandberg <[email protected]> wrote:
> Hi Gabe,
>
> On 21/01/2018 06:34, Gabe Black wrote:
>> Hi folks. ExtMachInst is a type which is defined per ISA, and is
>> *almost* not used outside of ISA-specific code. The three uses for it
>> that I see right now are as a type in the decode cache when decoding
>> instructions, the protobuf-based instruction tracer, and the MinorCPU.
>>
>> The decode cache is tightly bound to the decode function, which is
>> already highly ISA dependent, so I don't foresee a big problem pushing
>> those together and handling them at the same time.
>
> Sounds reasonable. Except for detailed arch-specific timing models and
> debugging, there should be little need to expose the encoding in the
> pipeline.
>
> A simple solution would be to just use a 64-bit ExtMachInst
> representation in the StaticInst interface and just stub it for x86.
> That would retain existing functionality and wouldn't be particularly
> intrusive.

That would be an improvement, but I really want to avoid having features
which are fundamentally incompatible with certain ISAs like that, as
opposed to features which just might not be implemented yet but which
could be, with a straightforward and reasonable amount of effort.

>> The protobuf tracer unfortunately stores a raw 32-bit integer value to
>> record what the instruction was, and this field is required. I'd like
>> to make this field optional, and to also add an optional byte array
>> which can be used to store instructions which don't fit neatly into a
>> 32-bit integer. This shouldn't affect the efficiency of the protobuf
>> format as it's used today; it would just enable its use in more
>> places. Also, the protobuf tracer would need to call, say, a virtual
>> function to encode the instruction in whatever way is appropriate.
>> Alternatively, all instructions could be encoded as an array of bytes,
>> which would make that interface more generic.
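(As an aside, the byte-array interface could look something like the
sketch below. This is just to illustrate the shape of the idea, not a
proposed patch; all the names here are hypothetical stand-ins, not
gem5's actual API.)

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: a generic, ISA-agnostic way for an instruction
// object to expose its encoding as bytes, instead of leaking the
// ISA-specific ExtMachInst type into shared code.
class StaticInstSketch
{
  public:
    virtual ~StaticInstSketch() {}
    // Return the instruction's encoding as raw bytes. A fixed-width ISA
    // returns 4 (or 2) bytes; x86 can return however many it needs.
    virtual std::vector<uint8_t> asBytes() const = 0;
};

// A fixed-width 32-bit ISA would implement it trivially.
class Fixed32Inst : public StaticInstSketch
{
    uint32_t machInst;
  public:
    explicit Fixed32Inst(uint32_t mi) : machInst(mi) {}
    std::vector<uint8_t>
    asBytes() const override
    {
        std::vector<uint8_t> bytes(sizeof(machInst));
        std::memcpy(bytes.data(), &machInst, sizeof(machInst));
        return bytes;
    }
};
```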
>> It would be, I think, one byte less efficient storage-wise when
>> storing instructions that are consistently 32 bits, but would be more
>> efficient for 16-bit instructions and equivalent for ones that would
>> be forced to use an array of bytes anyway.
>
> Another option would be to completely remove the encoding from the Inst
> message and add architecture-specific Encoding messages (possibly using
> PB's extensions mechanism to extend Inst). Unfortunately, this could
> potentially make the code ugly if we want to maintain PB as an optional
> dependency.

Yeah, I was thinking of how to avoid making this a protobuf-tracer-specific
aspect of each StaticInst flavor. My best idea so far is to add a function
which generically serializes an instruction as a sequence of bytes, and
then use that to fill in the protobuf. If there are exactly 4 bytes
returned, then you can use the old format, which would be slightly more
efficient. If there are more or fewer, then you'd use the new byte array
version. That could also theoretically be used for other things like
serializing in-flight instructions, although realistically it would just
be to loosen the coupling to the protobuf tracer in the foreseeable
future.

>> The MinorCPU really should not be reaching into instructions and
>> looking at their bytes. The idea that there's necessarily some sort of
>> mask and compare that can be done to determine what type of operation
>> an instruction is is very fragile, not only because it assumes those
>> operations make sense, but also that the instructions actually have
>> such a pattern. This is evidenced by the fact that there's already an
>> #ifdef around this bit of code.
>
> We actually make extensive use of this functionality. Take the HPI core
> model in configs/common/cores/arm as an example.
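(To make the 4-byte heuristic above concrete, the tracer-side logic
could look roughly like this. The record type and field names below are
hypothetical placeholders standing in for the protobuf Inst message, not
the real schema.)

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record standing in for the protobuf Inst message: the
// old required 32-bit field becomes optional, and an optional byte
// array is added for everything else.
struct InstRecord
{
    bool hasInst = false;            // optional uint32 inst
    uint32_t inst = 0;
    std::vector<uint8_t> instBytes;  // optional bytes inst_bytes
};

// Fill the record from a generic byte serialization of the instruction.
// Exactly 4 bytes -> use the old, slightly more compact field; anything
// else -> use the byte array.
InstRecord
packInst(const std::vector<uint8_t> &bytes)
{
    InstRecord rec;
    if (bytes.size() == 4) {
        rec.hasInst = true;
        rec.inst = (uint32_t)bytes[0] | ((uint32_t)bytes[1] << 8) |
                   ((uint32_t)bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
    } else {
        rec.instBytes = bytes;
    }
    return rec;
}
```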
> In order to get good correlation against existing in-order Arm cores,
> we use the bit matching functionality to route instructions to FUs and
> assign latencies to individual instructions. This has allowed us to
> achieve much tighter correlation against existing in-order cores than
> what we could achieve otherwise.
>
> Latencies and FU routing have traditionally been done using instruction
> classes in gem5. However, because different cores classify instructions
> in slightly different ways, this would make the ISA code specific to a
> particular timing model. This is clearly undesirable.

>> Is the MinorCPU used? The easiest thing to do would be to nuke it, but
>> if anyone is using it or is particularly attached to it we should try
>> to keep it. Alternatively, the interesting bits could be returned by a
>> virtual function on the StaticInst object. This still assumes that
>> there's some sort of magical pattern to mask and compare against, but
>> at least it will compile for everything and hides the ExtMachInst type
>> within the ISA. Really, since we're already determining whether an
>> instruction is a branch, integer operation, etc., we should just use
>> those flags and not try to redecode the instruction in the CPU. That
>> also leaks instruction encoding into the configuration, which makes it
>> more fragile and ISA-specialized as well. Dealing with the MinorCPU is
>> the part of this I most want feedback on.
>
> The minor CPU is used a lot. It's the model you need to simulate a
> small core in a big.LITTLE configuration. Unfortunately, we simply
> can't get good enough correlation against existing designs if we use
> just the instruction flags, so keeping the bit matching magic is a hard
> requirement from our side.

This is a problem. This sort of thing undermines ISA independence and
having multiple ISAs in the same gem5 binary.
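(For anyone following along who hasn't looked at this code: the bit
matching in question boils down to a per-rule (mask, match) pair applied
to the raw encoding. Here is a stripped-down sketch of the mechanism,
loosely inspired by the timing configuration, with made-up names; it is
not the actual MinorCPU code.)

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One timing rule: if (encoding & mask) == match, the instruction is
// routed to this FU with this latency.
struct TimingRule
{
    uint64_t mask;
    uint64_t match;
    int fu;
    int latency;
};

// Return the first rule whose mask/match pair hits the encoding, or a
// default rule if nothing matches.
TimingRule
lookupTiming(const std::vector<TimingRule> &rules, uint64_t encoding)
{
    for (const auto &r : rules) {
        if ((encoding & r.mask) == r.match)
            return r;
    }
    return TimingRule{0, 0, /*fu=*/0, /*latency=*/1};
}
```

This is exactly what makes the scheme fragile: the rules are written
against one ISA's bit patterns, so they can't be shared across ISAs.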
I don't have any immediate suggestion, but finding a way to do this
without exposing the actual binary instruction encodings will be
necessary to get where we all want to go.

I think structurally what you're building is a function f(i, m), where i
is what the instruction is and m is the given CPU model, and which
outputs a latency. Right now you have that function implemented in
Python, batched into groups of instructions classified by masking and
comparison. Of course, in that case you're laying out the values for a
given m, not defining the whole function f(). The base problem is
classifying instructions into groups I where f(i, m) = c for all i in I.
If the I-s are the same no matter what m is, then you could apply the
classification in the decoder and use a function f(I, m) instead. That
would break the dependence on what the actual i is and avoid having to
dynamically group instructions after the fact, effectively in the Python
config. Alternatively, you could make your decoder programmable, where
it would take some sort of classifier which would apply the groupings in
the decoder itself. This would also make the CPU model more efficient,
since the instructions aren't going to change groups, but today they get
reclassified every time they execute.

This sort of thing is always tricky because ideally we'd have something
like g(i) and/or h(m) which would be independent and simpler to
implement, but to get really high fidelity you need something which is
quadratically bigger and more complicated. Hopefully, by tucking the
complexity away in the right places, we can minimize the damage and
still get the accuracy you want.

> Cheers,
> Andreas
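P.S. To make the f(I, m) factoring above concrete, here is a minimal
sketch (all names hypothetical): the decoder assigns each instruction a
class I once, and each CPU model m carries its own table from class to
latency, so the model never has to look at the encoding.

```cpp
#include <cassert>
#include <map>

// Instruction classes I, assigned once in the (ISA-specific) decoder.
enum InstClass { IntAlu, IntMul, MemRead, MemWrite };

// Each CPU model m supplies its own f(I, m): a table from class to
// latency. Different models can disagree about latencies without the
// ISA code knowing anything about any particular timing model.
using LatencyTable = std::map<InstClass, int>;

int
latency(const LatencyTable &model, InstClass cls)
{
    auto it = model.find(cls);
    return it == model.end() ? 1 : it->second;  // default latency 1
}
```

The open question is whether one set of classes is fine-grained enough
for every model; if not, the "programmable decoder" variant would let
each model install its own classifier instead.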
> _______________________________________________
> gem5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/gem5-dev
