One other potential issue I just realized is where mispredicts get sent in, for instance, o3. If it's a PC mispredict, it should go to fetch. If it's a uPC mispredict and the PC isn't in decode, it should also go to fetch. If it's a uPC mispredict and the PC is in decode, you could argue either way, but it could go to decode. If it's a mispredict (or interrupt or fault) that goes to someplace in the ROM, fetch is irrelevant since the memory isn't going to be used, and you'd pay a latency penalty going through fetch to get a value you're going to ignore anyway. That's not a concern now, but it's something to think about for the distant future when x86 works with a model that can mispredict.
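To make the cases above concrete, here is a rough C++ sketch of the routing decision. Every name in it (RedirectKind, Stage, Redirect, redirectStage) is made up for illustration and doesn't exist in M5, and it assumes the microcode ROM is read from decode, which is just one possible arrangement. For the "argue either way" case it picks decode.

```cpp
// Hypothetical sketch only: none of these types exist in M5.
enum class RedirectKind { PC, MicroPC, RomEntry };
enum class Stage { Fetch, Decode };

struct Redirect {
    RedirectKind kind;
    // True if decode still holds the macroop for the redirect's PC.
    bool macroopInDecode;
};

// Pick the stage a mispredict (or interrupt/fault) redirect is sent to.
Stage redirectStage(const Redirect &r)
{
    switch (r.kind) {
      case RedirectKind::PC:
        // A new PC always needs instruction memory, so go to fetch.
        return Stage::Fetch;
      case RedirectKind::MicroPC:
        // Arguable when the macroop is still in decode; this sketch
        // chooses decode. Otherwise the macroop must be refetched.
        return r.macroopInDecode ? Stage::Decode : Stage::Fetch;
      case RedirectKind::RomEntry:
        // Assumption: the ROM is read in decode, so memory (and the
        // fetch latency) is skipped entirely.
        return Stage::Decode;
    }
    return Stage::Fetch; // Not reached; keeps compilers quiet.
}
```

A ROM redirect ignores macroopInDecode here, since nothing from instruction memory is needed either way.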
Gabe

Gabe Black wrote:
> I think we're talking about mostly the same thing. The ROM bit would be
> global, but in the same sense that the PC is global. It carries from uop
> to uop passively as they flow through until you hit a point where you're
> moving to a new macroop or into the ROM. It would be associated with a
> given uop which is already associated with a given PC and uPC, so if you
> had to go back to uop X which came from the ROM, it'd go to the right
> place. It'd be basically like a third, single bit PC. I'd like something
> conceptually similar to NPC to change it as well. Maybe there would be
> two bools, fromRom and nextFromRom? Those names aren't that great, but
> you get the idea.
>
> Gabe
>
> Steve Reinhardt wrote:
>
>> I'm a little confused... when you say microbranches are absolute, do
>> you mean the target is an absolute offset within the sequence of uops
>> generated by a macroinstruction?
>>
>> The sort of model that comes to mind based on your description is:
>>
>> - Use a bit somewhere *associated with the uop* that indicates whether
>> you're fetching from the ROM or not. Making this a bit in the PC
>> (whether it's a high-order bit or a low-order bit) isn't critical, but
>> it worked well for Alpha PALcode so I don't see why it's any worse of
>> an idea in this situation. I think the key is to make it per-uop and
>> not a global mode because otherwise as you mentioned in an earlier
>> email getting it fixed up right on misspeculations would be a pain.
>> Having it per-uop also lets you look at it at any stage of the
>> pipeline and still get the right answer regardless of what else is in
>> other stages of the pipe. Again, basically the same motivations for
>> Alpha encoding PAL mode in the low-order bit of the PC.
>>
>> - Have two flavors of microbranches: a relative microbranch (for which
>> a signed 8-bit offset probably is adequate) for branches within flows
>> (whether they're combinational decodes or from the ROM); and an
>> absolute microbranch-to-ROM that has a larger target address field
>> (probably big enough to go anywhere in the ROM) and that sets the "ROM
>> bit" for the target uop even if it wasn't previously set.
>>
>> Does that make sense?
>>
>> Steve
>>
>> On Tue, Sep 16, 2008 at 8:23 PM, Gabe Black <[EMAIL PROTECTED]
>> <mailto:[EMAIL PROTECTED]>> wrote:
>>
>>     I hadn't considered that the decode function could be a dominant
>>     factor in the decode overhead. How much time do you think we spend
>>     actually allocating a StaticInst itself? In any case, it won't be
>>     as bad as it could be and it should work to generate the ROM
>>     static insts every time. I had also considered non-static
>>     StaticInsts and added a DynamicInst like layer, but I decided
>>     against them for the same reasons I think you don't like them. It
>>     adds a lot of complexity and changes a lot of code for dubious
>>     benefit performance wise, at least possibly.
>>
>>     My comment about micropc relative branches also applies to
>>     absolute branches, which is what x86 actually uses right now, when
>>     branching between the combinational and ROM based microops.
>>     Basically, you have to jump over a large swath of the micropc
>>     space to get from wherever the combinational microops live to the
>>     right area of the ROM, and because of how the microbranch is
>>     implemented, it's limited to 8 bit immediates to store the offset.
>>     It forms the new micropc using a register and an immediate or two
>>     registers so you could technically put a larger value in a
>>     register, but that would be pretty clumsy for every instruction
>>     going to the ROM. Another option would be to make the microbranch
>>     -always- go to the ROM, but then all the macroops with branches
>>     would break.
>>     I'd like to be able to fix them gradually rather than take x86
>>     out of commission for a month. The 8 bit limit is an effect of how
>>     the microcode ISA from that patent is put together so I think we
>>     should keep it. Even if it's painful, it should give more
>>     realistic behavior. It seems like I'd probably actually have to
>>     change the microbranches to be relative instead of absolute (I
>>     went with absolute since it was easier to assemble) so that you
>>     can branch around in large addresses like you might find in a ROM
>>     without having to have a larger immediate there either.
>>     Fortunately, the branches are almost all targeted at symbolic
>>     labels that get munged with a python function exposed to the
>>     microcode listing (yeah, I'll document that at some point), so
>>     that shouldn't be -too- hard to change. The big exception that
>>     comes to mind is CPUID which computes a branch target to simulate
>>     a big case statement, sort of, but one instruction shouldn't be
>>     too hard to deal with.
>>
>>     I originally wanted to use a bit in the micropc, really an offset,
>>     to indicate ROM vs. combinational, but there are several problems.
>>     First, you have to introduce this magic flag, the bit in question,
>>     to cause the underlying mechanism to behave differently. You might
>>     say this isn't anything different than a memory mapped device, but
>>     that isn't entirely true. In this case, using the ROM cuts some
>>     steps off of the beginning of the fetch-decode process which may
>>     fail or not make sense, like microcode entering an interrupt
>>     handler. In that particular case, the entry point is in a table in
>>     memory, so the microcode needs to run to look up what the PC will
>>     be. The PC is undefined up to that point, so there can't be a
>>     fetch or decode of real life instruction memory.
>>     The front end can't even -try- to bring in a macroop to ignore,
>>     because there's no way to guarantee it won't fail and fault
>>     spuriously and short circuit your microcode. The bit would toggle
>>     all that on and off, and that seems a little too mysterious to me.
>>     I think it'd be easier and/or better to have a separate piece of
>>     state which you toggle explicitly which has all those effects and
>>     has a name which clearly indicates what it's doing. Also, one
>>     minor thing is that you have to constantly check that bit to see
>>     what you should be doing since the micropc is constantly changing.
>>     If you had a big event that caused the switch and set things up
>>     and then otherwise acted normally, you could just run assuming you
>>     were set up to do the right thing.
>>
>>     Gabe
>>     _______________________________________________
>>     m5-dev mailing list
>>     [email protected] <mailto:[email protected]>
>>     http://m5sim.org/mailman/listinfo/m5-dev
