On Jun 6, 2010, at 4:53 PM, Steve Reinhardt wrote: > I've only thought about this briefly, but here are a few quick reactions: > > - PowerPC has updating ld/st instructions too. How are these handled? > Whatever we do, we should do the same thing for both. Tim, care to comment?
> - How would a real pipeline handle this, particularly for loads? In > that case you have two register updates from one instruction, and > assuming the datapath is designed for the typical case of one register > update per instruction, you may well end up doing something that looks > like a poor man's micro-ops anyway (e.g., it seems likely that it > would occupy two slots in the writeback stage). And if you want to > decouple those two writebacks as much as possible in O3, it's not > clear how you'll do that w/o either creating two micro-ops or > significantly revamping the cpu/inst interface. So it's not obvious > to me that that solution is overkill. It seems to me like the first register write could happen after the address generation completed successfully. The thing is we don't support that model in the timing simple cpu, although we do support it in the o3 cpu because of the speculative state. I'm fine changing those instructions to be macroops, but I still find it non-intuitive that code that works in the atomic case and works in the o3 case breaks with the timing simple cpu (admittedly the atomic cpu and o3 use different methods to execute). This bothers me on two levels: 1) that it's possible to create such a situation; 2) our abstractions seem to break down in this case. It seems like we either need a better abstraction, or should modify the isa parser to through an error if a register write happens in initiateAcc(). Ali > On Sun, Jun 6, 2010 at 8:10 AM, Ali Saidi <[email protected]> wrote: >> The problem we are having is something like this. With ARM there are some >> incrementing loads/stores that do the following: >> addr_tmp = Rm + Rn >> Rt = MemoryLoad(addr_tmp) >> Rn = addr_tmp >> >> In the case of the atomic simple cpu this works fine because the Rn = addr >> is protected by a if (fault == NoFault) from the MemoryLoad. >> >> For the timing CPUs, all of the above code is in the initateAcc(). With the >> new delayed translation code MemoryLoad() will always return NoFault so the >> Rn = addr code will always be executed. With the O3 this appears like it >> will work since upon taking a fault on the instruction, none of updated >> state will be saved, however with the timing simple cpu there isn't any >> duplicate state, so the Rn = addr is executed and later when the timing >> translation completes and takes a fault Rn is already updated and >> after the fault is handled and the instruction is re-executed it doesn't >> work so well. >> >> There seem to be three solutions. >> 1) Break the instruction into micro-ops, one that does the load and the >> other that does the writing of Rn. This is Ok, but it really seems like >> overkill. >> 2) Move the Rn = addr part to completeAcc(). This isn't at all ideal because >> while it would solve the problem for the timing simple cpu, it would make a >> tight loop without and loop carried dependencies only able to issue one >> outstanding load at a time with the o3 cpu. >> 3) Some how restructure the timing simple cpu code to do something more like >> the way the out-of-order cpu handles these issues, although I don't know >> about the implication of keeping around temporary state. Something along >> these lines seems ideal, since I don't think we really want completely >> different methods to handle these issues in the different CPU classes. >> 3a) One way to handle this might be instead of having the translation return >> NoFault when the transaction is delayed, return ReplayFault or something >> similar that would re-execute initateAcc(). This time the translation should >> be cached and thus immediately available so it can return the proper fault >> or NoFault. >> >> Anyone have some comments? >> >> Thanks, >> Ali >> >> >> On Jun 5, 2010, at 8:10 PM, Steve Reinhardt wrote: >> >>> I'd have to look more closely to decide... I like having the common >>> parts of simple-atomic and simple-timing factored out into the base >>> class, so I'd hate to just completely throw that away unless it's >>> really justified, but it's quite possible that the partition/interface >>> between the base class and the derived classes needs to be rethought. >>> >>> Steve >>> >>> On Sat, Jun 5, 2010 at 5:18 PM, Gabriel Michael Black >>> <[email protected]> wrote: >>>> Maybe it should be separated from the simple atomic CPU again? That would >>>> have made sense back when they were mostly the same, but they might be >>>> different enough now that it just adds to the confusion. I'm not advocating >>>> that necessarily, but it sounds like it could help. >>>> >>>> Gabe >>>> >>>> Quoting Steve Reinhardt <[email protected]>: >>>> >>>>> It could use some cleanup. The idea has always been that it's the >>>>> simplest CPU that can drive the timing memory system, so there's no >>>>> guarantee it will be simple in an absolute sense. But I'm sure it >>>>> could be simpler if we worked at it. I think the combination of >>>>> dealing with unaligned accesses and split-phase translations has >>>>> definitely hit it hard. >>>>> >>>>> Steve >>>>> >>>>> On Sat, Jun 5, 2010 at 1:45 PM, Ali Saidi <[email protected]> wrote: >>>>>> >>>>>> My comment was more that we might want to re-evaluate the timing simple >>>>>> cpu as it's left the simple bit a while ago. Many of the fault == NoFault >>>>>> checks don't matter with the translation state change since it's always >>>>>> going to return no fault. >>>>>> >>>>>> >>>>>> I was hoping to get a discussion going about what if anything we should >>>>>> do. >>>>>> >>>>>> Ali >>>>>> >>>>>> On Jun 4, 2010, at 5:29 PM, nathan binkert wrote: >>>>>> >>>>>>>> On a somewhat related note, it seems like the timing simple cpu control >>>>>>>> flow has gotten pretty messy with the addition of the translation state >>>>>>>> code. To figure out what is going on you really need a large white >>>>>>>> board >>>>>>>> and a lot of time to map out the call tree and what happens on faults >>>>>>>> and >>>>>>>> the like. >>>>>>> >>>>>>> Ali, >>>>>>> >>>>>>> If it is correct, can you commit the patch? If you're not the right >>>>>>> person, Gabe? >>>>>>> >>>>>>> Nate >>>>>>> _______________________________________________ >>>>>>> m5-dev mailing list >>>>>>> [email protected] >>>>>>> http://m5sim.org/mailman/listinfo/m5-dev >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> m5-dev mailing list >>>>>> [email protected] >>>>>> http://m5sim.org/mailman/listinfo/m5-dev >>>>>> >>>>> _______________________________________________ >>>>> m5-dev mailing list >>>>> [email protected] >>>>> http://m5sim.org/mailman/listinfo/m5-dev >>>>> >>>> >>>> >>>> _______________________________________________ >>>> m5-dev mailing list >>>> [email protected] >>>> http://m5sim.org/mailman/listinfo/m5-dev >>>> >>> _______________________________________________ >>> m5-dev mailing list >>> [email protected] >>> http://m5sim.org/mailman/listinfo/m5-dev >>> >> >> _______________________________________________ >> m5-dev mailing list >> [email protected] >> http://m5sim.org/mailman/listinfo/m5-dev >> > _______________________________________________ > m5-dev mailing list > [email protected] > http://m5sim.org/mailman/listinfo/m5-dev > _______________________________________________ m5-dev mailing list [email protected] http://m5sim.org/mailman/listinfo/m5-dev
