-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/3029/#review7219
-----------------------------------------------------------

src/cpu/trace/trace_cpu.cc (lines 741 - 795)
<http://reviews.gem5.org/r/3029/#comment6140>

Would it make sense to use either a heap or an ordered map? Assume that you
only do insertions and deletions and that the trace is long running, so that
on average there are n nodes in the list and you make m insertions and m
deletions over the trace length, with m >> n. Then the list would require
O(mn) time for execution, while a heap or ordered map would require
O(m log n) time. Had you been using a vector instead of a list, cache hits
due to contiguity could help, but both a list and a map/heap would typically
miss in the cache, so I don't see why we should be using a list.


src/cpu/trace/trace_cpu.cc (line 842)
<http://reviews.gem5.org/r/3029/#comment6139>

Strange name.


src/cpu/trace/trace_cpu.cc (lines 873 - 883)
<http://reviews.gem5.org/r/3029/#comment6138>

Why call a function for stores and handle loads in place when both need the
same amount of processing?


src/cpu/trace/trace_cpu.cc (lines 1022 - 1024)
<http://reviews.gem5.org/r/3029/#comment6137>

I read the code on how the curr and next elements are being set and used. If
my understanding is correct, I think we need only one of the variables. The
only line which makes use of nextElement is line 1023. Instead, we can read
the trace into currElement all the time. Record the tick value before calling
nextExecute() on line 1022 and use the recorded tick value in line 1023. This
will avoid all the copying from next to curr.


src/cpu/trace/trace_cpu.cc (lines 1120 - 1123)
<http://reviews.gem5.org/r/3029/#comment6136>

I would really prefer it if we had braces here. Had it been just the line
retryPkt = pkt, I would not have asked for braces.


src/cpu/trace/trace_cpu.cc (lines 1304 - 1337)
<http://reviews.gem5.org/r/3029/#comment6135>

I think the two functions should behave in the same way. That is, both of
them should return false in case no dependency was found. The function
removeDepOnInst() should check the return value from removeRegDep().

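
To make the first comment (lines 741 - 795) concrete, here is a minimal sketch
of a ready list kept in a std::multimap keyed by execute tick. The names
ReadyNode, ReadyQueue, seqNum and execTick are hypothetical and are not taken
from the patch. The point is only that an ordered map inserts in O(log n),
whereas a sorted list needs an O(n) walk to find the insertion point.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-ins for the patch's types; not taken from trace_cpu.cc.
using Tick = std::uint64_t;

struct ReadyNode
{
    std::uint64_t seqNum;  // node identifier
    Tick execTick;         // tick at which the node becomes ready
};

class ReadyQueue
{
  public:
    // O(log n) insertion. Since C++11, elements with equal keys are
    // inserted after existing ones, so equal ticks keep insertion order.
    void insert(const ReadyNode &node)
    {
        nodes.emplace(node.execTick, node);
    }

    bool empty() const { return nodes.empty(); }

    // Earliest-ready node; call only when the queue is not empty.
    const ReadyNode &front() const { return nodes.begin()->second; }

    // Remove the earliest-ready node.
    void popFront() { nodes.erase(nodes.begin()); }

  private:
    std::multimap<Tick, ReadyNode> nodes;
};
```

If nodes also have to be removed from the middle by identity rather than from
the front, a second index from node id to map iterator would be needed. That
is left out of the sketch.
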
- Nilay Vaish


On Aug. 11, 2015, 9:05 p.m., Curtis Dunham wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/3029/
> -----------------------------------------------------------
>
> (Updated Aug. 11, 2015, 9:05 p.m.)
>
>
> Review request for Default.
>
>
> Repository: gem5
>
>
> Description
> -------
>
> This patch defines a TraceCPU that replays traces generated using the elastic
> trace probe attached to the O3 CPU model. The elastic trace is an execution
> trace with data dependencies and ordering dependencies annotated to it. It
> also replays the fixed-timestamp instruction fetch trace that is also
> generated by the elastic trace probe.
>
> The TraceCPU inherits from BaseCPU, as a result of which some methods need
> to be defined. It has two port subclasses inherited from MasterPort for
> instruction and data ports. It issues the memory requests deducing the
> timing from the trace and without performing real execution of micro-ops.
> As soon as the last dependency for an instruction is complete,
> its computational delay, also provided in the input trace, is added. The
> dependency-free nodes are maintained in a list, called 'ReadyList',
> ordered by ready time. Instructions which depend on a load stall until the
> responses for read requests are received, thus achieving elastic replay. If
> the dependency is not found when adding a new node, it is assumed complete.
> Thus, if this node is found to be completely dependency-free, its issue time
> is calculated and it is added to the ready list immediately. This is
> encapsulated in the subclass ElasticDataGen.
>
> If ready nodes are issued in an unconstrained way, there can be more nodes
> outstanding, which results in divergence in timing compared to the O3CPU.
> Therefore, the Trace CPU also models hardware resources.
> A sub-class to model hardware resources is added which contains the maximum
> sizes of the load buffer, store buffer and ROB. If resources are not
> available, the node is not issued. The 'depFreeQueue' structure holds nodes
> that are pending issue.
>
> Modeling the ROB size in the Trace CPU as a resource limitation is arguably
> the most important parameter of all resources. The ROB occupancy is
> estimated using the newly added field 'robNum'. We need to use the ROB number
> as the sequence number is at times much higher due to squashing, and trace
> replay is focused on correct path modeling.
>
> A map called 'inFlightNodes' is added to track nodes that are not only in
> the readyList but also load nodes that are executed (and thus removed from
> readyList) but are not complete. ReadyList handles what node to execute next
> and when, while inFlightNodes is used for resource modelling. The oldest
> ROB number is updated when any node occupies the ROB or when an entry in the
> ROB is released. The ROB occupancy is equal to the difference between the ROB
> number of the newly dependency-free node and the oldest ROB number in flight.
>
> If no node depends on a non load/store node then there is no reason to
> track it in the dependency graph. We filter out such nodes but count them and
> add a weight field to the subsequent node that we do include in the trace.
> The weight field is used to model ROB occupancy during replay.
>
> The depFreeQueue is chosen to be FIFO so that child nodes which are in
> program order get pushed into it in that order and thus issued in
> program order, like in the O3CPU. This is also why the dependents container
> is changed from std::set to a sequential container, std::vector. We only
> check the head of the depFreeQueue as nodes are issued in order and blocking
> on the head models that better than looping over the entire queue. An
> alternative choice would be to inspect the top N pending nodes, where N is
> the issue-width.
> This is left for future work as the timing correlation looks good as it is.
>
> At the start of an execution event, first we attempt to issue such pending
> nodes by checking if appropriate resources have become available. If yes, we
> compute the execute tick with respect to the time then. Then we proceed to
> complete nodes from the readyList.
>
> When a read response is received, sometimes a dependency on it that was
> supposed to be released when it was issued is still not released. This occurs
> because the dependent gets added to the graph after the read was sent. So the
> check is made less strict and the dependency is marked complete on read
> response instead of insisting that it should have been removed on read sent.
>
> There is a check for requests spanning two cache lines as this condition
> triggers an assert fail in the L1 cache. If a request does span two lines,
> its size is truncated to access only until the end of that line and the
> remainder is ignored. Strictly-ordered requests are skipped and the
> dependencies on such requests are handled by simply marking them complete
> immediately.
>
> The simulated seconds can be calculated as the difference between the
> final_tick stat and the tickOffset stat. A CountedExitEvent that contains
> a static int belonging to the Trace CPU class as a down counter is used to
> implement multi Trace CPU simulation exit.
>
>
> Diffs
> -----
>
>   src/cpu/trace/trace_cpu.hh PRE-CREATION
>   src/cpu/trace/trace_cpu.cc PRE-CREATION
>   src/cpu/trace/SConscript PRE-CREATION
>   src/cpu/trace/TraceCPU.py PRE-CREATION
>
> Diff: http://reviews.gem5.org/r/3029/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Curtis Dunham
>
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
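
The cache-line check in the quoted description (truncate a request that spans
two lines so that it ends at the line boundary) could look roughly like the
sketch below. The helper name truncateToLine and its parameters are
hypothetical and are not taken from the patch, and the line size is assumed
to be a power of two.

```cpp
#include <cstdint>

using Addr = std::uint64_t;

// Hypothetical helper, not from the patch. Returns how many bytes of a
// request starting at 'addr' with 'size' bytes fit in the cache line that
// contains 'addr'. 'lineSize' is assumed to be a power of two.
inline unsigned
truncateToLine(Addr addr, unsigned size, unsigned lineSize)
{
    // First byte address past the line that contains 'addr'.
    const Addr lineEnd = (addr & ~static_cast<Addr>(lineSize - 1)) + lineSize;
    if (addr + size > lineEnd) {
        // Request spans two lines: keep only the part up to the line end.
        return static_cast<unsigned>(lineEnd - addr);
    }
    return size;
}
```

With a 64-byte line, an 8-byte request at address 0x3C would be cut to the
4 bytes that remain before 0x40, while the same request at 0x38 fits in the
line and is left unchanged.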
