Thanks Joel! This email is excellent! I will make sure it is "required reading" for all gem5 developers at AMD. I have several thoughts below:
1) This is a great reminder that actions should be designed to be simple. Conditional checks should be handled by generating different events. Unfortunately, we have protocols today that have pushed far too much logic into actions. That leads to conservative resource checking and protocols that are very difficult to debug. I definitely want to take that lesson forward when reviewing future SLICC code.

2) You have a "ship it" from me. Thank you for catching and fixing the bug!

3) Yes, this has been a source of confusion for a long time. I remember looking at this a while back and noticing that the different protocols were not consistent either. I think the solution here is not to use an integer, but to introduce a priority data type that is used consistently across Ruby, SimpleNetwork, and Garnet.

4) I personally think the solution here is to get rid of the mandatory queue. I do not think it models a real resource, and stalling it can lead to *huge* performance problems. Instead, I would like to see all RubyPorts include two maps: one for buffered, but not yet issued, requests, and another for issued requests.

5) Have you posted a patch for this? If it is a simple fix, then I think it is reasonable for the SimpleNetwork. However, we need to make sure we keep the SimpleNetwork simple. This absolutely sounds like an issue that Garnet should handle correctly.

6 & 7) So does this cover all usage of PacketQueues in Ruby (both QueuedMasterPorts and QueuedSlavePorts)? I believe all RubyPorts currently use the packet queues. I would like to see them all fixed, and we are willing to help with this.

8) I think we have different ideas of what hardware looks like. Hardware does not access the cache tag and data arrays multiple times for coalesced requests; there is only one access. I believe your #9 agrees with that and somewhat contradicts what you are saying here. Currently, every request sent to the controller requires a separate access to the tags and data, as specified in the .sm file.
Also note that while the GPUCoalescer expects RfO protocol behavior, the VIPERCoalescer does not.

9) Coalescing in the caches adds a lot of unnecessary complexity to the protocol. It is far easier to do it in the RubyPorts; that is the point of the Sequencer and Coalescers (i.e., the RubyPorts). Do the protocol-independent work in the RubyPorts, and then implement only the necessary protocol complexity in the cache controllers. I believe we all agree the RubyPorts need to be fixed, but I think we fix them by adding a second map for stalled, not yet issued, requests and removing the mandatory queue.

10) I'm a bit confused here. Transitions that allocate TBE entries already check for TBE availability. As long as protocols do not use z_stalls on the mandatory queue (BTW, we should make it a goal to remove all support for z_stalls), independent events that do not need a TBE entry should be triggered. Are you seeing different behavior? If so, in which protocol?

Thanks again, Joel. This is a great reference and lists several items we should work on.

Brad

-----Original Message-----
From: gem5-dev [mailto:[email protected]] On Behalf Of Joel Hestness
Sent: Tuesday, February 02, 2016 12:50 PM
To: gem5 Developer List
Subject: [gem5-dev] Toward higher-fidelity Ruby memory: Using (actually) finite buffering

Hi guys,
  I've been working on some benchmarks that place unique/new stresses on heterogeneous CPU-GPU memory hierarchies. In trying to tighten up the hierarchy performance, I've run into a number of strange cache buffering/flow control issues in Ruby. We've talked about fixing these things, but I've found a need to inventory all the places where buffering/prioritization needs work. Below is my list, which can hopefully serve as a starting point and offer a broader picture to anyone who wishes to use Ruby with more realistic memory hierarchy buffering. I've included my current status addressing each. Please let me know if you have any input or would like to help address the issues.
Any help would be appreciated. Thank you!
  Joel

1) [status: not started] SLICC parses actions to accumulate the total buffer capacity required by all enqueue operations within the action. Unfortunately, this resource checking is usually overly conservative, resulting in two problems:

A) Many actions contain multiple code paths, and not all paths get executed to push requests into buffers. The actual resources required are frequently less than SLICC's parsed value. As an example, an action with an if-else block containing an enqueue() on both paths will parse to require two buffer slots even though only one or the other enqueue() will be called, never both.

B) The resource checking can result in poorly prioritized transitions if they require allocating more resources than other transitions. For instance, a high-priority transition (e.g., a response) may require a slot in two separate buffers, while a lower-priority transition (e.g., a request) may require only one of those slots. If the higher-priority transition gets blocked, the lower-priority transition can be allowed to proceed, resulting in priority inversion and possibly even starvation. Performance debugging these issues can be exceptionally difficult.

As an example of the performance impact, in MOESI_hammer, a directory transition that would involve activity only if using a full-bit directory may register excessive buffer requirements and block waiting for the unnecessary buffers (even though the directory is not configured to use the full-bit data!). By manually hacking generated files to avoid these incorrect buffer requirements, I've already witnessed performance improvements greater than 3%, and I haven't even stressed the memory hierarchy.
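To make the over-counting in (1) concrete, here is a toy Python sketch of the two accounting policies. This is not SLICC's actual parser code; the action representation and the function names (conservative_count, worst_path_count) are purely illustrative, assuming an action body modeled as a list of enqueues and if-else blocks:

```python
# Toy model of SLICC's per-action buffer-resource accounting (illustrative
# only, not gem5 code). An action body is a list where "enq" marks an
# enqueue() and ("if", then_branch, else_branch) marks an if-else block.

def conservative_count(body):
    """What the parsed check effectively does: count every enqueue()
    that appears in the action, regardless of which path executes."""
    total = 0
    for node in body:
        if node == "enq":
            total += 1
        else:
            _, then_b, else_b = node
            total += conservative_count(then_b) + conservative_count(else_b)
    return total

def worst_path_count(body):
    """A tighter bound: the maximum enqueues on any single execution path."""
    total = 0
    for node in body:
        if node == "enq":
            total += 1
        else:
            _, then_b, else_b = node
            total += max(worst_path_count(then_b), worst_path_count(else_b))
    return total

# An action with an if-else that enqueues on both paths: only one
# enqueue() runs at execution time, but two slots get reserved.
action = [("if", ["enq"], ["enq"])]
print(conservative_count(action))  # 2 slots reserved
print(worst_path_count(action))    # 1 slot actually needed
```

Even the per-path maximum is still an over-approximation when branch conditions make some paths unreachable, but it would already remove the if-else double counting described above.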
2) [status: complete, posted] Finite-sized buffers are not actually finite-sized: when a Ruby controller stalls a message in a MessageBuffer, that message is removed from the MessageBuffer's m_prio_heap and placed in m_stall_msg_map queues until the queue is reanalyzed by a wake-up activity in the controller. Unfortunately, when checking whether there is enough space in the buffer to add more requests, the measured size only considers the size of m_prio_heap, not the messages that might be in m_stall_msg_map. In extreme cases, I've seen m_stall_msg_map hold >500 messages in a MessageBuffer with size = 10. Here's a patch that fixes this: http://reviews.gem5.org/r/3283/

3) [status: not started] Virtual channel specification and prioritization is inconsistent: currently, in each cycle, the PerfectSwitch in the Ruby simple network iterates through virtual channels from highest ID to lowest, indicating that higher IDs have higher priority. By contrast, Garnet cycles through virtual channel IDs from lowest to highest, indicating that lower IDs have higher priority. Since SLICC controller files specify virtual channels independent of the interconnect used with Ruby, the virtual channel prioritization may be inverted depending on the network that is used. The different Ruby network models need to agree on the prioritization in order to avoid potential priority inversion.

4) [status: not started] Sequencers push requests into mandatory queues regardless of whether the mandatory queue is finite-sized and possibly full. With a poorly configured sequencer and L1 mandatory queue, it is possible to fill the L1 mandatory queue but still have space in the sequencer's requestTable. Since the Sequencer doesn't check whether the mandatory queue has slots available, it cannot honor the mandatory queue's capacity correctly. This should be fixed, and/or a warn/fatal should be raised to let the user know about the poor configuration.
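The accounting bug in (2) above can be sketched in a few lines. This is a toy Python model, not gem5 code; the field names m_prio_heap and m_stall_msg_map mirror the members mentioned above, but the class and method names are hypothetical:

```python
# Toy model of the MessageBuffer capacity check (illustrative only).

class ToyMessageBuffer:
    def __init__(self, max_size):
        self.max_size = max_size
        self.m_prio_heap = []      # messages visible to the controller
        self.m_stall_msg_map = {}  # stalled messages, keyed by address

    def buggy_slots_available(self, n):
        # Only counts the ready heap -- stalled messages are invisible
        # to the check, so the buffer can grow far past max_size.
        return len(self.m_prio_heap) + n <= self.max_size

    def fixed_slots_available(self, n):
        # Counts ready *and* stalled messages against the capacity.
        stalled = sum(len(q) for q in self.m_stall_msg_map.values())
        return len(self.m_prio_heap) + stalled + n <= self.max_size

buf = ToyMessageBuffer(max_size=10)
buf.m_prio_heap = ["msg"] * 2
buf.m_stall_msg_map = {addr: ["msg"] * 5 for addr in ("A", "B")}  # 10 stalled

print(buf.buggy_slots_available(1))  # True: the buffer looks nearly empty
print(buf.fixed_slots_available(1))  # False: it is already over capacity
```

With only the ready heap measured, nothing bounds the stall map, which is how a size-10 buffer can end up holding hundreds of stalled messages.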
5) [status: complete, revising] SimpleNetwork access prioritization is not suited for finite buffering + near-peak bandwidth: the PerfectSwitch uses a round-robin prioritization scheme to select the input port that has priority to issue to an output port, and it steps through input ports in ascending order to find one with ready messages. When some input port buffers are full and others are empty, the lower-ID input ports effectively get priority whenever the round-robin ID is greater than the highest-ID input port that has messages. For fair prioritization, input ports with ready messages should not be allowed to issue twice before other input ports with ready messages have been allowed to issue once. My cursory inspection of the Garnet routers suggests that they probably suffer from the same arbitration issue.

6) [status: complete, revising] QueuedMasterPort used for requests from Ruby directories to memory controllers: this fills up very quickly with a GPU requester and triggers the PacketQueue panic that the queue is too large (>100 packets). The RubyMemoryControl has infinite input port queuing, so it can be used, but other memory controllers cannot. Further, I have measured that even with roughly reasonable buffering throughout the memory hierarchy, average memory stall cycles in the Ruby memory controller in-queue can be upwards of 5,000 cycles (which is nonsensical). To fix this, we need to pull the queuing out of the Directory_Controller memory port and into a finite queue managed by the *-dir.sm files, and handle flow control in the port to memory. I have mostly implemented this and will post a patch for review soon.

7) [status: partially implemented] QueuedSlavePort from memory controllers back to Ruby directories: after fixing the memory controller input queues, the bloated buffering immediately jumps to the memory controller response queues, which are implemented as QueuedSlavePorts.
I've started trying to fix this up in the DRAMCtrl, but given the complexity, I have yet to finish it. The RubyMemoryControl has the same issue, but since we have deprecated it, I don't feel it would be a good idea to invest effort in fixing it.

8) [status: partially implemented] Allowing multiple requests to a single cache line into Ruby cache controllers: currently, Ruby Sequencers block multiple outstanding requests to a single cache line, while the new GPUCoalescer will buffer the requests before they can enter the cache controllers. Both of these schemes introduce significant inaccuracy compared to hardware, which can accept multiple accesses per line and queue them as appropriate (e.g., using MSHRs if the line is in an intermediate state, waiting on an outstanding request to a lower level of the hierarchy, etc.). In order to get reasonable modeling, Sequencers will need to pass memory requests to the cache controllers regardless of whether they access a line in an intermediate state. I have implemented this for stores in no-RFO GPU caches, and the performance difference can be massive (e.g., 1.5-3x). The GPUCoalescer will not suffice for this use case, because it requires RFO access to the line.

9) [status: not started] Coalescing within the caches: with the addition of per-byte dirty bits and AMD's GPU cache controllers, there appear to be places where request coalescing can/should be implemented in the caches. For example, most L1 cache controllers block stores in mandatory queues while the line is in an intermediate state, but often these stores can be accumulated into a single MSHR and written to the cache block as the cache array is filled with the line. This can have a substantial effect on performance by cutting L1->L2 and L2->MC accesses by factors up to 32+.

10) [status: not started] TBE (MSHR) allocation: currently, TBETables are finite-sized and disallow over-allocation.
If the TBETable size is not set reasonably large, over-allocation results in assertion failures. Often the sizing required to avoid assertion failures is unrealistic (e.g., an RFO GPU L1 cache with 16kB capacity might need as many TBEs as there are entries in the cache itself). This limits the ability to test more reasonable TBE restrictions. It should be straightforward to assess which transitions need to allocate TBEs, so we can test for TBE availability in the controller wakeup functions. This would allow wakeup to skip over transitions that need TBEs when none are available.

--
  Joel Hestness
  PhD Candidate, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
