Hi guys,

  I've been working on some benchmarks that place new stresses on
heterogeneous CPU-GPU memory hierarchies. In trying to tighten up
hierarchy performance, I've run into a number of strange cache
buffering/flow control issues in Ruby. We've talked about fixing these
things, but first I needed to inventory all the places where
buffering/prioritization needs work. Below is my list, which can hopefully
serve as a starting point and offer a broader picture to anyone who wants
to use Ruby with more realistic memory hierarchy buffering. I've included
my current status on each.

  Please let me know if you have any input or would like to help address
the issues. Any help would be appreciated.

  Thank you!
  Joel


1) [status: not started] SLICC parses each action to accumulate the total
buffer capacity required by all enqueue operations within the action.
Unfortunately, this resource checking is usually overly conservative,
resulting in two problems:
   A) Many actions contain multiple code paths, and not all paths push
requests into buffers when the action executes. The actual resources
required are frequently fewer than SLICC's parsed value. For example, an
action with an if-else block containing an enqueue() on each path will
parse as requiring two buffer slots, even though only one of the
enqueue() calls will execute.
   B) The resource checking can poorly prioritize transitions that require
more resources than other transitions. For instance, a high-priority
transition (e.g. responses) may require a slot in 2 separate buffers, while
a lower-priority transition (e.g. requests) may require only one of those
slots. If the higher-priority transition gets blocked, the lower-priority
transition can be allowed to proceed, resulting in priority inversion and
possibly even starvation.
  Performance debugging these issues can be exceptionally difficult. As an
example, in MOESI_hammer, a directory transition that would involve
activity only if using a full-bit directory may register excessive buffer
requirements and block waiting for the unnecessary buffers (even though the
directory is not configured to use the full-bit data!).
  By manually hacking the generated files to avoid these incorrect buffer
requirements, I've already seen performance improvements of greater than
3%, and I haven't even stressed the memory hierarchy.
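
  To make (A) concrete, here's a toy Python sketch (not SLICC or gem5
code; the function names are mine) of the gap between the statically
parsed slot count and what an execution of the action actually uses:

```python
# Hypothetical model of SLICC's accounting for an action like:
#   if (cond) { enqueue(reqQueue, ...); } else { enqueue(rspQueue, ...); }
# Each entry is (branch, target_buffer); branch None means unconditional.

def static_slot_count(action_enqueues):
    """SLICC-style accounting: count every enqueue in the action body,
    regardless of which branch actually executes."""
    return len(action_enqueues)

def dynamic_slot_count(action_enqueues, taken_branch):
    """What the hardware actually needs: only the enqueues on the
    branch that executes (plus any unconditional ones)."""
    return sum(1 for branch, _ in action_enqueues
               if branch in (None, taken_branch))

action = [("if", "reqQueue"), ("else", "rspQueue")]

print(static_slot_count(action))         # 2 slots reserved...
print(dynamic_slot_count(action, "if"))  # ...but only 1 is ever used
```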

2) [status: complete, posted] Finite-sized buffers are not actually
finite-sized: When a Ruby controller stalls a message in a MessageBuffer,
the message is removed from the MessageBuffer's m_prio_heap and placed in
an m_stall_msg_map queue until the queue is reanalyzed by a wake-up
activity in the controller. Unfortunately, when checking whether there is
enough space in the buffer to add more messages, the measured size only
considers m_prio_heap, not messages sitting in m_stall_msg_map. In extreme
cases, I've seen m_stall_msg_map hold >500 messages in a MessageBuffer
with size = 10. Here's a patch that fixes this:
http://reviews.gem5.org/r/3283/
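
  A toy Python model of the occupancy bug (member names mirror Ruby's
MessageBuffer, but this is an illustrative sketch, not gem5 code):

```python
class ToyMessageBuffer:
    def __init__(self, max_size):
        self.max_size = max_size
        self.m_prio_heap = []        # schedulable messages
        self.m_stall_msg_map = {}    # line address -> stalled messages

    def buggy_slots_available(self, n):
        # Original behavior: stalled messages don't count against capacity.
        return len(self.m_prio_heap) + n <= self.max_size

    def fixed_slots_available(self, n):
        # Fixed behavior: stalled messages still occupy buffer space.
        stalled = sum(len(q) for q in self.m_stall_msg_map.values())
        return len(self.m_prio_heap) + stalled + n <= self.max_size

buf = ToyMessageBuffer(max_size=10)
buf.m_prio_heap = ["msg"]                 # one schedulable message
buf.m_stall_msg_map[0x40] = ["msg"] * 9   # nine stalled on one line

print(buf.buggy_slots_available(5))   # True: buffer looks nearly empty
print(buf.fixed_slots_available(5))   # False: it is actually full
```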

3) [status: not started] Virtual channel specification and prioritization
is inconsistent: Currently, in each cycle, the PerfectSwitch in the Ruby
simple network iterates through virtual channels from highest ID to
lowest, so higher IDs have higher priority. By contrast, Garnet cycles
through virtual channel IDs from lowest to highest, so lower IDs have
higher priority. Since SLICC controller files specify virtual channels
independently of the interconnect used with Ruby, the virtual channel
prioritization may be inverted depending on the network in use. The Ruby
network models need to agree on the prioritization in order to avoid
potential priority inversion.
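
  A minimal sketch of the inconsistency (iteration order only, not the
actual arbitration code): given the same set of ready virtual channels,
the two networks pick opposite winners.

```python
def perfect_switch_pick(ready_vcs):
    # Simple network: scans from highest VC ID down, so higher IDs win.
    return max(ready_vcs)

def garnet_pick(ready_vcs):
    # Garnet: scans from lowest VC ID up, so lower IDs win.
    return min(ready_vcs)

ready = {1, 3}   # e.g. VC 1 carries requests, VC 3 carries responses
print(perfect_switch_pick(ready))  # 3: responses prioritized
print(garnet_pick(ready))          # 1: requests prioritized -> inverted
```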

4) [status: not started] Sequencers push requests into mandatory queues
regardless of whether the mandatory queue is finite-sized and possibly
full. With a poorly configured sequencer and L1 mandatory queue, it is
possible to fill the L1 mandatory queue but still have space in the
sequencer's requestTable. Since the Sequencer doesn't check whether the
mandatory queue has slots available, it cannot honor the mandatory queue's
capacity. This should be fixed, and/or a warn/fatal should be raised to
let the user know about the poor configuration.
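
  The missing check might look like this toy Python sketch (class and
method names are mine, not the actual gem5 Sequencer API):

```python
class ToySequencer:
    def __init__(self, mandatory_queue_size, request_table_size):
        self.mandatory_queue = []
        self.mandatory_queue_size = mandatory_queue_size
        self.request_table = []
        self.request_table_size = request_table_size

    def make_request(self, req):
        if len(self.request_table) >= self.request_table_size:
            return "retry"                # existing back-pressure path
        if len(self.mandatory_queue) >= self.mandatory_queue_size:
            return "retry"                # the missing capacity check
        self.request_table.append(req)
        self.mandatory_queue.append(req)
        return "issued"

# Poor configuration: mandatory queue much smaller than the requestTable.
seq = ToySequencer(mandatory_queue_size=2, request_table_size=8)
results = [seq.make_request(i) for i in range(4)]
print(results)   # ['issued', 'issued', 'retry', 'retry']
```

Without the second check, the third and fourth requests would overflow
the mandatory queue even though the requestTable still has room.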

5) [status: complete, revising] SimpleNetwork access prioritization is not
suited for finite buffering + near-peak bandwidth: The PerfectSwitch uses
a round-robin prioritization scheme to select which input port may issue
to an output port, and it steps through input ports in ascending order to
find one with ready messages. When some input port buffers are full and
others are empty, the lower-ID input ports effectively get priority
whenever the round-robin ID is greater than the highest-ID input port that
has messages. For fair prioritization, input ports with ready messages
should not be allowed to issue twice before other input ports with ready
messages issue once. My cursory inspection of Garnet routers suggests that
they probably suffer from the same arbitration issue.
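
  A toy Python model of the unfairness (a sketch of the scheme described
above, not the PerfectSwitch code): when the round-robin pointer advances
independently of who actually won, a low-ID port can issue twice before a
higher-ID ready port issues once. Restarting the scan just past the
previous winner restores fairness.

```python
def arbitrate(ready, num_ports, start):
    """Scan ascending (with wrap) from `start`; return first ready port."""
    for i in range(num_ports):
        port = (start + i) % num_ports
        if port in ready:
            return port
    return None

NUM = 4
ready = {0, 2}          # ports 1 and 3 have empty buffers

# Buggy scheme: the pointer just increments each cycle, regardless of
# which port actually issued.
winners_buggy, ptr = [], 3
for _ in range(4):
    winners_buggy.append(arbitrate(ready, NUM, ptr))
    ptr = (ptr + 1) % NUM

# Fair scheme: the next scan starts just past the previous winner.
winners_fair, ptr = [], 3
for _ in range(4):
    w = arbitrate(ready, NUM, ptr)
    winners_fair.append(w)
    ptr = (w + 1) % NUM

print(winners_buggy)  # [0, 0, 2, 2]: port 0 issues twice in a row
print(winners_fair)   # [0, 2, 0, 2]: ready ports alternate
```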

6) [status: complete, revising] QueuedMasterPort used for requests from
Ruby directories to memory controllers: This queue fills up very quickly
with a GPU requester and triggers the PacketQueue panic that the queue is
too large (>100 packets). The RubyMemoryControl has infinite input-port
queuing, so it can be used, but other memory controllers cannot. Further,
I have measured that even with roughly reasonable buffering throughout the
memory hierarchy, average memory stall cycles in the Ruby memory
controller in-queue can be upwards of 5,000 cycles (which is nonsensical).
To fix this, we need to pull the queuing out of the Directory_Controller
memory port and into a finite queue managed by the *-dir.sm files, and
handle flow control in the port to memory. I have mostly implemented this
and will post a patch for review soon.

7) [status: partially implemented] QueuedSlavePort from memory controllers
back to Ruby directories: After fixing the memory controller input queues,
the bloated buffering immediately shifts to the memory controller response
queues, which are implemented as QueuedSlavePorts. I've started fixing
this in the DRAMCtrl, but given the complexity, have yet to finish. The
RubyMemoryControl has the same issue, but since it is deprecated, I don't
think it's worth the effort to fix.

8) [status: partially implemented] Allowing multiple requests to a single
cache line into Ruby cache controllers: Currently, Ruby Sequencers block
multiple outstanding requests to a single cache line, while the new
GPUCoalescer buffers such requests before they can enter the cache
controllers. Both schemes introduce significant inaccuracy compared to
hardware, which can accept multiple accesses per line and queue them as
appropriate (e.g. using MSHRs if the line is in an intermediate state,
waiting on a request outstanding to a lower level of the hierarchy, etc.).
To get reasonable modeling, Sequencers will need to pass memory requests
to the cache controllers regardless of whether they access a line in an
intermediate state. I have implemented this for stores in no-RFO GPU
caches, and the performance difference can be massive (e.g. 1.5-3x). The
GPUCoalescer will not suffice for this use case, because it requires RFO
access to the line.

9) [status: not started] Coalescing within the caches: With the addition
of per-byte dirty bits and AMD's GPU cache controllers, there appear to be
places where request coalescing can/should be implemented in the caches.
For example, most L1 cache controllers block stores in mandatory queues
while the line is in an intermediate state, but these stores can often be
accumulated into a single MSHR and written to the cache block as the cache
array is filled with the line. This can have a substantial performance
effect by cutting L1->L2 and L2->MC accesses by factors of up to 32+.
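
  A sketch of the idea in Python (an illustrative toy, not the AMD GPU
controller code): stores to a line in a transient state accumulate in one
MSHR-like entry with a per-byte dirty mask, instead of each generating
separate downstream traffic.

```python
LINE_SIZE = 64

class ToyWriteMSHR:
    def __init__(self):
        self.data = bytearray(LINE_SIZE)
        self.dirty = [False] * LINE_SIZE   # per-byte dirty bits
        self.stores_merged = 0

    def merge_store(self, offset, payload):
        # Accumulate the store into the pending-write entry.
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.dirty[offset + i] = True
        self.stores_merged += 1

mshr = ToyWriteMSHR()
for lane in range(16):                     # e.g. 16 GPU lanes, 4B each
    mshr.merge_store(lane * 4, b"\xab" * 4)

print(mshr.stores_merged)   # 16 stores merged...
print(sum(mshr.dirty))      # ...covering all 64 bytes of the line
# When the fill returns, the dirty bytes are written into the cache array
# once: one downstream access instead of 16 separate ones.
```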

10) [status: not started] TBE (MSHR) allocation: Currently, TBETables are
finite-sized and disallow over-allocation: if the TBETable size is not set
large enough, attempted over-allocation causes assertion failures. Often,
the sizing required to avoid these failures is unrealistic (e.g. an RFO
GPU L1 cache with 16kB capacity might need as many TBEs as there are
entries in the cache itself), which limits our ability to test more
reasonable TBE restrictions. It should be straightforward to assess which
transitions need to allocate TBEs, so we can test for TBE availability in
the controller wakeup functions and let wakeup skip over transitions that
need TBEs when none are available.
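
  The proposed wakeup check could look like this toy Python sketch (names
are illustrative, not SLICC; `allocates_tbe` stands in for the per-
transition analysis described above):

```python
class ToyTBETable:
    def __init__(self, size):
        self.size = size
        self.entries = set()

    def slot_available(self):
        return len(self.entries) < self.size

def wakeup(transitions, tbe_table):
    """Return the transitions that can fire this cycle. Each transition
    is (name, allocates_tbe), where allocates_tbe would come from an
    analysis of the transition's actions."""
    runnable = []
    for name, allocates_tbe in transitions:
        if allocates_tbe and not tbe_table.slot_available():
            continue          # skip now instead of asserting later
        runnable.append(name)
    return runnable

tbes = ToyTBETable(size=1)
tbes.entries.add("0x100")     # table already full
trans = [("I->IM (needs TBE)", True), ("M hit (no TBE)", False)]
print(wakeup(trans, tbes))    # ['M hit (no TBE)']
```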

-- 
  Joel Hestness
  PhD Candidate, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
