Re: [gem5-dev] Review Request: Forward invalidations from Ruby to O3 CPU

Brad Beckmann Fri, 28 Oct 2011 15:01:10 -0700


> On 2011-10-27 22:35:21, Brad Beckmann wrote:
> > Thanks for the heads up on this patch.  I'm glad you found the time to dive 
> > into it.
> > 
> > 
> > 
> > I'm confused that the comment mentions a "list of ports", but I don't see a 
> > list of ports in the code, and I'm not sure how it would even be used.
> > 
> > The two questions you pose are good ones.  Hopefully someone who 
> > understands the O3 LSQ can answer the first, and I would suggest creating a 
> > new directed test that can manipulate the enqueue latency on the mandatory 
> > queue to create the necessary test situations. 
> > 
> > Also, I have a couple high-level comments right now:
> > 
> > 
> > 
> > - Ruby doesn't implement any particular memory model.  It just implements 
> > the cache coherence protocol, and more specifically invalidation based 
> > protocols.  The protocol, in combination with the core model, results in 
> > the memory model.
> > 
> > 
> > - I don't think it is sufficient to just forward those probes that hit 
> > valid copies to the O3 model.  What about replacements of blocks that have 
> > serviced a speculative load?  Instead, my thought would be to forward all 
> > probes to the O3 LSQ and think of cpu-controlled policies to filter out 
> > unnecessary probes.
> 
> Nilay Vaish wrote:
>     Hi Brad, thanks for the response.
>     
>     * A list of ports has been added to RubyPort.hh, the ports are added
>       to the list whenever a new M5Port is created.
>     
>     * As long as the core waits for an ack from the memory system for every
>       store before issuing the next one, I can understand that the memory
>       model is independent of how the memory system is implemented. But
>       suppose the caches are multi-ported. Then will the core only use one of
>       the ports for stores and wait for an ack? The current LSQ
>       implementation uses as many ports as available. In this case, would
>       not the memory system need to ensure the order in which the stores are
>       performed?
>     
>     * I think the current implementation handles blocks for which coherence
>       permissions were speculatively fetched. If the cache loses permissions
>       on this block, then it will forward the probe to the CPU. If the cache
>       again receives a probe for this block, I don't think that the CPU will
>       have any instruction using the value from that block.
>     
>     * For testing, Prof. Wood suggested having something similar to TSOtool.
> 
> Brad Beckmann wrote:
>     Hmm...I'm now even more confused.  I have not looked at the O3 LSQ, but
>     it sounds like from your description that one particular instantiation
>     of the LSQ will use N ports, not just a single port to the L1D.  So does
>     N equal the number of simultaneous loads and stores that can be issued
>     per cycle, or is N equal to the number of outstanding loads and stores
>     supported by the LSQ?  Or does it equal something completely different?
>     
>     Stores to different cache blocks can be issued to the memory system
>     out-of-order and in parallel.  Ruby already supports such functionality.
>     The key is the store buffer must be drained in-order.  It is up to the
>     store buffer's functionality to get that right.  Ruby can assist by
>     providing interfaces for checking permission state and forwarding probes
>     upstream, but it is up to the LSQ/store buffer to act appropriately and
>     retry requests when necessary.  I don't believe Ruby needs any
>     fundamental changes to support x86-TSO.  Instead, Ruby just needs to
>     provide more information back to the LSQ.
>     
>     Earlier I didn't notice that you also squash speculation on
>     replacements, in addition to probes.  Yeah, I think those changes take
>     care of correctly squashing speculative loads.  However, as I mentioned
>     above, I still think we need to figure out how to provide the necessary
>     information to allow stores to be issued in parallel, while still
>     retiring in-order.
>     
>     Implementing something similar to TSOtool would be great.  However, I
>     think there is benefit to do some quick tests using a DirectedTester
>     before creating something like TSOtool.
>     
>     
>
> 
> Nilay Vaish wrote:
>     Brad,
>     
>     My understanding is that the LSQ can issue at most N loads and stores to
>     the memory system in each cycle.
>     
>     For parallel stores, it seems that the core should have permissions for
>     these cache blocks all at the same time. Even if Ruby fetches coherence
>     permissions out-of-order, it would still have to ensure, for SC or TSO,
>     that stores that happened logically later in time become visible only
>     after all the earlier ones are visible to the rest of the system. As of now,
>     I disagree with the statement that --
>               '' Stores to different cache blocks can be issued to the
>                  memory system out-of-order and in parallel ''
>     Unless we have some kind of guarantee on the order in which these stores
>     become visible to the rest of the system, I don't see how we can separate
>     out the memory system's behavior from the consistency model.
>     
>     I was thinking of writing a tester that reads in a trace of memory
>     operations performed by a multi-processor system and the times at which
>     these are performed. Then we can check the load values against the
>     expected load values. I think the underlying assumption is that
>     everything behaves in a deterministic fashion. What do you think?


Thanks for confirming the O3 LSQ requirement for N ports.  I've got no further 
questions on that.

Stores can certainly be issued out-of-order in modern x86 processors.  It is 
the store buffer's responsibility to ensure that stores become globally visible 
in program order.  Maybe what you're getting at is that Ruby needs to support a 
two-phase store scheme so that the initial writeHitCallback supplies data to 
the CPU but does not update the L1 D cache block.  I would agree to that.  My 
point is that Ruby should only be responsible for providing the necessary 
information and interfaces to the LSQ logic.  There is no reason to change the 
logic of Ruby's invalidation-based coherence protocols.  It is the LSQ's 
(including store buffer) responsibility to ensure the correct order of store 
visibility.
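
To make that division of labor concrete, here is a minimal Python sketch. It is
not gem5 code: StoreBuffer, grant_permission and on_probe are hypothetical
names invented for illustration. It assumes the memory system only reports
when write permission for a block arrives or is lost, and that the store
buffer alone decides when a store becomes globally visible.

```python
from collections import deque


class StoreBuffer:
    """Toy store buffer (hypothetical, not the O3 LSQ).

    Write permissions may arrive in any order, but stores become
    globally visible strictly in program order.
    """

    def __init__(self):
        self.entries = deque()   # program order, oldest entry on the left
        self.memory = {}         # stands in for globally visible state
        self.visible_order = []  # addresses in the order they became visible

    def issue(self, addr, value):
        """Record a store; permission is requested in parallel elsewhere."""
        entry = {"addr": addr, "value": value, "has_permission": False}
        self.entries.append(entry)
        return entry

    def grant_permission(self, entry):
        """The memory system reports that write permission has arrived."""
        entry["has_permission"] = True
        self.drain()

    def on_probe(self, addr):
        """A forwarded probe or replacement removes permission again.

        Returns the entries whose permission request must be retried.
        """
        lost = [e for e in self.entries
                if e["addr"] == addr and e["has_permission"]]
        for e in lost:
            e["has_permission"] = False
        return lost

    def drain(self):
        """Retire stores from the head only, so visibility is in order."""
        while self.entries and self.entries[0]["has_permission"]:
            e = self.entries.popleft()
            self.memory[e["addr"]] = e["value"]
            self.visible_order.append(e["addr"])


sb = StoreBuffer()
older = sb.issue(0x100, 1)
younger = sb.issue(0x140, 2)
sb.grant_permission(younger)   # younger store gets permission first
print(sb.visible_order)        # [] (the head of the buffer is still waiting)
sb.grant_permission(older)
print(sb.visible_order)        # [256, 320]
```

In this sketch the younger store obtains permission first but does not become
visible until the older store at the head of the buffer has drained.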

Yes, your tester idea is essentially what I had in mind.  The only thing I want 
to point out is that it may be beneficial to include both the time the request 
should issue and a delta of how long the request should be stalled in the 
mandatory queue.  That way you can instigate races where younger memory ops 
deterministically bypass older ops.
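
As a rough sketch of what such a trace entry could look like (plain Python, not
an existing gem5 tester; TraceOp, issue_tick and stall_delta are hypothetical
names), each op carries its issue time plus the extra delay it should spend in
the mandatory queue. The helper then reports which younger ops from the same
CPU reach the memory system before an older one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceOp:
    """One memory operation in a hypothetical directed-test trace."""
    cpu: int
    seq: int          # program order within the CPU
    kind: str         # "LD" or "ST"
    addr: int
    issue_tick: int   # when the CPU issues the request
    stall_delta: int  # extra ticks the request is held in the mandatory queue

    @property
    def arrival_tick(self):
        """Tick at which the request leaves the mandatory queue."""
        return self.issue_tick + self.stall_delta


def arrival_order(trace):
    """Ops sorted by when the memory system actually sees them."""
    return sorted(trace, key=lambda op: (op.arrival_tick, op.cpu, op.seq))


def bypasses(trace):
    """Return (cpu, younger_seq, older_seq) for every same-CPU bypass."""
    pairs = []
    for older in trace:
        for younger in trace:
            if (younger.cpu == older.cpu
                    and younger.seq > older.seq
                    and younger.arrival_tick < older.arrival_tick):
                pairs.append((younger.cpu, younger.seq, older.seq))
    return pairs


trace = [
    TraceOp(cpu=0, seq=0, kind="ST", addr=0x100, issue_tick=10, stall_delta=50),
    TraceOp(cpu=0, seq=1, kind="LD", addr=0x140, issue_tick=11, stall_delta=0),
]
print([op.seq for op in arrival_order(trace)])  # [1, 0]
print(bypasses(trace))                          # [(0, 1, 0)]
```

Because both fields are fixed in the trace, the same race is reproduced on
every run, provided the rest of the simulation is deterministic.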


- Brad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1620
-----------------------------------------------------------


On 2011-10-17 23:50:47, Nilay Vaish wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.m5sim.org/r/894/
> -----------------------------------------------------------
> 
> (Updated 2011-10-17 23:50:47)
> 
> 
> Review request for Default.
> 
> 
> Summary
> -------
> 
> This patch implements the functionality for forwarding invalidations
> and replacements from the L1 cache of the Ruby memory system to the O3
> CPU. The implementation adds a list of ports to RubyPort. Whenever a
> replacement or an invalidation is performed, the L1 cache forwards this to
> all the ports, which I believe is the LSQ in case of the O3 CPU. Those who
> understand the O3 LSQ should take a close look at the implementation and
> figure out (at least qualitatively) if something is missing or erroneous.
> 
> This patch only modifies the MESI CMP directory protocol. I will modify other
> protocols once we sort the major issues surrounding this patch.
> 
> My understanding is that this should ensure an SC execution, as
> long as Ruby can support SC. But I think Ruby does not support any 
> memory model currently. A couple of issues that need discussion --
> 
> * Can this get into a deadlock? A CPU may not be able to proceed if
>   a particular cache block is repeatedly invalidated before the CPU
>   can retire the actual load/store instruction. How do we ensure that
>   at least one instruction is retired before an invalidation/replacement
>   is processed?
> 
> * How to test this implementation? Is it possible to implement some of the
>   tests that we regularly come across in papers on consistency models? Or
>   those present in manuals from AMD and Intel? I have tested that Ruby will
>   forward the invalidations, but not the part where the LSQ needs to act on
>   it.
> 
> 
> Diffs
> -----
> 
>   build_opts/ALPHA_SE_MESI_CMP_directory 92ba80d63abc 
>   configs/example/se.py 92ba80d63abc 
>   configs/ruby/MESI_CMP_directory.py 92ba80d63abc 
>   src/mem/protocol/MESI_CMP_directory-L1cache.sm 92ba80d63abc 
>   src/mem/protocol/RubySlicc_Types.sm 92ba80d63abc 
>   src/mem/ruby/system/RubyPort.hh 92ba80d63abc 
>   src/mem/ruby/system/RubyPort.cc 92ba80d63abc 
>   src/mem/ruby/system/Sequencer.hh 92ba80d63abc 
>   src/mem/ruby/system/Sequencer.cc 92ba80d63abc 
> 
> Diff: http://reviews.m5sim.org/r/894/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Nilay
> 
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
