Re: [gem5-dev] Review Request 3205: mem: support for gpu-style RMWs in ruby

Tony Gutierrez Fri, 08 Jan 2016 16:26:15 -0800


> On Jan. 6, 2016, 1:39 p.m., Andreas Hansson wrote:
> > I still find that the design needs a far better explanation, and at least 
> > one simple example. I would suggest implementing the SwapReq as an atomic 
> > op (as I guess it is our only example). Also, surely we need to impose 
> > restrictions on the granularity? If we use SwapReq as an example then 
> > things would be much more clear.
> > 
> > I am also not sure about Joel's comments having been addressed.
> 
> Tony Gutierrez wrote:
>     Our GPU code has lots of examples for AtomicOps, and the concept is 
> pretty simple: use function pointers to perform the atomic RMW op instead of 
> adding a bunch of conditions to check all possible SwapReqs. Here is how our 
> GPU CAS is implemented, and you can see all AtomicOps in gpu_dyn_inst.hh:
>     
>     template<typename T>
>     class AtomicOpCAS : public TypedAtomicOpFunctor<T>
>     {
>       public:
>         T c;
>         T s;
>     
>         ComputeUnit *computeUnit;
>     
>         AtomicOpCAS(T _c, T _s, ComputeUnit *compute_unit)
>           : c(_c), s(_s), computeUnit(compute_unit) { }
>     
>         void
>         execute(T *b)
>         {
>             computeUnit->numCASOps++;
>     
>             if (*b == c) {
>                 *b = s;
>             } else {
>                 computeUnit->numFailedCASOps++;
>             }
>     
>             if (computeUnit->xact_cas_mode) {
>                 computeUnit->xactCasLoadMap.clear();
>             }
>         }
>     };
>     
>     A call to the functor will call execute(). In this case the ISA inst 
> objects set the swap value 's' based on one of its src operands, and if the 
> current value 'b' hasn't changed from the orignial 'c' (another src operand), 
> b gets s.
>     
>     I think implementing the SwapReq as an AtomicOp would be as simple as 
> moving the logic in if (pkt->cmd == MemCmd::SwapReq) in abstract_mem.cc to a 
> derived AtomicOpFunctor class and calling it, but I would prefer not to make 
> that change part of this patch. I'm not sure how the AtomicOpFunctor would be 
> setup by the CPU ISAs here and I don't have the time or the infrastructure to 
> test the changes sufficiently to ensure they work properly for the CPU ISAs 
> that use this (ARM and SPARC?), so I think it'd be safer to leave it as it is 
> for now.
>     
>     Getting this patch in is necessary for us to push our GPU code, which is 
> pretty important to lots of people in the community - at least based on the 
> e-mails from users I've been getting - and I would hope this issue wouldn't 
> hold that up.
>     
>     As far as restricting the granularity, I can't imagine a case where one 
> would try to perform a RMW on a size greater than a cache line, but to ensure 
> it I can add a fatal if the size is greater than a cache block.
> 
> Andreas Hansson wrote:
>     I am still trying to come to terms with the overall design choices, and 
> my feeling is that this opens up for lots of complexity (or at least 
> non-transparency).
>     
>     My main hesitation is that this completely departs from the existing gem5 
> self-sufficient request/packet structure, and by doing so makes wrapping in 
> TLM2 or SST more or less impossible. The bridging is really important, and so 
> to justify a change like this we need to carefully consider the pros and cons.
>     
>     What is the actual data payload of the packet in your example? Does it 
> not make more sense to at least tie these operations to the request or packet 
> (not just by pointers), and that way make sure that at least the payload 
> reflects the current state?
> 
> Tony Gutierrez wrote:
>     Ah, your issue is clear to me now. The actual payload, unfortunately, IS 
> the functor. As I mentioned to Joel previously, these carry the data around 
> with them, which is why the original author used functors instead of function 
> pointers. This is also something I have been meaning to fix, but haven't had 
> time.
>     
>     If I move all the functors to the request would you be ok with me pushing 
> this now? The ultimate goal being that the req would carry a regular function 
> pointer to a member method around, and the data payload will be part of the 
> packet data/extra data, and will be moved out of the functors.
> 
> Joel Hestness wrote:
>     Yep, it sounds like Andreas is interested in the same thing as me. The 
> thing is, if we changed this patch as Tony offers above, we still need the 
> function pointer in the request or packet metadata, so it doesn't really 
> address the issue. It's more busy work, so I'd rather just advocate for 
> addressing this in a later patch.
>     
>     I forgot to mention previously that gem5-gpu has code for packing atomics 
> into packet payload (line ~355: 
> http://gem5-gpu.cs.wisc.edu/repo/gem5-gpu/file/92d1e15704a4/src/gpu/lsq_warp_inst_buffer.cc
>  ). The process is a little tricky and uses data size rather than templates 
> there, but might be instructive when we get to that.
> 
> Andreas Hansson wrote:
>     I'd be far more comfortable with these changes, but I would also like the 
> people that have contributed the SST and TLM wrappers to have a look. I 
> really worry about interoperability here.
> 
> Tony Gutierrez wrote:
>     We agree with you in principle on the design not being great, and as I 
> mentioned this is somethign that was already on my 'nice to have' list, but 
> it wasn't high priorty. Now that we understand your conerns better, it has 
> become higher priority. That said, I think our code is isolated relatively 
> well at this point, so it shoudn't affect anything as of now. The pointer 
> will be carried around by the request, but if you're not running GPU code it 
> shouldn't matter - unless the current CAS were to be implemented as an 
> AtomicOpFunctor.
>     
>     I can try to (quickly, thus dangerously) fix up this patch before 
> shipping, but I'd prefer not to as to avoid breaking any current ISAs. 
> Alternatively I can push it as-is, then when I have more time in a few weeks, 
> I (hopefully with input from Joel) can try to unify the design properly, and 
> move the data out of the functors and into the packets.
> 
> Curtis Dunham wrote:
>     This type of approach seems entirely incompatible with an external memory 
> interface like SST (which I maintain the wrapper for).  I agree with the data 
> going into the packet instead of the functor, but then you're still left with 
> the functor, which is an overly general interface.  Out of curiosity, how 
> many of these operations would there be?  I think the cleanest thing would be 
> for them to all be their own MemCmd - would we be talking about just a few 
> more commands, or dozens?
> 
> Tony Gutierrez wrote:
>     I believe there are 22 atomic RMW operations in the HSAIL ISA our GPU 
> model executes; clearly this would lead to too many new MemCmd types, and 
> increase the complexity of the logic for performing atomic RMWs. I think the 
> design of simply specifying the MemCmd as an atomic RMW (SwapReq), and then 
> just executing the functor, which would then perform the actual operation, is 
> a much cleaner design. And this is doable becaue the atomic RMW ops have the 
> same API. I guess I'm not understanding why the use of a general function ptr 
> here would cause a problem for the SST interface - which I know next to 
> nothing about admittedly.
>     
>     If you have a function pointer available in the packet, but aren't using 
> it at all, would this still break things?
>     
>     If all current, necessary RMW (only CAS afaik) were to be implemented as 
> member methods of request, then the atomic op function pointer would just be 
> set to the appropriate member method based on which atomic is actually being 
> executed, thereby restoring self sufficiency of the packet/req. Would that 
> still break the ineroperability?
>     
>     I understand why moving data out of the packet breaks things, but I'm 
> still not clear as to why using a function pointer to call the actual atomic 
> op would break things.
> 
> Brad Beckmann wrote:
>     Curtis and Andreas, we need to check in this patch.  Like Tony said, this 
> patch does not break the external memory interface and SST.  Yes, it is 
> incompatible, but if you don't use the functor, nothing should break.  Please 
> stop demanding perfection the first time a feature is implemented.  We are 
> committed to further improving the feature and we will make more of the 
> changes suggested by you and Joel.  However we to move on from this initial 
> patch.  This is an open source project and none of us are being paid to 
> develop perfect gem5 code.
>     
>     In general, please keep in mind that our team rarely (if ever) demands 
> perfect code from you.  We would appreciate if you reciprocated.
> 
> Andreas Hansson wrote:
>     We are not asking for perfection, we are asking for a design that takes a 
> balanced approach in accommodating existing and future use-cases.
>     
>     I am increasingly convinced that we should just go for encoding of the 
> operations in the packet MemCmd, possibly with the addition of a MemSubCmd. 
> For example, looking at the ARM atomic operations, they are split into atomic 
> compare, atomic swap, atomic load and atomic store. The two latter operations 
> are then combined with a sub-operation, like XOR, ADD, MAX, MIN etc. I'd 
> imagine this largely overlaps with the operations you are introducing. Don't 
> you think? The implementation could be put in a src/mem/atomic_helpers.hh/cc 
> or similar and called by the AbstractMemory, or any caches.
>     
>     Keeping everything in the packet is relly the key concern here, and to 
> answer Tony's question, the current bridges rely on a lookup translation of 
> the fields on the packet. No functions are being called "through" the bridge. 
> I hope that makes things more clear.
> 
> Tony Gutierrez wrote:
>     That is fine. We're open to changing the design going forward - I was 
> already planning on doing so, and it sounds like our respective design 
> proposals would be very similar - and yes, I'm sure the design could converge 
> for all ISA atomics. Doing so, however, is going to be a fairly significant 
> effort. Given that it will take time to converge on the design, then 
> implement it, we'd prefer not to have that hold up this patch, thereby 
> holding up our GPU code.
>     
>     I understand the functions are not being called, I'm just curious how 
> adding a new field to the packet would break that. I was imagining that a 
> gem5 packet would be embedded in some sort of 'packet' SST et al. could 
> consume, and SST would know how to interpret gem5's packet fields. In such a 
> case why can't you just ignore that function pointer field? Since you're not 
> using any atomic ops, it would always be null anyhow. When we eventually 
> clean this up, the possible functions the pointer could point to would be 
> included in Request.
>     
>     The only difference between what I was envisioning, and what you're 
> describing, is that I'd have all atomic op methods be members of reqeust. The 
> request atomicOpFunctor would then point to one of its own members, based on 
> which the ISA isntruction sets it to when it creates the request - since the 
> atomic op type is encoded in the instruction. Memory/cache controllers would 
> then just call atomicOpFunctor. This avoids adding more MemCmds (introducing 
> MemSubCmds) and adding lots of if() checks on the MemCmd in the controllers, 
> but forces the introduction of the member atomic methods.
>     
>     I think a good compromise going forward would be the following:
>     
>     1) Add some sort of atomic rmw sub cmd and have the instruction set this 
> upon req creation.
>     2) Introduce src/mem/atomic_helpers.hh/cc, define all possible atomic 
> RMWs here as methods.
>     3) Have a LUT of function pointers - in mem_helpers.hh/cc - that has 
> function pointers to each of the atomic helper functions.
>     4) Have the memory controllers index into the atomic helpers LUT - based 
> on the sub cmd type - to execute the sub atomic.
>     
>     This would require only the additon of the sub cmds to the reques/packet. 
> Does this seem reasonable to you? If so I think this would be very doable, 
> but as a separate, later patch once we get our GPU code in.
> 
> Tony Gutierrez wrote:
>     Andreas, Curtis: do you have any thoughts on my proposal? Thanks.
> 
> Curtis Dunham wrote:
>     You are right that the existing use-cases for external memory (e.g. with 
> SST) wouldn't be using these atomic ops and so we could just assert that the 
> function pointer is null and everything would be fine.  The problem we're 
> trying to address is not making it impossible to create a mapping so an 
> external memory system like SST could deal with it.  The fundamental problem 
> was the opacity of the function pointer -- if the operations can be 
> labeled/enumerated, then that should be sufficient... and I think your 
> proposal accomplishes this.  One thing I don't understand (and at this point 
> it's not a big deal, but a curiosity certainly) is why we would avoid adding 
> new MemCmd's but the notion of adding a SubCmd is fine?  Obviously if SST did 
> it the other way, hypothetically, the mapping would be trivial -- just trying 
> to understand the design philosophy. But yes this sounds like a much better 
> long term approach than leaving it just as a functor.
>     
>     The way that you want to have each operation be members of request, I 
> don't completely understand this design vs. having say an abstract inner 
> class of Request that each atomic op implements.  This class could just as 
> well provide the SubCmd enum value that it corresponds to.  I may be 
> misunderstanding your explanation, though.  (admittedly this is just "how you 
> dispatch your polymorphism", and virtual functions feel more natural to me 
> than pointers to member functions)

There are many ways to accomplish what we need, which is why I would rather put 
it off for another patch at a later time where we can more thoroughly discuss 
and vet the design, rather than have me delay this patch or hack something 
together quickly.

The reason I didn't want to add a bunch of MemCmds was too prevent changing the 
packet/req more than necessary - primarily to avoid causing you guys issues. I 
had to go through significant effort to refactor many of our patches - to undo 
such changes - to get them throug the review process prevously. So I was 
anticipating a lot of push back had I done something like that in order to 
remove our use of functors. Andreas was the one who mentioned MemSubCmd and I 
figured if he was ok with it, then it would be fine. I don't have an strong 
opinion either way.

Note that making the atomic ops members of request, while keeping the function 
pointers, was my original intention. My latest proposal is essentially what 
Andreas described verbatim, with the addition of the LUT to the 
atomic_helpers.[hh/cc] in order to avoid if checks to figure out which atomic 
helper to call. Having dervied request types is also another option, but again 
I was trying to avoid perturbing the "self-sufficiency" of the request/packet 
interface.

- Tony

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/3205/#review7813
-----------------------------------------------------------

On Jan. 6, 2016, 1:12 p.m., Tony Gutierrez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/3205/
> -----------------------------------------------------------
> 
> (Updated Jan. 6, 2016, 1:12 p.m.)
> 
> 
> Review request for Default.
> 
> 
> Repository: gem5
> 
> 
> Description
> -------
> 
> Changeset 11280:f5e328bd5233
> ---------------------------
> mem: support for gpu-style RMWs in ruby
> 
> This patch adds support for GPU-style read-modify-write (RMW) operations in
> ruby. Such atomic operations are traditionally executed at the memory 
> controller
> (instead of through an L1 cache using cache-line locking).
> 
> Currently, this patch works by propogating operation functors through the 
> memory
> system.
> 
> 
> Diffs
> -----
> 
>   src/base/types.hh d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
>   src/mem/abstract_mem.cc d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
>   src/mem/packet.hh d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
>   src/mem/protocol/RubySlicc_Exports.sm 
> d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
>   src/mem/request.hh d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
> 
> Diff: http://reviews.gem5.org/r/3205/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Tony Gutierrez
> 
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev

Re: [gem5-dev] Review Request 3205: mem: support for gpu-style RMWs in ruby

Reply via email to