Re: [gem5-dev] Review Request 3205: mem: support for gpu-style RMWs in ruby

Tony Gutierrez Wed, 13 Jan 2016 09:12:07 -0800


> On Jan. 6, 2016, 1:39 p.m., Andreas Hansson wrote:
> > I still find that the design needs a far better explanation, and at least 
> > one simple example. I would suggest implementing the SwapReq as an atomic 
> > op (as I guess it is our only example). Also, surely we need to impose 
> > restrictions on the granularity? If we use SwapReq as an example then 
> > things would be much more clear.
> > 
> > I am also not sure about Joel's comments having been addressed.
> 
> Tony Gutierrez wrote:
>     Our GPU code has lots of examples for AtomicOps, and the concept is 
> pretty simple: use function pointers to perform the atomic RMW op instead of 
> adding a bunch of conditions to check all possible SwapReqs. Here is how our 
> GPU CAS is implemented, and you can see all AtomicOps in gpu_dyn_inst.hh:
>     
>     template<typename T>
>     class AtomicOpCAS : public TypedAtomicOpFunctor<T>
>     {
>       public:
>         T c;
>         T s;
>     
>         ComputeUnit *computeUnit;
>     
>         AtomicOpCAS(T _c, T _s, ComputeUnit *compute_unit)
>           : c(_c), s(_s), computeUnit(compute_unit) { }
>     
>         void
>         execute(T *b)
>         {
>             computeUnit->numCASOps++;
>     
>             if (*b == c) {
>                 *b = s;
>             } else {
>                 computeUnit->numFailedCASOps++;
>             }
>     
>             if (computeUnit->xact_cas_mode) {
>                 computeUnit->xactCasLoadMap.clear();
>             }
>         }
>     };
>     
>     A call to the functor will call execute(). In this case the ISA inst 
> objects set the swap value 's' based on one of its src operands, and if the 
> current value 'b' hasn't changed from the orignial 'c' (another src operand), 
> b gets s.
>     
>     I think implementing the SwapReq as an AtomicOp would be as simple as 
> moving the logic in if (pkt->cmd == MemCmd::SwapReq) in abstract_mem.cc to a 
> derived AtomicOpFunctor class and calling it, but I would prefer not to make 
> that change part of this patch. I'm not sure how the AtomicOpFunctor would be 
> setup by the CPU ISAs here and I don't have the time or the infrastructure to 
> test the changes sufficiently to ensure they work properly for the CPU ISAs 
> that use this (ARM and SPARC?), so I think it'd be safer to leave it as it is 
> for now.
>     
>     Getting this patch in is necessary for us to push our GPU code, which is 
> pretty important to lots of people in the community - at least based on the 
> e-mails from users I've been getting - and I would hope this issue wouldn't 
> hold that up.
>     
>     As far as restricting the granularity, I can't imagine a case where one 
> would try to perform a RMW on a size greater than a cache line, but to ensure 
> it I can add a fatal if the size is greater than a cache block.
> 
> Andreas Hansson wrote:
>     I am still trying to come to terms with the overall design choices, and 
> my feeling is that this opens up for lots of complexity (or at least 
> non-transparency).
>     
>     My main hesitation is that this completely departs from the existing gem5 
> self-sufficient request/packet structure, and by doing so makes wrapping in 
> TLM2 or SST more or less impossible. The bridging is really important, and so 
> to justify a change like this we need to carefully consider the pros and cons.
>     
>     What is the actual data payload of the packet in your example? Does it 
> not make more sense to at least tie these operations to the request or packet 
> (not just by pointers), and that way make sure that at least the payload 
> reflects the current state?
> 
> Tony Gutierrez wrote:
>     Ah, your issue is clear to me now. The actual payload, unfortunately, IS 
> the functor. As I mentioned to Joel previously, these carry the data around 
> with them, which is why the original author used functors instead of function 
> pointers. This is also something I have been meaning to fix, but haven't had 
> time.
>     
>     If I move all the functors to the request would you be ok with me pushing 
> this now? The ultimate goal being that the req would carry a regular function 
> pointer to a member method around, and the data payload will be part of the 
> packet data/extra data, and will be moved out of the functors.
> 
> Joel Hestness wrote:
>     Yep, it sounds like Andreas is interested in the same thing as me. The 
> thing is, if we changed this patch as Tony offers above, we still need the 
> function pointer in the request or packet metadata, so it doesn't really 
> address the issue. It's more busy work, so I'd rather just advocate for 
> addressing this in a later patch.
>     
>     I forgot to mention previously that gem5-gpu has code for packing atomics 
> into packet payload (line ~355: 
> http://gem5-gpu.cs.wisc.edu/repo/gem5-gpu/file/92d1e15704a4/src/gpu/lsq_warp_inst_buffer.cc
>  ). The process is a little tricky and uses data size rather than templates 
> there, but might be instructive when we get to that.
> 
> Andreas Hansson wrote:
>     I'd be far more comfortable with these changes, but I would also like the 
> people that have contributed the SST and TLM wrappers to have a look. I 
> really worry about interoperability here.
> 
> Tony Gutierrez wrote:
>     We agree with you in principle on the design not being great, and as I 
> mentioned this is somethign that was already on my 'nice to have' list, but 
> it wasn't high priorty. Now that we understand your conerns better, it has 
> become higher priority. That said, I think our code is isolated relatively 
> well at this point, so it shoudn't affect anything as of now. The pointer 
> will be carried around by the request, but if you're not running GPU code it 
> shouldn't matter - unless the current CAS were to be implemented as an 
> AtomicOpFunctor.
>     
>     I can try to (quickly, thus dangerously) fix up this patch before 
> shipping, but I'd prefer not to as to avoid breaking any current ISAs. 
> Alternatively I can push it as-is, then when I have more time in a few weeks, 
> I (hopefully with input from Joel) can try to unify the design properly, and 
> move the data out of the functors and into the packets.
> 
> Curtis Dunham wrote:
>     This type of approach seems entirely incompatible with an external memory 
> interface like SST (which I maintain the wrapper for).  I agree with the data 
> going into the packet instead of the functor, but then you're still left with 
> the functor, which is an overly general interface.  Out of curiosity, how 
> many of these operations would there be?  I think the cleanest thing would be 
> for them to all be their own MemCmd - would we be talking about just a few 
> more commands, or dozens?
> 
> Tony Gutierrez wrote:
>     I believe there are 22 atomic RMW operations in the HSAIL ISA our GPU 
> model executes; clearly this would lead to too many new MemCmd types, and 
> increase the complexity of the logic for performing atomic RMWs. I think the 
> design of simply specifying the MemCmd as an atomic RMW (SwapReq), and then 
> just executing the functor, which would then perform the actual operation, is 
> a much cleaner design. And this is doable becaue the atomic RMW ops have the 
> same API. I guess I'm not understanding why the use of a general function ptr 
> here would cause a problem for the SST interface - which I know next to 
> nothing about admittedly.
>     
>     If you have a function pointer available in the packet, but aren't using 
> it at all, would this still break things?
>     
>     If all current, necessary RMW (only CAS afaik) were to be implemented as 
> member methods of request, then the atomic op function pointer would just be 
> set to the appropriate member method based on which atomic is actually being 
> executed, thereby restoring self sufficiency of the packet/req. Would that 
> still break the ineroperability?
>     
>     I understand why moving data out of the packet breaks things, but I'm 
> still not clear as to why using a function pointer to call the actual atomic 
> op would break things.
> 
> Brad Beckmann wrote:
>     Curtis and Andreas, we need to check in this patch.  Like Tony said, this 
> patch does not break the external memory interface and SST.  Yes, it is 
> incompatible, but if you don't use the functor, nothing should break.  Please 
> stop demanding perfection the first time a feature is implemented.  We are 
> committed to further improving the feature and we will make more of the 
> changes suggested by you and Joel.  However we to move on from this initial 
> patch.  This is an open source project and none of us are being paid to 
> develop perfect gem5 code.
>     
>     In general, please keep in mind that our team rarely (if ever) demands 
> perfect code from you.  We would appreciate if you reciprocated.
> 
> Andreas Hansson wrote:
>     We are not asking for perfection, we are asking for a design that takes a 
> balanced approach in accommodating existing and future use-cases.
>     
>     I am increasingly convinced that we should just go for encoding of the 
> operations in the packet MemCmd, possibly with the addition of a MemSubCmd. 
> For example, looking at the ARM atomic operations, they are split into atomic 
> compare, atomic swap, atomic load and atomic store. The two latter operations 
> are then combined with a sub-operation, like XOR, ADD, MAX, MIN etc. I'd 
> imagine this largely overlaps with the operations you are introducing. Don't 
> you think? The implementation could be put in a src/mem/atomic_helpers.hh/cc 
> or similar and called by the AbstractMemory, or any caches.
>     
>     Keeping everything in the packet is relly the key concern here, and to 
> answer Tony's question, the current bridges rely on a lookup translation of 
> the fields on the packet. No functions are being called "through" the bridge. 
> I hope that makes things more clear.
> 
> Tony Gutierrez wrote:
>     That is fine. We're open to changing the design going forward - I was 
> already planning on doing so, and it sounds like our respective design 
> proposals would be very similar - and yes, I'm sure the design could converge 
> for all ISA atomics. Doing so, however, is going to be a fairly significant 
> effort. Given that it will take time to converge on the design, then 
> implement it, we'd prefer not to have that hold up this patch, thereby 
> holding up our GPU code.
>     
>     I understand the functions are not being called, I'm just curious how 
> adding a new field to the packet would break that. I was imagining that a 
> gem5 packet would be embedded in some sort of 'packet' SST et al. could 
> consume, and SST would know how to interpret gem5's packet fields. In such a 
> case why can't you just ignore that function pointer field? Since you're not 
> using any atomic ops, it would always be null anyhow. When we eventually 
> clean this up, the possible functions the pointer could point to would be 
> included in Request.
>     
>     The only difference between what I was envisioning, and what you're 
> describing, is that I'd have all atomic op methods be members of reqeust. The 
> request atomicOpFunctor would then point to one of its own members, based on 
> which the ISA isntruction sets it to when it creates the request - since the 
> atomic op type is encoded in the instruction. Memory/cache controllers would 
> then just call atomicOpFunctor. This avoids adding more MemCmds (introducing 
> MemSubCmds) and adding lots of if() checks on the MemCmd in the controllers, 
> but forces the introduction of the member atomic methods.
>     
>     I think a good compromise going forward would be the following:
>     
>     1) Add some sort of atomic rmw sub cmd and have the instruction set this 
> upon req creation.
>     2) Introduce src/mem/atomic_helpers.hh/cc, define all possible atomic 
> RMWs here as methods.
>     3) Have a LUT of function pointers - in mem_helpers.hh/cc - that has 
> function pointers to each of the atomic helper functions.
>     4) Have the memory controllers index into the atomic helpers LUT - based 
> on the sub cmd type - to execute the sub atomic.
>     
>     This would require only the additon of the sub cmds to the reques/packet. 
> Does this seem reasonable to you? If so I think this would be very doable, 
> but as a separate, later patch once we get our GPU code in.
> 
> Tony Gutierrez wrote:
>     Andreas, Curtis: do you have any thoughts on my proposal? Thanks.
> 
> Curtis Dunham wrote:
>     You are right that the existing use-cases for external memory (e.g. with 
> SST) wouldn't be using these atomic ops and so we could just assert that the 
> function pointer is null and everything would be fine.  The problem we're 
> trying to address is not making it impossible to create a mapping so an 
> external memory system like SST could deal with it.  The fundamental problem 
> was the opacity of the function pointer -- if the operations can be 
> labeled/enumerated, then that should be sufficient... and I think your 
> proposal accomplishes this.  One thing I don't understand (and at this point 
> it's not a big deal, but a curiosity certainly) is why we would avoid adding 
> new MemCmd's but the notion of adding a SubCmd is fine?  Obviously if SST did 
> it the other way, hypothetically, the mapping would be trivial -- just trying 
> to understand the design philosophy. But yes this sounds like a much better 
> long term approach than leaving it just as a functor.
>     
>     The way that you want to have each operation be members of request, I 
> don't completely understand this design vs. having say an abstract inner 
> class of Request that each atomic op implements.  This class could just as 
> well provide the SubCmd enum value that it corresponds to.  I may be 
> misunderstanding your explanation, though.  (admittedly this is just "how you 
> dispatch your polymorphism", and virtual functions feel more natural to me 
> than pointers to member functions)
> 
> Tony Gutierrez wrote:
>     There are many ways to accomplish what we need, which is why I would 
> rather put it off for another patch at a later time where we can more 
> thoroughly discuss and vet the design, rather than have me delay this patch 
> or hack something together quickly.
>     
>     The reason I didn't want to add a bunch of MemCmds was too prevent 
> changing the packet/req more than necessary - primarily to avoid causing you 
> guys issues. I had to go through significant effort to refactor many of our 
> patches - to undo such changes - to get them throug the review process 
> prevously. So I was anticipating a lot of push back had I done something like 
> that in order to remove our use of functors. Andreas was the one who 
> mentioned MemSubCmd and I figured if he was ok with it, then it would be 
> fine. I don't have an strong opinion either way.
>     
>     Note that making the atomic ops members of request, while keeping the 
> function pointers, was my original intention. My latest proposal is 
> essentially what Andreas described verbatim, with the addition of the LUT to 
> the atomic_helpers.[hh/cc] in order to avoid if checks to figure out which 
> atomic helper to call. Having dervied request types is also another option, 
> but again I was trying to avoid perturbing the "self-sufficiency" of the 
> request/packet interface.
> 
> Andreas Hansson wrote:
>     I think the solution you propose sounds ok Tony. I am inclined to suggest 
> we don't put it off though, and get the re-work done before committing this, 
> at least the broad brush strokes of what you propose in your four items above.
> 
> Tony Gutierrez wrote:
>     Hi, Andreas and Curtis. We discussed this internally, and while we 
> appreciate your sentiment on this – and your suggestion to not put off our 
> proposed changes – we must insist on pushing this code as is. So unless you 
> strenuously object to this, e.g., this code would break functionality for you 
> immediately, we will push this tomorrow with a commitment to implement what I 
> proposed.
>     
>     We don't have time to commit to this now, and maintaining these patches 
> in and of itself is a lot of work, and delaying them only creates unnecessary 
> work for us, further delaying any progress. And again, I don’t want to break 
> other ISAs by hastily implementing the features I propose. I think the 
> benefit of getting the GPU model into users hands and in the code base, being 
> tested by regressions, is more critical at this point that fixing up the 
> atomic op implementation.
> 
> Andreas Hansson wrote:
>     From my perspective the total effort is constant (do it now or later), so 
> better just get it right. I'm not thrilled at the idea of the IOU, but I 
> appreciate the desire to move things forward. When can we expect this to be 
> addressed?
>     
>     Also, it would be good if the patch description and comments were updated 
> to reflect that this is generic atomic operations (not just GPU RMWs, and not 
> just Ruby), and the review discussion outcomes should probably be included 
> somehow, i.e. highlighting the issues and the envisioned solution.


Thanks for understanding. I agree the effort to complete the eventual 
implementation is constant, however we don't X amount of time to do it now. 
Putting off committing these essentially makes the amount of time needed X + 
whatever time it takes to keep rebasing/testing patches everytime the base 
changes, as well as delays getting important features into the code base.

I can certainly add the description/comments you requested before shipping this 
tomorrow. I anticipate I will have sufficient time to properly fix this code 
sometime next month.


- Tony


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/3205/#review7813
-----------------------------------------------------------


On Jan. 6, 2016, 1:12 p.m., Tony Gutierrez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/3205/
> -----------------------------------------------------------
> 
> (Updated Jan. 6, 2016, 1:12 p.m.)
> 
> 
> Review request for Default.
> 
> 
> Repository: gem5
> 
> 
> Description
> -------
> 
> Changeset 11280:f5e328bd5233
> ---------------------------
> mem: support for gpu-style RMWs in ruby
> 
> This patch adds support for GPU-style read-modify-write (RMW) operations in
> ruby. Such atomic operations are traditionally executed at the memory 
> controller
> (instead of through an L1 cache using cache-line locking).
> 
> Currently, this patch works by propogating operation functors through the 
> memory
> system.
> 
> 
> Diffs
> -----
> 
>   src/base/types.hh d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
>   src/mem/abstract_mem.cc d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
>   src/mem/packet.hh d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
>   src/mem/protocol/RubySlicc_Exports.sm 
> d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
>   src/mem/request.hh d9a0136ab8cc4b3cf4821d064140b857e60db0dd 
> 
> Diff: http://reviews.gem5.org/r/3205/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Tony Gutierrez
> 
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev

Re: [gem5-dev] Review Request 3205: mem: support for gpu-style RMWs in ruby

Reply via email to