JosephTheOctonaut commented on PR #80:
URL: https://github.com/apache/tvm-rfcs/pull/80#issuecomment-1162481071

   Awesome stuff, @masahi !
   
   Specific questions:
   
   - I don't think I saw it explicitly mentioned, but it is assumed/required 
that the backend executes committed chunks in order, right? E.g., if the 
program commits a chunk for i=0 to i=5 before hitting a `wait`, the hardware 
can't decide to execute the i=5 chunk first.
   - The values in `"software_pipeline_async_stages"` refer to elements of the 
`"software_pipeline_stage"` list, correct? I found this a bit confusing, 
because the asynchronous "stages" are not the same as the pipeline "stages". 
(E.g., a pipeline with `"software_pipeline_stage": [0, 0, 0, 1]` could have 
`"software_pipeline_async_stages": [2, 3]` even though there are no pipeline 
stages 2 and 3.)
   - The definition of `async_commit_stage(i)` is to "Group one or more 
invocation of async operations...". Is the model here that all uncommitted 
async operations at that point in the program are committed? Furthermore, are 
"async operations" defined as any statement in a pipeline stage that was 
identified as async?
     - I initially found it confusing that `async_commit_stage(i)` could appear 
inside and outside of an `async_scope:` section. It makes sense that it could 
be both, because (assuming my above assumptions are correct) it doesn't have to 
do with any lexical scoping, operating instead on the "hidden state" of 
whatever async operations are uncommitted. No specific question here, but maybe 
a more intuitive grouping is possible that would be syntactically visible.
   - Is it correct that `async_wait_stage` will only block the "main" thread of 
execution? For example, if (on a particular backend) an async pipeline stage is 
executed by another thread, and `async_wait_stage` appeared in that stage, 
which thread would it block? I'm assuming the answer is "whatever the backend 
lowering does," but I was curious if there was an intended semantics.
   
   Small doc nits for clarity:
   
   - `for k0 in range(125):` I think the index should be `i` here
   - `B_local[0] <- B_shared[(I + 1) % 4, ...]` capital `I` used instead of `i`
   - `async_scoppe` instead of `async_scope`
   
   
   Higher-level thoughts:
   
One concern is adding to TIR an async system that is too narrow. You discuss 
this quite a bit and make a lot of good points. This system clearly 
works very well for CUDA, but if there isn't a good transform to token-based 
systems, do we end up with two async systems in TIR? Not trying to suggest this 
should not be incorporated, just genuinely curious what the long-term plans 
might be.
   
   >Importantly, we are not trying to propose a “general async semantics to 
TIR”. Rather the goal is to come up with an async design specifically for the 
TIR software pipeline transform.
   
   This can be a bit difficult, as it would make sense for future systems to 
try to use the existing async system to whatever extent possible. Again, not 
trying to suggest a specific course of action, just pointing out that there's a 
difficult general-vs-specific trade-off here.
   
Regarding interoperability with a token-based model: you make an excellent 
point that a token system has expressibility issues in a TIR program, because 
it has to refer to a specific statement at a specific loop iteration. But, it's 
also more intuitive. This contrasts with the implicit system here, which feels 
unintuitive, but is quite succinct and easily expressible. 
   
   I'm still working through what a translation to/from a token system might 
look like, but I'm currently thinking that they're much closer than I initially 
thought. In either case, what we want (at a high level) is a way to refer to a 
specific chunk, blocking execution until that chunk has finished. Your comment 
about keeping a buffer in the token case made me realize that it ends up pretty 
similar: a token system waiting for chunk _C_ from _N_ iterations previous 
might use `wait(C, i-N)`, which looks pretty similar to the `wait(C, N)` we 
would use here. (Just my preliminary thoughts; I'm not sure if there's a nicer 
abstraction to use in TIR that caters to both systems.)
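   To make that comparison concrete, here is a toy sketch (notation and names are mine, purely illustrative) checking that at iteration `i`, the token-style `wait(C, i-N)` and the counting-style `wait(C, N)` pick out the same chunk:

   ```python
   # Toy comparison of the two addressing schemes for the same chunk.
   # At iteration i of the pipelined loop, both forms should name the
   # chunk committed N iterations earlier.
   NUM_ITERS = 8
   N = 2  # pipeline depth / wait distance

   committed = []  # chunk ids in commit order

   def wait_token(chunk_id):
       """Token style: block on an explicit (stage, iteration) handle."""
       return chunk_id

   def wait_count(n):
       """Counting style: block until only n chunks remain in flight,
       i.e. everything up to chunk len(committed) - 1 - n has retired."""
       return committed[len(committed) - 1 - n]

   matches = []
   for i in range(NUM_ITERS):
       committed.append(("C", i))  # commit the chunk for iteration i
       if i >= N:
           # Both forms resolve to the chunk from iteration i - N.
           matches.append(wait_token(("C", i - N)) == wait_count(N))
   print(all(matches))  # True
   ```

   The equivalence leans on the in-order-retirement assumption from my first question; without it, `wait_count` could not be read as naming a specific chunk.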
   
   It's interesting that you mention that there's no easy translation from 
tokens to counting (based on MLIR not having implemented one?), but you suspect 
the reverse could be simple. Does this suggest that the token system has less 
information encoded than the counting system? (I.e., we can go counting --> 
tokens but not the reverse because we lost information in the transformation.) 
Or is it just specifics of a PTX-like system, not a "counting system" in 
general, that make the translation to it hard?
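   Sketching one direction of that translation helped me think about the question: lowering a counting-style wait into a token-style wait seems mechanical *if* the backend retires chunks in commit order, because a single token then dominates everything older than it. (Again, all names below are mine, and in-order retirement is an assumption.)

   ```python
   # Sketch of lowering a counting-style wait onto a token-style backend,
   # assuming chunks retire in commit order. Identifiers are illustrative.
   tokens = []   # token handle recorded at each commit, in order
   waited = []   # token waits actually issued to the backend

   def commit():
       tok = len(tokens)  # fresh token per commit group
       tokens.append(tok)
       return tok

   def token_wait(tok):
       waited.append(tok)

   def count_wait(n):
       # "At most n chunks in flight" means the (n+1)-th newest commit has
       # retired; with in-order retirement, waiting on that one token also
       # covers every older chunk.
       if len(tokens) > n:
           token_wait(tokens[len(tokens) - 1 - n])

   for i in range(4):
       commit()
   count_wait(1)    # wait until only one chunk remains outstanding
   print(waited)    # [2]
   ```

   The reverse direction seems harder precisely because an arbitrary token wait need not correspond to any "all but the newest n" prefix, which may be the loss of information you are pointing at.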

