JosephTheOctonaut commented on PR #80:
URL: https://github.com/apache/tvm-rfcs/pull/80#issuecomment-1162481071
Awesome stuff, @masahi !
Specific questions:
- I don't think I saw it explicitly mentioned, but it is assumed/required
that the backend executes committed chunks in order, right? E.g., if the
program commits a chunk for i=0 to i=5 before hitting a `wait`, the hardware
can't decide to execute the i=5 chunk first.
- The index `"software_pipeline_async_stages"` refers to elements of the
`"software_pipeline_stage"` list, correct? I found this a bit confusing,
because the asynchronous "stages" are not the same as the pipeline "stages".
(e.g., a pipeline with `"software_pipeline_stage", [0,0,0,1]` could have
`"software_pipeline_async_stages", [2,3]` even though there are no pipeline
stages 2 and 3)
- `async_commit_stage(i)` is defined to "Group one or more
invocation of async operations...". Is the model here that all uncommitted
async operations at that point in the program are committed? Furthermore, are
"async operations" defined as any statement in a pipeline stage that was
identified as async?
- I initially found it confusing that `async_commit_stage(i)` could appear
inside and outside of an `async_scope:` section. It makes sense that it could
be both, because (assuming my above assumptions are correct) it doesn't have to
do with any lexical scoping, operating instead on the "hidden state" of
whatever async operations are uncommitted. No specific question here, but maybe
a more intuitive grouping is possible that would be syntactically visible.
- Is it correct that `async_wait_stage` will only block the "main" thread of
execution? For example, if (on a particular backend) an async pipeline stage is
executed by another thread, and `async_wait_stage` appeared in that stage,
which thread would it block? I'm assuming the answer is "whatever the backend
lowering does," but I was curious if there was an intended semantics.
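To pin down my reading of the first and third questions, here is a toy Python model (entirely my own sketch; `issue`/`commit`/`wait` are hypothetical names, not the RFC's API): `commit` groups all currently uncommitted async operations into one chunk on a queue, and `wait` retires committed chunks strictly in program order.

```python
from collections import deque

# Toy model of commit/wait semantics as I currently read them (my own
# sketch, not the RFC's definition). All uncommitted async ops are
# grouped into one chunk at commit time, and committed chunks retire
# strictly front-to-back: the i=5 chunk can never finish before i=0..4.

class AsyncQueue:
    def __init__(self):
        self.uncommitted = []   # async ops issued since the last commit
        self.pending = deque()  # committed chunks, in program order

    def issue(self, op):
        self.uncommitted.append(op)

    def commit(self):
        self.pending.append(self.uncommitted)
        self.uncommitted = []

    def wait(self, in_flight):
        """Block until at most `in_flight` committed chunks remain."""
        retired = []
        while len(self.pending) > in_flight:
            retired.append(self.pending.popleft())  # in order only
        return retired

q = AsyncQueue()
for i in range(6):
    q.issue(f"copy[{i}]")
    q.commit()              # one chunk per iteration
print(q.wait(in_flight=2))  # retires the chunks for i=0..3, in order
```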
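On the second question, the two readings I was weighing can be spelled out in a few lines of Python (my own illustration, not TVM code), which is where the index-vs-value ambiguity shows up:

```python
# Two possible readings of "software_pipeline_async_stages" (hypothetical
# helper code, just to pin down the ambiguity). The confusion: do the
# numbers index into the per-statement stage list, or name stage values?

stage_per_stmt = [0, 0, 0, 1]  # "software_pipeline_stage"
async_stages = [2, 3]          # "software_pipeline_async_stages"

# Reading A: indices into the stage list, so statements 2 and 3 are
# async even though no statement is in pipeline stage 2 or 3.
async_by_index = [i for i in async_stages if i < len(stage_per_stmt)]

# Reading B: stage values, so nothing is async here, since no statement
# has pipeline stage 2 or 3.
async_by_value = [i for i, s in enumerate(stage_per_stmt)
                  if s in async_stages]

print(async_by_index)  # -> [2, 3]
print(async_by_value)  # -> []
```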
Small doc nits for clarity:
- `for k0 in range(125):` I think the index should be `i` here
- `B_local[0] <- B_shared[(I + 1) % 4, ...]` capital `I` used instead of `i`
- `async_scoppe` instead of `async_scope`
Higher-level thoughts:
One concern is that the async system being moved into TIR may be too
narrow. You discuss this quite a bit, and make a lot of good points. This
system clearly
works very well for CUDA, but if there isn't a good transform to token-based
systems, do we end up with two async systems in TIR? Not trying to suggest this
should not be incorporated, just genuinely curious what the long-term plans
might be.
> Importantly, we are not trying to propose a “general async semantics to
> TIR”. Rather the goal is to come up with an async design specifically for the
> TIR software pipeline transform.
This can be a bit difficult, as it would make sense for future systems to
try to use the existing async system to whatever extent possible. Again, not
trying to suggest a specific course of action, just pointing out that there's a
difficult general-vs-specific trade-off here.
Regarding interoperability with a token-based model: you make an excellent
point that a token system has expressibility issues in a TIR program, because
it has to refer to a specific statement at a specific loop iteration. But, it's
also more intuitive. This contrasts with the implicit system here, which feels
unintuitive, but is quite succinct and easily expressible.
I'm still working through what a translation to/from a token system might
look like, but I'm currently thinking that they're much closer than I initially
thought. In either case, what we want (at a high level) is a way to refer to a
specific chunk, blocking execution until that chunk has finished. Your comment
about keeping a buffer in the token case made me realize that it ends up pretty
similar: a token system waiting for chunk _C_ from _N_ iterations previous
might use `wait(C, i-N)`, which looks pretty similar to the `wait(C, N)` we
would use here. (Just my preliminary thoughts; I'm not sure if there's a nicer
abstraction to use in TIR that caters to both systems.)
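To test that intuition, here is a toy translation (entirely my own speculation, not the RFC design) that implements a token-style `wait(C, i - N)` on top of a counting-style queue: if chunks commit once per iteration and retire in order, an absolute chunk index converts directly into an in-flight count.

```python
# Toy sketch of the token/counting correspondence (my speculation, not
# the RFC design). A counting wait says "at most N chunks still in
# flight"; a token wait says "the chunk with index k has retired".

class CountingQueue:
    def __init__(self):
        self.committed = 0  # chunks committed so far
        self.retired = 0    # chunks retired so far (always in order)

    def commit(self):
        self.committed += 1

    def wait_count(self, in_flight):
        # Counting form: leave at most `in_flight` chunks pending.
        self.retired = max(self.retired, self.committed - in_flight)

    def wait_token(self, chunk_index):
        # Token form: wait until chunk `chunk_index` (0-based) retires.
        # Translated to counting: in_flight = committed - (chunk_index + 1).
        self.wait_count(self.committed - (chunk_index + 1))

q = CountingQueue()
for i in range(8):
    q.commit()
q.wait_token(4)   # token form: wait for the i=4 chunk ...
print(q.retired)  # -> 5: chunks 0..4 have retired
q.wait_count(1)   # counting form: at most 1 chunk in flight
print(q.retired)  # -> 7
```

Note the translation in `wait_token` only works because retirement is in order; that one-directional dependence may be exactly where a general token system stops being inter-derivable with a counting one.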
It's interesting that you mention that there's no easy translation from
tokens to counting (based on MLIR not having implemented one?), but you suspect
the reverse could be simple. Does this suggest that the token system has less
information encoded than the counting system? (I.e., we can go counting -->
tokens but not the reverse because we lost information in the transformation.)
Or is it just specifics of a PTX-like system, not a "counting system" in
general, that make the translation to it hard?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]