areusch commented on a change in pull request #7785:
URL: https://github.com/apache/tvm/pull/7785#discussion_r614264731



##########
File path: src/target/source/codegen_c_host.cc
##########
@@ -40,13 +40,16 @@ namespace codegen {
 
 CodeGenCHost::CodeGenCHost() { module_name_ = GetUniqueName("__tvm_module_ctx"); }
 
-void CodeGenCHost::Init(bool output_ssa, bool emit_asserts, std::string target_str) {
+void CodeGenCHost::Init(bool output_ssa, bool emit_asserts, bool is_aot_executor,
+                        std::string target_str) {

Review comment:
   >> it might be harder for us to handle device functions that do not
directly correspond to a symbol
   >Could you give an example of such a function?
   
   I think @tqchen is referring to functions which offload compute to devices,
e.g. functions produced by a backend other than `llvm` or `c`, so BYOC for
µTVM, or CUDA, Vulkan, etc. on a traditional OS.
   
   The thing is, as I mentioned in the discuss forum post, these functions are
actually launched from the `target_host` (e.g. the CPU) by calling a closure
returned from `GetFunction`. On a traditional OS, this `GetFunction` + closure
pair serves both to program the accelerator and to launch the compute (the
split between the two is unspecified, but typically `GetFunction` does the
programming).
   
   In micro-land, I don't recommend we adopt any sort of closures, and I also
think it's likely that the "programming" part will happen as part of
compilation or as part of Module initialization. So I'd recommend we assume
that we'll move the "accelerator programming" step _at least_ into some
Module-init function; it may even be handled outside of the executor. Given
that, `GetFunction` is essentially a no-op, and it makes sense to me to presume
that the PackedFunc impl will launch and block on the accelerated compute in
these cases. As we move to handling asynchronous execution, we may need to
modify this interface (but that will likely amount to a "launch" and a "sync"
function, so not too different).
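
   To make the launch/sync split concrete, here's a minimal sketch.
Everything below is hypothetical -- `accel_launch`, `accel_sync`, and
`accel_add_blocking` are illustrative names, not part of any TVM API -- it
just shows how a synchronous PackedFunc-style entry point could decompose into
the two calls described above:

```cpp
#include <cstdint>

// Hypothetical sketch only: none of these names exist in the TVM runtime.
// It illustrates the "launch" + "sync" split, with the accelerator mocked
// by plain host code.
static int32_t g_accel_result;  // stands in for device-side output memory

// "launch": kick off the accelerated compute (mocked as immediate here).
static void accel_launch(int32_t a, int32_t b) { g_accel_result = a + b; }

// "sync": block until the accelerator finishes and fetch the result.
static int32_t accel_sync() { return g_accel_result; }

// A synchronous PackedFunc-style entry point is then just launch + sync.
int32_t accel_add_blocking(int32_t a, int32_t b) {
  accel_launch(a, b);
  return accel_sync();
}
```

   Under an asynchronous interface, the executor would issue `accel_launch`
and defer `accel_sync` until the result is actually needed.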
   
   >> Given the potential need to have a different path in the embedded
setting (to avoid string lookup and use symbols), we should clarify and
document what the CRT standard is so that the new code generator can refer to
it
   >So, about this, for me the "runtime" of the graph is a triplet of:
   >
   >Calling convention (registry, dynamic or plain)
   >Backend functions (platform specific vs defined in TVM)
   >Graph execution (aot vs graph)
   
   "Runtime" has been pretty overloaded in TVM, so given the recent PR to
rename to GraphExecutor, I prefer we just use "runtime" to refer to the
`c_runtime_api.h` and `c_backend_api.h` implementation and its dependencies.
GraphExecutor is built on top of that, as AOTExecutor should be, so I think
"runtime" should exclude Executors.
   
   I agree it's going to be an impossibly long PR if we try to fit in the full
general AOT integration, and I don't think we should do that. I suggest that
here, we just implement the logic needed to produce the top-level TIR function
and test it, and then consider things like changes to runtime memory
management in follow-on PRs. This might be a step backwards, but if it
simplifies things, we could even switch to emitting `tir.call_packed_lowered`
in the AOT codegen and then implement the direct call in a follow-on PR.
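
   Since `call_packed_lowered` comes up a few times here, a rough sketch of
what a lowered packed call expands to in generated code. The
`TVMBackendGetFuncFromEnv` / `TVMFuncCall` signatures follow
`c_backend_api.h` / `c_runtime_api.h`, but the bodies and the `fused_add`
operator are mock stand-ins I invented so the example compiles on its own:

```cpp
#include <cstdint>
#include <cstring>

// Mock stand-ins for the runtime types (the real ones live in
// c_runtime_api.h); trimmed to what the example needs.
typedef union { int64_t v_int64; double v_float64; void* v_handle; } TVMValue;
typedef void* TVMFunctionHandle;

// Mock operator kernel in the packed-call signature; "fused_add" is an
// invented name for illustration.
static int mock_fused_add(TVMValue* args, int* type_codes, int num_args,
                          TVMValue* ret, int* ret_tcode) {
  (void)type_codes; (void)num_args;
  ret->v_int64 = args[0].v_int64 + args[1].v_int64;
  *ret_tcode = 0;  // kTVMArgInt
  return 0;
}

// Mock of TVMBackendGetFuncFromEnv: the real impl walks the module's import
// tree doing the string lookup that the plain calling convention avoids.
int TVMBackendGetFuncFromEnv(void* mod_node, const char* func_name,
                             TVMFunctionHandle* out) {
  (void)mod_node;
  if (std::strcmp(func_name, "fused_add") == 0) {
    *out = reinterpret_cast<void*>(mock_fused_add);
    return 0;
  }
  return -1;
}

// Mock of TVMFuncCall: dispatch through the retrieved handle.
int TVMFuncCall(TVMFunctionHandle func, TVMValue* arg_values, int* type_codes,
                int num_args, TVMValue* ret_val, int* ret_type_code) {
  typedef int (*PackedCFunc)(TVMValue*, int*, int, TVMValue*, int*);
  return reinterpret_cast<PackedCFunc>(func)(arg_values, type_codes, num_args,
                                             ret_val, ret_type_code);
}

// Roughly what generated code for a lowered packed call to "fused_add"
// expands to: look the handle up once, cache it, then call through it.
int64_t call_fused_add(void* mod_ctx, int64_t a, int64_t b) {
  static TVMFunctionHandle cached = nullptr;
  if (cached == nullptr) {
    TVMBackendGetFuncFromEnv(mod_ctx, "fused_add", &cached);
  }
  TVMValue args[2], ret;
  int tcodes[2] = {0, 0};
  int ret_tcode = 0;
  args[0].v_int64 = a;
  args[1].v_int64 = b;
  TVMFuncCall(cached, args, tcodes, 2, &ret, &ret_tcode);
  return ret.v_int64;
}
```

   The `TVMBackendGetFuncFromEnv` lookup is exactly the string-based
indirection that a direct-call convention would skip by referencing the
symbol itself.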
   
   >So, I am not sure we can come up on day 1 with a full general AOT
integration. I would say:
   >
   >Let's stick with the current AOT definition for now (which is for sure
more general than we had in mind at the beginning :) )
   >I will remove the is_aot from the codegen, but I will use a more generic
use_plain_calling_convention
   >Let's document the exact definition of CRT and AOT and let's try to
generalize later
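
   For concreteness, a hypothetical sketch of the flag rename proposed above,
using a mock class (the real one lives in `codegen_c_host.cc`; all member
names here are illustrative, not the final API):

```cpp
#include <string>
#include <utility>

// Mock CodeGenCHost sketching the rename from the executor-specific
// `is_aot_executor` flag to a more generic calling-convention flag.
class CodeGenCHost {
 public:
  void Init(bool output_ssa, bool emit_asserts,
            bool use_plain_calling_convention,  // was: is_aot_executor
            std::string target_str) {
    output_ssa_ = output_ssa;
    emit_asserts_ = emit_asserts;
    use_plain_calling_convention_ = use_plain_calling_convention;
    target_str_ = std::move(target_str);
  }
  bool use_plain_calling_convention() const {
    return use_plain_calling_convention_;
  }

 private:
  bool output_ssa_ = false;
  bool emit_asserts_ = false;
  bool use_plain_calling_convention_ = false;
  std::string target_str_;
};
```

   The point of the generic name is that any codegen path (AOT or otherwise)
could opt into direct symbol calls without being tied to one executor.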
   
   I think this makes sense--let's clarify a couple of points:
    - definition of AOT: just the piece that generates the top-level TIR
function to serve as an Executor
    - definition of CRT: the `c_runtime_api.h` and `c_backend_api.h`
implementation and its dependencies
   
   Now as to what we should put in this PR to move forward--the rest of this 
comment is my proposal, feel free to debate.
   
   I'd suggest we just limit this PR to the top-level TIR function which serves 
as part of the Executor implementation (e.g. `Executor#Run`). That would mean 
deferring:
    - Memory management changes--just use `TVMBackendAllocWorkspace` in this PR 
even for intermediate graph tensors
    - firmware-facing API--just enough to make test cases for the top-level TIR 
function run
    - calling convention: we can use what you guys have put together here, or
even emit `tir.call_packed_lowered` nodes if that's easier. I think the two
are about equivalent, aside from the fact that `call_packed_lowered` uses
`TVMBackendGetFuncFromEnv`.
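
   As a mocked illustration of the first deferral above: intermediate tensors
in the generated top-level function just go through
`TVMBackendAllocWorkspace`. The two backend signatures match
`c_backend_api.h`, but their bodies here are malloc/free stubs (the real CRT
implementation is platform-defined), and `run_square_then_sum` is an invented
stand-in for a generated `Executor#Run` body:

```cpp
#include <cstdint>
#include <cstdlib>

// Mock stubs with the c_backend_api.h signatures; bodies are malloc/free
// stand-ins so the example is self-contained.
void* TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes,
                               int dtype_code_hint, int dtype_bits_hint) {
  (void)device_type; (void)device_id;
  (void)dtype_code_hint; (void)dtype_bits_hint;
  return std::malloc(nbytes);
}

int TVMBackendFreeWorkspace(int device_type, int device_id, void* ptr) {
  (void)device_type; (void)device_id;
  std::free(ptr);
  return 0;
}

// Invented two-op "graph" (square, then sum) standing in for a generated
// Executor#Run body: the intermediate tensor between the ops is workspace-
// allocated and freed, rather than statically planned.
// (device_type 1 = kDLCPU; dtype hints 2/32 = float32)
float run_square_then_sum(const float* input, int n) {
  float* scratch = static_cast<float*>(
      TVMBackendAllocWorkspace(1, 0, sizeof(float) * n, 2, 32));
  for (int i = 0; i < n; ++i) scratch[i] = input[i] * input[i];  // op 1
  float acc = 0.f;
  for (int i = 0; i < n; ++i) acc += scratch[i];                 // op 2
  TVMBackendFreeWorkspace(1, 0, scratch);
  return acc;
}
```

   A later GraphPlanMemory-based change would replace the alloc/free pair with
statically planned buffers, without touching the op calls themselves.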
   
   Near-future directions (i.e. things implemented here that I'm not saying we
should drop, just move to the next PR):
   - Memory management: let's discuss this soon and implement the change as a 
GraphPlanMemory change. That should get us to a solution that works for all 
executors and is extensible should users wish to implement their own allocation 
scheme. I do not view the existing page-based allocator as particularly fit for 
production standalone inference.
   - PackedFunc invocation from other PackedFuncs: let's discuss this with an
eye to both BYOC and `c`/`llvm` modules, and come up with an efficient
solution that makes sense for a pure-C world. This mostly doesn't impact the
firmware-facing API, so let's initially discuss it separately.
   - firmware-facing API: Let's also discuss this, either as part of your 
initial RFC or as a new thread. 
   
   Would this make sense to you guys as a way forward? Happy to discuss other 
options too. I do think it'd be better to merge a part of this PR and move 
forward with the others, rather than create one giant PR and discuss all of the 
pieces at once. At the same time, I do realize we should ensure we're working 
towards a vision that makes sense for microcontrollers, which involves many 
pieces given how different the deployment environment is from traditional OS.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
