[GitHub] [tvm] mbs-octoml commented on a diff in pull request #11358: WIP: [Virtual Machine] Implementation of 'set_output_zero_copy'

GitBox Wed, 18 May 2022 13:52:07 -0700


mbs-octoml commented on code in PR #11358:
URL: https://github.com/apache/tvm/pull/11358#discussion_r876367929



##########
src/runtime/vm/vm.cc:
##########
@@ -521,13 +603,25 @@ int64_t VirtualMachine::LoadScalarInt(Index r) const {
   return result;
 }
 
-void VirtualMachine::RunLoop() {
+Index VirtualMachine::GetResultRegisterIndex() const {
+  Index op_ind = 0;
+  Instruction instr;
+  // TODO(vvchernov): can it be endless loop?
+  do {
+    instr = code_[op_ind++];
+  } while (instr.op == Opcode::Ret);
+
+  return instr.result;
+}
+
+void VirtualMachine::RunLoop(bool set_output_enabled) {
   ICHECK(this->exec_);
   ICHECK(this->code_);
   pc_ = 0;
   Index frame_start = frames_.size();
-  while (true) {
-  main_loop:
+  Index res_reg_index = GetResultRegisterIndex();

Review Comment:
   For tuple results we'd need to redirect the allocs for the tuple indexes, 
which may be scattered quite widely within the code. So that suggests the 
compiler should record metadata for all that.
   
   But I'm wondering if it would be better to bite-the-bullet and switch the VM 
to DPS. After that only Invoke-without-outputs would be the special case, 
requiring inspection of metadata to alloc the output tensors, make the call, 
and optionally construct the tuple result. I know that's a much bigger change, 
and I guess we could make that change after this PR given the invoke APIs will 
be the same.
   
   Is there a high pressure customer use case which justifies that two step 
approach?
   



##########
src/runtime/vm/vm.cc:
##########
@@ -537,7 +631,7 @@ void VirtualMachine::RunLoop() {
         from_obj = ReadRegister(instr.from);
         WriteRegister(instr.dst, from_obj);
         pc_++;
-        goto main_loop;
+        break;

Review Comment:
   nit: This weird control flow may be the result of someone trying to improve 
I$ behavior so we should check. I recall there's a known issue with the VM 
being slightly slower than the GraphExecutor due to cache issues, but I suspect 
that's more likely to be due to D$ effects with the extra indirections around 
registers or something.



##########
src/runtime/vm/vm.cc:
##########
@@ -521,13 +603,25 @@ int64_t VirtualMachine::LoadScalarInt(Index r) const {
   return result;
 }
 
-void VirtualMachine::RunLoop() {
+Index VirtualMachine::GetResultRegisterIndex() const {
+  Index op_ind = 0;
+  Instruction instr;
+  // TODO(vvchernov): can it be endless loop?
+  do {
+    instr = code_[op_ind++];
+  } while (instr.op == Opcode::Ret);

Review Comment:
   != Opcode::Ret



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [tvm] mbs-octoml commented on a diff in pull request #11358: WIP: [Virtual Machine] Implementation of 'set_output_zero_copy'

Reply via email to