Johnson9009 opened a new pull request, #15187:
URL: https://github.com/apache/tvm/pull/15187

   When an RPC server runs on an NPU board, a compiled model will sometimes 
hang the NPU because of buggy operator libraries in the NPU toolchain, so we 
must use `session_timeout` to ensure that board resources held by hung jobs 
are released.
   
   Currently the RPC server handles the session timeout error poorly: it 
simply kills the server loop subprocess, and the destructor of class 
`RPCEndpoint` then sends the code `kShutdown` to the RPC client. The RPC 
client, however, expects to receive either `kReturn` or `kException`, so users 
see an error message like the one reported in 
https://github.com/apache/tvm/issues/15151. This error report is confusing 
and gives users no indication of what happened.
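
   The mismatch described above can be illustrated with a toy client-side 
handler. This is only a sketch, not TVM's actual client code: the `RPCCode` 
enum and `handle_reply` function below are hypothetical stand-ins, and the 
enum values are illustrative rather than taken from TVM.

```python
from enum import Enum, auto


class RPCCode(Enum):
    # Hypothetical stand-ins for the codes named above; the numeric values
    # are illustrative and are not taken from TVM.
    kShutdown = auto()
    kReturn = auto()
    kException = auto()


def handle_reply(code, payload=None):
    """Toy client-side handler that accepts only kReturn or kException."""
    if code is RPCCode.kReturn:
        return payload
    if code is RPCCode.kException:
        # The server's own message reaches the user unchanged.
        raise RuntimeError(payload)
    # Any other code, such as kShutdown, looks like a protocol violation to
    # the client, so the user sees a generic error with no mention of a timeout.
    raise RuntimeError(f"unexpected RPC code: {code.name}")
```

   In this sketch a `kShutdown` reply can only surface as a generic 
"unexpected code" failure, whereas a `kException` reply carries a message 
that can name the timeout explicitly.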
   
   When tuning to search for a good schedule for operators, we want to ignore 
only the RPC session timeout error, which indicates that the generated 
schedule is an illegal one. Other errors reported by the RPC server may help 
us find potential bugs in our toolchain built on top of TVM, so the RPC session 
timeout error should be split out into a standalone TVM error class.
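
   A minimal sketch of why a standalone error class helps during tuning, 
assuming a dedicated exception type exists. The names `RPCError`, 
`RPCSessionTimeoutError`, `evaluate_candidates`, and `run_remote` are 
hypothetical and used only for illustration; they are not confirmed TVM APIs.

```python
class RPCError(RuntimeError):
    """Hypothetical base class for errors reported by the RPC server."""


class RPCSessionTimeoutError(RPCError):
    """Hypothetical dedicated class for RPC session timeouts."""


def evaluate_candidates(candidates, run_remote):
    """Run each candidate remotely, skipping only those that time out.

    Returns a (results, timed_out) pair. Any other exception propagates so
    that genuine toolchain bugs are not silently hidden.
    """
    results, timed_out = {}, []
    for name in candidates:
        try:
            results[name] = run_remote(name)
        except RPCSessionTimeoutError:
            # Treat a timeout as an illegal schedule and move on.
            timed_out.append(name)
    return results, timed_out
```

   Because only the timeout class is caught, every other server-side error 
still stops the loop and remains visible to the developer.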
   
   This PR implements these requirements by sending the RPC session timeout 
error message to the RPC client as an RPC server exception before killing the 
server loop subprocess.
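
   The ordering described above (report the error first, then kill the 
worker) can be sketched with the standard library alone. This is not the 
PR's implementation: `run_with_session_timeout`, the `send` callback, and 
the message format are assumptions made for illustration.

```python
import subprocess
import sys


def run_with_session_timeout(cmd, timeout, send):
    """Run cmd; on timeout, report the error first, then kill the process.

    `send` is a hypothetical callback that stands in for writing a reply to
    the RPC client. Returns True if the process finished within the timeout.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Tell the client what happened before the worker goes away.
        message = f"RPCSessionTimeoutError: session exceeded {timeout} seconds"
        send(("kException", message))
        proc.kill()
        proc.wait()  # reap the killed process
        return False
    return True
```

   For example, calling it with 
`[sys.executable, "-c", "import time; time.sleep(30)"]`, a timeout of `0.5`, 
and `sent.append` as the callback leaves one `kException` entry in `sent` 
and returns `False`.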

