altanh commented on issue #8579: URL: https://github.com/apache/tvm/issues/8579#issuecomment-888686075
After analysis we believe this regression is due to a combination of changes in #8486 and an inefficient loop in the `check_grad` function:

1. #8486 replaced the previous global (or thread-local) `CompileEngine` with a fresh `TECompiler` per `Interpreter` instance. This means the previous behavior of caching lowered functions (for given input types) globally across all interpreters no longer holds.
2. In `check_grad` at https://github.com/apache/tvm/blob/720e7b1ebd9b789a1100dee7536d0633c7941dd1/python/tvm/relay/testing/__init__.py#L163, notice that the forward function is recompiled by the interpreter **executor** for **every element of the input tensor**. It is important to note that the interpreter executor actually creates a new `Interpreter` object for every evaluated expression (and hence loses the cache), causing a complete recompile on each evaluation.

## Proposed Solution

The second point can easily be rectified by hoisting the forward function evaluation out of the hot loop, since the function does not change; it only needs to be evaluated once per device and target combination (see the sketch below). This will immediately fix the extreme regression. A PR will be posted soon.

Regarding the first point, we would like to move away from relying on hidden global caching for performance as much as possible, as this has caused confusion in the past. Thus we will not modify the new interpreter behavior.
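For illustration, here is a minimal sketch of the hoisting fix. This is not the actual `check_grad` code: the `numeric_grad` helper, its signature, and the single-input finite-difference loop are simplified stand-ins, and the executor API may differ slightly across TVM versions.

```python
import numpy as np
import tvm
from tvm import relay

def numeric_grad(fwd_func, inp, eps=1e-4, target="llvm", dev=tvm.cpu(0)):
    """Central-difference gradient of a scalar-summed relay function."""
    # Hoisted: evaluate (i.e. compile) the forward function ONCE per
    # device/target combination. Previously the equivalent
    # create_executor(...).evaluate(...) call sat inside the loop below,
    # constructing a fresh Interpreter (and thus a fresh TECompiler with
    # an empty cache) per element, forcing a full recompile every time.
    fwd = relay.create_executor(device=dev, target=target).evaluate(fwd_func)

    grad = np.zeros_like(inp)
    grad_flat = grad.reshape(-1)
    inp_flat = inp.reshape(-1)
    for i in range(inp_flat.size):  # hot loop: two forward calls per element
        saved = inp_flat[i]
        inp_flat[i] = saved + eps
        plus = fwd(inp).numpy().sum()   # older TVM: .asnumpy()
        inp_flat[i] = saved - eps
        minus = fwd(inp).numpy().sum()
        inp_flat[i] = saved
        grad_flat[i] = (plus - minus) / (2 * eps)
    return grad
```

With the evaluation hoisted, the per-element calls reuse the same `Interpreter` (and its compiler cache), so lowering happens once instead of once per tensor element.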
