PhilippvK opened a new issue #8953:
URL: https://github.com/apache/tvm/issues/8953


   I realized that two models (the `int8` quantized `vww` and `aww` models of the [MLPerf 
Tiny benchmark](https://github.com/mlcommons/tiny)), which worked fine 
with the graph executor a few months ago, stopped working with the latest TVM 
version.
   
   ### Expected behavior
   
   Running the built models (on x86 for simplicity) via the CRT graph runtime 
should not result in any error messages, segmentation faults, or other crashes.
   
   ### Actual behavior
   
   The following is printed to the terminal during the initialization of the 
graph executor (i.e. while parsing the JSON graph):
   
   ```
   Error: string size greater than buffer size (120).
   Error at line 331, JSON object expect '}' or ',' but got '"'
   Error at line 331, Expect '"' but reach end of line
   Error at line 331, Expect ':' but get '"'
   do not support key  
   invalid format
   failed to load an element in `nodes` field in graph executor node.
   invalid format
   Segmentation fault (core dumped)
   ```
   
   The actual error is in the first line; the others just seem to be 
consequences of that initial problem.
   
   ### Investigations
   
   The big question to me was why this is only happening for these two models. 
Since the error message is about string sizes, I figured out that it is related 
to the `char func_name[120];` array in `struct TVMOpParam` (see 
[include/tvm/runtime/crt/graph_executor.h](https://github.com/apache/tvm/blob/9b034d729f80f6f19106ae435221e4c48df44618/include/tvm/runtime/crt/graph_executor.h#L39)).
   
   Looking at the `graph.json` file, I found that there are indeed function 
names slightly longer than 120 characters. I could also get rid of the 
error temporarily by limiting the maximum number of fused ops (using 
`relay.FuseOps.max_depth`), which results in shorter function names.
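
   As a minimal sketch of that workaround (the `max_depth` value and the target string are my own assumptions for illustration, not the exact setup from this report):

   ```python
   # Sketch of the temporary workaround: limiting the fusion depth keeps the
   # fused function names shorter. The max_depth value and the target are
   # assumptions for illustration only.
   import tvm
   from tvm import relay


   def build_with_limited_fusion(mod, params, max_depth=10):
       with tvm.transform.PassContext(
           opt_level=3, config={"relay.FuseOps.max_depth": max_depth}
       ):
           # mod/params are the Relay module and parameters, e.g. as returned
           # by relay.frontend.from_tflite(); adjust the target to your setup.
           return relay.build(mod, target="c", params=params, mod_name="default")
   ```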
   
   Here is the code used by TVM to generate unique truncated function names by 
appending a hash:
   
   ```cpp
       readable_name_stream_ << "fused";
       auto outputs = this->VisitExpr(prim_func->body);
       auto candidate_name = readable_name_stream_.str();
       constexpr static size_t kMaxFuncNameLength = 80;
       if (candidate_name.size() > kMaxFuncNameLength) {
         std::stringstream truncated_name;
         truncated_name << candidate_name.substr(0, kMaxFuncNameLength);
         truncated_name << "_" << std::hash<std::string>{}(candidate_name) << "_";
         candidate_name = truncated_name.str();
       }
   ```
   
   Then I did the following calculation based on this (example) JSON 
`func_name` entry: 
`tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_9959535092109263429__2`
   
   ```
     120 (maximum allowed function name length)
   -  80 (truncated original function name)
   -  19 (length of the appended hash)
   -   5 ('fused' prefix)
   -   4 (total number of '_' characters)
   =  12 (remaining characters for the model name, INCLUDING the 'tvmgen_' prefix which requires another 7 characters)
   ```
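
   The same budget expressed as a quick sanity check (numbers taken from the breakdown above):

   ```python
   # Quick sanity check of the budget calculation above.
   MAX_FUNC_NAME = 120             # char func_name[120] in TVMOpParam
   TRUNCATED_NAME = 80             # kMaxFuncNameLength
   HASH_LENGTH = 19                # digits of the appended hash in the example
   FUSED_PREFIX = len("fused")     # 5
   SEPARATORS = 4                  # '_' characters joining the pieces
   TVMGEN_PREFIX = len("tvmgen_")  # 7

   budget = MAX_FUNC_NAME - TRUNCATED_NAME - HASH_LENGTH - FUSED_PREFIX - SEPARATORS
   print(budget)                   # 12 characters left, including 'tvmgen_'
   print(budget - TVMGEN_PREFIX)   # 5 characters left for the model name itself
   ```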
   
   This leads to the conclusion that the `mod_name` passed to 
`relay.build()` may have at most 5 characters, which is 
suboptimal, especially considering that the default name seems to be `default` 
(i.e. `tvmgen_default`), which already exceeds that limit.
   
   I suppose the easiest possible fixes would be either
   - increasing the size of the aforementioned `func_name[]` array by a few 
characters, or
   - reducing `kMaxFuncNameLength` by a few characters.
   
   In addition, it might be good to limit the length of the model name that can 
be specified by the user, so that overly long model names cannot break this 
again in the future.
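
   As an illustration of that idea, a hypothetical check (this helper and the concrete limit are mine, not existing TVM behaviour) could look like:

   ```python
   # Hypothetical validation (not existing TVM code): reject model names that
   # would push the generated function names past the CRT buffer limit.
   MAX_MOD_NAME_LENGTH = 5  # derived from the calculation above


   def check_mod_name(mod_name: str) -> None:
       if len(mod_name) > MAX_MOD_NAME_LENGTH:
           raise ValueError(
               f"mod_name '{mod_name}' has {len(mod_name)} characters, but the "
               f"CRT graph executor currently only leaves room for "
               f"{MAX_MOD_NAME_LENGTH}"
           )
   ```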
   
   Why does this only happen for a few models? Most operator/function names 
after the `FuseOps` transformation do not exceed `kMaxFuncNameLength` and 
therefore do not need to be truncated, which keeps them below the 120-character limit.
   
   ### Environment
   
   - OS: Ubuntu 20.04
   - Architecture: x86_64
   - Python version: 3.8.10
   - TVM version: based on commit 9b034d729f80f6f19106ae435221e4c48df44618 from 
yesterday
   - Target: MicroTVM (Standalone) using CRT graph executor/runtime 
([`apps/bundle_deploy/bundle_static.c`](https://github.com/apache/tvm/blob/main/apps/bundle_deploy/bundle_static.c))
   - Script: See next section!
   
   ### Steps to reproduce
   
   Here is a [branch](https://github.com/PhilippvK/tvm/tree/bug-demo) that I set up 
to reproduce this issue using the aforementioned TFLite models.
   
   Step-by-step instructions for reproducing the issue can be found 
[here](https://github.com/PhilippvK/tvm/blob/bug-demo/BUG.md).
   
   There is also a [GitHub 
Action](https://github.com/PhilippvK/tvm/blob/bug-demo/.github/workflows/bug.yml)
 which runs the demonstration of the bug. The log output can be investigated 
here: https://github.com/PhilippvK/tvm/runs/3531514621?check_suite_focus=true
   

