PhilippvK opened a new issue #8953: URL: https://github.com/apache/tvm/issues/8953
I realized that two models (`int8` quantized `vww` and `aww` of the [MLPerf Tiny benchmark](https://github.com/mlcommons/tiny)) which worked fine using the graph executor a few months ago have stopped working with the latest TVM version.

### Expected behavior

Running the built models (on x86 for simplicity) via the CRT graph runtime should not result in any error messages, segmentation faults, or other crashes.

### Actual behavior

The following is printed to the terminal during initialization of the graph executor (i.e. while parsing the JSON graph):

```
Error: string size greater than buffer size (120).
Error at line 331, JSON object expect '}' or ',' but got '"'
Error at line 331, Expect '"' but reach end of line
Error at line 331, Expect ':' but get '"'
do not support key invalid format
failed to load an element in `nodes` field in graph executor node.
invalid format
Segmentation fault (core dumped)
```

The actual error is the first line; the others just seem to be consequences of that initial problem.

### Investigations

The big question to me was: why is this only happening for these two models? Since the error message is about string sizes, I figured out that it is related to the `char func_name[120];` array in `struct TVMOpParam` (see [include/tvm/runtime/crt/graph_executor.h](https://github.com/apache/tvm/blob/9b034d729f80f6f19106ae435221e4c48df44618/include/tvm/runtime/crt/graph_executor.h#L39)). Looking at the `graph.json` file, I found that there are in fact function names with a length slightly larger than 120 characters. I could also get rid of the error temporarily by limiting the maximum number of fused ops (using `relay.FuseOps.max_depth`), which results in shorter function names.
Here is the code used by TVM to generate unique truncated function names by appending a hash:

```c++
readable_name_stream_ << "fused";
auto outputs = this->VisitExpr(prim_func->body);
auto candidate_name = readable_name_stream_.str();
constexpr static size_t kMaxFuncNameLength = 80;
if (candidate_name.size() > kMaxFuncNameLength) {
  std::stringstream truncated_name;
  truncated_name << candidate_name.substr(0, kMaxFuncNameLength);
  truncated_name << "_" << std::hash<std::string>{}(candidate_name) << "_";
  candidate_name = truncated_name.str();
}
```

Then I did the following calculation based on this (example) JSON `func_name` entry: `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_9959535092109263429__2`

```
  120 (max allowed function length)
-  80 (truncated original function name)
-  19 (length of appended hash)
-   5 ('fused' prefix)
-   4 (total number of '_' characters)
=  12 (remaining number of characters for the model name, INCLUDING the
       'tvmgen_' prefix, which requires another 7 characters)
```

This leads to the conclusion that the `mod_name` passed to the `relay.build()` function can be at most 5 characters long, which is suboptimal, especially considering that the default name seems to be `default` (i.e. `tvmgen_default`). I suppose that the easiest possible fixes would be either:

- increasing the size of the aforementioned `func_name[]` array by a few characters, or
- reducing `kMaxFuncNameLength` by a few characters.

In addition, it might be good to limit the length of the user-specified model name to make sure that this does not break again for overly long model names in the future.

Why does this only happen for a few models?
- Most operator/function names after the `FuseOps` transformation do not exceed `kMaxFuncNameLength` and therefore do not need to be truncated.
### Environment

- OS: Ubuntu 20.04
- Architecture: x86_64
- Python version: 3.8.10
- TVM version: based on commit 9b034d729f80f6f19106ae435221e4c48df44618 from yesterday
- Target: MicroTVM (Standalone) using the CRT graph executor/runtime ([`apps/bundle_deploy/bundle_static.c`](https://github.com/apache/tvm/blob/main/apps/bundle_deploy/bundle_static.c))
- Script: see next section!

### Steps to reproduce

Here is a [branch](https://github.com/PhilippvK/tvm/tree/bug-demo) I set up to reproduce this issue using the aforementioned TFLite models. Step-by-step instructions for reproducing the issue can be found [here](https://github.com/PhilippvK/tvm/blob/bug-demo/BUG.md). There is also a [GitHub Action](https://github.com/PhilippvK/tvm/blob/bug-demo/.github/workflows/bug.yml) which runs the demonstration of the bug. The log output can be inspected here: https://github.com/PhilippvK/tvm/runs/3531514621?check_suite_focus=true
