milandesai opened a new issue #10438: Loading an older model with a custom
operator from Java/Scala more than twice causes MXNet to crash
MXNet 1.1.0 with Scala
Our Scala application uses MxNet and includes several tests. Many of these
tests load the same model files, so as a result the same process ends up
loading a model multiple times. This normally works fine, but when upgrading
from 0.11.0 to 1.1.0 we discovered an odd issue causing our tests to fail. We
discovered that if you have a model that uses a custom operator and was built
from an mxnet version older than the current one (thus prompting the
"Attempting to upgrade..." message), and you load it more than twice in the
same process, then on the third reload it crashes.
Based on my investigation, the relevant lines appear to be:
- `legacy_json_util.cc#UpgradeJSON_FixParsing (line 30)`: This method is run
whenever a model older than the current version is loaded. It invokes the
attribute parser for each op.
- `legacy_json_util.cc#UpgradeJSON_Parse (line 92)`: This method is always
run. It also invokes the attribute parser for each op
- `custom.cc#AttrParser (line 80)`: This line resets the shared pointer to
the custom operator's callback list. Note that it also sets up a deleter
function that unregisters the op by invoking the function in the next bullet.
- `ml_dmlc_mxnet_native_c_api.cc#opPropDel (line 2426)`: This function
unregisters the custom operator. However, there is logic here that tracks how
many times this function has been invoked. It only deregisters the custom op on
the _third_ invocation.
Suppose we are running MXNet 1.1.0 and load a model built on 0.11.0. Because
of the version mismatch, both of the UpgradeJSON stages are run, each one
running the attribute parser for each op. When the custom op's attribute parser
is run the second time, the shared pointer to the operator's callback list is
reset, triggering the unregister function to be invoked. This function, for
now, simply updates a counter. The counter now has value 1.
Now suppose during the same process, we load the same model again. The same
scenario as above happens again, except this time the counter is incremented to
The third time the model is loaded, again the same scenario reoccurs, except
this time the counter is already at value 2, so the unregister function
actually unregisters the custom op rather than simply updating the counter. Our
custom operator is no longer registered, so shortly thereafter we get an
"Operator [op-name] is not registered" fatal error.
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
Apache Git Services