milandesai opened a new issue #10438: Loading an older model with a custom 
operator from Java/Scala more than twice causes MXNet to crash
URL: https://github.com/apache/incubator-mxnet/issues/10438
 
 
   ## Version
   MXNet 1.1.0 with Scala
   
   ## Description
   Our Scala application uses MXNet and includes several tests. Many of these 
tests load the same model files, so the same process ends up loading a model 
multiple times. This normally works fine, but after upgrading from 0.11.0 to 
1.1.0 some of our tests started to fail. We found that if a model uses a custom 
operator and was built with an MXNet version older than the current one (thus 
prompting the "Attempting to upgrade..." message), and you load it more than 
twice in the same process, then the third load crashes.
   
   ## Analysis
   Based on my investigation, the relevant lines appear to be:
   
   - `legacy_json_util.cc#UpgradeJSON_FixParsing (line 30)`: This method is run 
whenever a model older than the current version is loaded. It invokes the 
attribute parser for each op.
   - `legacy_json_util.cc#UpgradeJSON_Parse (line 92)`: This method is always 
run. It also invokes the attribute parser for each op.
   - `custom.cc#AttrParser (line 80)`: This line resets the shared pointer to 
the custom operator's callback list. Note that it also sets up a deleter 
function that unregisters the op by invoking the function in the next bullet.
   - `ml_dmlc_mxnet_native_c_api.cc#opPropDel (line 2426)`: This function 
unregisters the custom operator. However, there is logic here that tracks how 
many times this function has been invoked. It only deregisters the custom op on 
the _third_ invocation.
   
   Suppose we are running MXNet 1.1.0 and load a model built on 0.11.0. Because 
of the version mismatch, both of the UpgradeJSON stages are run, each one 
running the attribute parser for each op. When the custom op's attribute parser 
is run the second time, the shared pointer to the operator's callback list is 
reset, triggering the unregister function to be invoked. This function, for 
now, simply updates a counter. The counter now has value 1.
   
   Now suppose during the same process, we load the same model again. The same 
scenario as above happens again, except this time the counter is incremented to 
2.
   
   The third time the model is loaded, the same scenario recurs, but the 
counter is already at 2, so the unregister function actually unregisters the 
custom op instead of just updating the counter. Our custom operator is no 
longer registered, so shortly thereafter we get an 
"Operator [op-name] is not registered" fatal error.
