## Description
`mxnet-mkl` hangs indefinitely when a process that uses mxnet spawns
subprocesses that also use mxnet, in an environment with MKL-optimized numpy.
We recently started observing this with Sockeye. It may be related to #8532,
but it can be reproduced without Sockeye (see below).
## Environment info (Required)
- Python 3.6.6
- macOS
- mxnet-mkl==1.3.0.post0
- Anaconda numpy (with MKL optimization): `conda install mkl; conda install numpy`
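
When comparing the failing and working setups, it can help to record exactly which distributions are installed and how numpy was built. The following is a minimal sketch, not part of the original report. It assumes Python 3.8 or newer for `importlib.metadata` (newer than the 3.6.6 interpreter listed above), and the helper names `installed_versions` and `print_numpy_build_info` are hypothetical. `numpy.show_config()` prints numpy's build configuration, which usually names the BLAS/LAPACK libraries it was linked against.

```python
import importlib.metadata
import platform

# Distribution names taken from this report; adjust as needed.
PACKAGES = ("numpy", "mxnet", "mxnet-mkl")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def print_numpy_build_info():
    """Print numpy's build configuration; return False if numpy is missing."""
    try:
        import numpy
    except ImportError:
        return False
    numpy.show_config()
    return True


def main():
    print("Python", platform.python_version(), "on", platform.platform())
    for name, version in installed_versions().items():
        print(name, version or "not installed")
    print_numpy_build_info()
```

Calling `main()` once in the failing environment and once in a working one gives two reports that can be compared side by side.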
## Minimum reproducible example
The following code reliably reproduces the deadlock/indefinite hang in the main
process.
It creates a minimal module and 'trains' for 500 iterations, spawning itself in
'testing mode' every 100 iterations. The testing mode is the same mxnet code,
run for fewer iterations. The main process is supposed to wait until the
previous subprocess finishes before starting the next one.
code.py:
```python
import subprocess
import sys

import mxnet as mx

if __name__ == '__main__':
    # Any command-line argument switches the script into 'testing mode'.
    if len(sys.argv) > 1:
        print("TESTING")
        test = True
        iterations = 50
    else:
        print("TRAINING")
        test = False
        iterations = 500

    # Minimal network: one fully connected layer with a softmax output.
    x = mx.sym.Variable('x')
    y = mx.sym.Variable('y')
    sym = mx.sym.FullyConnected(x, num_hidden=5)
    sym = mx.sym.SoftmaxOutput(sym, y)

    x_data = mx.nd.uniform(0, 1, (32, 16))
    y_data = mx.nd.zeros((32, 5))
    batch = mx.io.DataBatch(data=[x_data], label=[y_data])

    mod = mx.mod.Module(sym, data_names=['x'], label_names=['y'])
    mod.bind(data_shapes=[mx.io.DataDesc('x', shape=x_data.shape)],
             label_shapes=[mx.io.DataDesc('y', shape=y_data.shape)],
             for_training=True, grad_req='write' if not test else 'null')
    mod.init_params()
    mod.init_optimizer()

    process = None
    for i in range(iterations):
        mod.forward(batch)
        if not test:
            mod.backward()
            mod.update()
        if i % 100 == 0 and i > 0:
            print(i)
            if not test:
                # Wait for the previous subprocess before starting a new one.
                if process:
                    print("Waiting for process")
                    process.wait()
                cmd = [sys.executable, sys.argv[0], 'test']
                print("Starting process: '%s'" % " ".join(cmd))
                process = subprocess.Popen(cmd)

    # Wait for the last subprocess before exiting.
    if process:
        process.wait()
```
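
To narrow down where the main process is stuck, one option (not something tried in the original report) is Python's standard `faulthandler` module, which can dump the Python tracebacks of all threads once a timeout expires. The sketch below uses hypothetical helper names, `arm_hang_watchdog` and `disarm_hang_watchdog`, and the 120-second default is an arbitrary assumption.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_seconds=120.0, stream=None, repeat=False):
    """Dump the tracebacks of all threads if the process is still running
    after `timeout_seconds`. The stream must have a file descriptor and
    must stay open until the watchdog is cancelled."""
    faulthandler.dump_traceback_later(
        timeout_seconds,
        repeat=repeat,
        file=stream if stream is not None else sys.stderr,
        exit=False,
    )


def disarm_hang_watchdog():
    """Cancel a pending traceback dump, e.g. after a clean finish."""
    faulthandler.cancel_dump_traceback_later()
```

A possible use is to call `arm_hang_watchdog()` at the top of the `if __name__ == '__main__':` block in `code.py` and `disarm_hang_watchdog()` after the final `process.wait()`. Note that `faulthandler` reports Python-level frames only, so a hang inside native code shows up as the Python call that entered it.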
## Steps to reproduce
1. conda install mkl
2. conda install numpy
3. pip install mxnet-mkl
4. python3 code.py
## What have you tried to solve it?
Either replacing `mxnet-mkl` with `mxnet`, or replacing conda numpy with
pip-installed numpy (`conda uninstall numpy; conda uninstall mkl; pip install
numpy`), resolves the issue, and the output is as expected:
```
TRAINING
100
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
200
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
300
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
400
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
TESTING
```
[ Full content available at:
https://github.com/apache/incubator-mxnet/issues/12710 ]