## Description
`mxnet-mkl` hangs indefinitely when a process using mxnet spawns subprocesses 
that also use mxnet, in an environment with MKL-optimized NumPy. We recently 
started observing this with Sockeye. It may be related to #8532, but it can be 
reproduced without Sockeye (see below).

## Environment info (Required)
- Python 3.6.6
- macOS
- mxnet-mkl==1.3.0.post0
- Anaconda NumPy (with MKL optimization): `conda install mkl; conda install numpy`
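
To confirm that an environment matches the one listed above, the versions can be collected with a short script. This is a minimal sketch, not part of the original report: the helper name `describe_environment` is hypothetical, and it only reads each package's `__version__` attribute.

```python
import importlib
import platform
import sys


def describe_environment(packages=("numpy", "mxnet")):
    """Return a dict with the interpreter, OS, and package versions."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is missing from this environment.
            info[name] = "not installed"
        else:
            info[name] = getattr(module, "__version__", "unknown")
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("%s: %s" % (key, value))
```

In addition, `numpy.show_config()` prints the BLAS/LAPACK libraries NumPy was built against, which should indicate whether the installed NumPy is the MKL-linked build.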

## Minimum reproducible example
The following code reliably reproduces the deadlock/indefinite hang in the main 
process.
It creates a minimal module and 'trains' for 500 iterations, spawning itself in 
'testing mode' every 100 iterations. Testing mode runs the same mxnet code for 
fewer iterations (50) and without the backward pass and update. The main process 
is supposed to wait until the previous subprocess finishes before starting the 
next one.

code.py:
```python
import subprocess
import sys

import mxnet as mx

if __name__ == '__main__':

    if len(sys.argv) > 1:
        print("TESTING")
        test = True
        iterations = 50
    else:
        print("TRAINING")
        test = False
        iterations = 500

    x = mx.sym.Variable('x')
    y = mx.sym.Variable('y')

    sym = mx.sym.FullyConnected(x, num_hidden=5)
    sym = mx.sym.SoftmaxOutput(sym, y)

    x_data = mx.nd.uniform(0, 1, (32, 16))
    y_data = mx.nd.zeros((32, 5))
    batch = mx.io.DataBatch(data=[x_data], label=[y_data])

    mod = mx.mod.Module(sym, data_names=['x'], label_names=['y'])
    mod.bind(data_shapes=[mx.io.DataDesc('x', shape=x_data.shape)],
             label_shapes=[mx.io.DataDesc('y', shape=y_data.shape)],
             for_training=True, grad_req='write' if not test else 'null')
    mod.init_params()
    mod.init_optimizer()
    process = None
    for i in range(iterations):
        mod.forward(batch)
        if not test:
            mod.backward()
            mod.update()
        if i % 100 == 0 and i > 0:
            print(i)
            if not test:
                if process:
                    print("Waiting for process")
                    process.wait()
                cmd = [sys.executable, sys.argv[0], 'test']
                print("Starting process: '%s'" % " ".join(cmd))
                process = subprocess.Popen(cmd)
    if process:
        process.wait()
```
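
To see where the main process is blocked, one option is Python's standard `faulthandler` module, which can dump the Python tracebacks of all threads after a timeout without terminating the process. The sketch below is an assumption about how one might instrument the script, not something taken from the original report: the helper names and the 120-second value are hypothetical.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_seconds, stream=None):
    """Dump all thread tracebacks to `stream` if still running after the timeout."""
    if stream is None:
        stream = sys.stderr
    # exit=False keeps the process alive so it can still be inspected afterwards.
    faulthandler.dump_traceback_later(timeout_seconds, exit=False, file=stream)


def disarm_hang_watchdog():
    """Cancel a pending dump once the guarded code has finished normally."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Hypothetical placement: arm before the training loop, disarm after it.
    arm_hang_watchdog(120)
    # ... training loop from code.py would go here ...
    disarm_hang_watchdog()
```

The dump only shows Python-level frames, so if the hang is inside native code it identifies the last Python call that was entered, not the native location.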

## Steps to reproduce
1. conda install mkl
2. conda install numpy
3. pip install mxnet-mkl
4. python3 code.py
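
Because the failure mode is a hang and not a crash, it can help to run step 4 under a time limit so that a stuck run is reported automatically. The following is a minimal sketch using only the standard library. The helper name `run_with_timeout` and the 300-second limit are hypothetical choices, not values from the original report.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or None if it exceeded the timeout."""
    process = subprocess.Popen(cmd)
    try:
        return process.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()  # Reap the killed child so it does not linger as a zombie.
        return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 this_script.py python3 code.py
    exit_code = run_with_timeout(sys.argv[1:], timeout_seconds=300)
    if exit_code is None:
        print("Timed out: the command appears to hang")
    else:
        print("Finished with exit code %d" % exit_code)
```

Note that killing the main process this way does not terminate a 'test' subprocess it may have started, so any leftover child may need to be cleaned up separately.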

## What have you tried to solve it?
Replacing `mxnet-mkl` with `mxnet`, or replacing conda-installed NumPy with 
pip-installed NumPy (`conda uninstall numpy; conda uninstall mkl; pip install 
numpy`), resolves the issue. The output is then as expected:
```
TRAINING
100
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
200
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
300
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
400
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
TESTING
```


[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/12710 ]