Thanks @pengzhao-intel, here is a minimal example to reproduce the issue.
You can run this:
```
conda install mkl
conda install numpy
pip install mxnet-mkl
git clone https://github.com/awslabs/sockeye.git
cd sockeye
python -m sockeye.train --num-layers 1 -s setup.py -t setup.py -vs setup.py -vt setup.py \
    -o test_model --batch-size 5 --batch-type sentence --num-embed 8 \
    --transformer-model-size 8 --overwrite-output --use-cpu \
    --transformer-attention-heads 1 --checkpoint-frequency 100 \
    --decode-and-evaluate 2
```
This trains a tiny model on the `setup.py` file. Once training reaches 100
updates, it spawns a CheckpointDecoder subprocess to decode 2 sentences of the
validation data, and at that point it hangs. The last log line is:
```
[INFO:sockeye.training] Starting process: Decoder-1
```
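
To tell a hang apart from a slow run without watching the log, the training command can be wrapped in a timeout. This is only a rough sketch: the helper name `run_with_timeout`, the script name `hang_check.py`, and the 600 second limit are my own assumptions, not part of sockeye.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Return "finished" if cmd exits within timeout_s seconds, else "hung"."""
    try:
        # check=False: a non-zero exit code still counts as "finished"
        subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising; any
        # grandchild (e.g. a decoder process) may need separate cleanup.
        return "hung"
    return "finished"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   python hang_check.py python -m sockeye.train <same arguments as above>
    # 600 seconds is an arbitrary limit chosen for this tiny model.
    print(run_with_timeout(sys.argv[1:], 600))
```

Running it once with `--decode-and-evaluate 2` and once with `--decode-and-evaluate 0` should make the difference between the two settings easy to record.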

If you set `--decode-and-evaluate 0`, no decoder subprocess will be started at 
each checkpoint, and training runs fine.

If you run
```
conda uninstall mkl
conda uninstall numpy
pip install numpy
```
and rerun the same training with `--decode-and-evaluate` set to a value greater
than 0, it no longer hangs.
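
I have not identified the cause. One way to narrow it down outside of sockeye might be a probe like the one below. It assumes (unverified) that the hang comes from a forked child doing numpy work after the parent has already used the MKL-backed numpy. The names `child_finishes` and `numpy_work` are made up for this sketch, and the `"fork"` start method is only available on Unix-like systems.

```python
import multiprocessing as mp


def child_finishes(target, timeout_s=30.0):
    """Run target in a forked child; True if it exits cleanly in time."""
    ctx = mp.get_context("fork")  # assumption: the subprocess is forked
    proc = ctx.Process(target=target)
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        # The child did not return in time: treat it as hung and clean up.
        proc.terminate()
        proc.join()
        return False
    return proc.exitcode == 0


def numpy_work():
    """A small matrix product, which goes through numpy's BLAS backend."""
    import numpy as np  # imported lazily so the helper above needs no numpy
    a = np.ones((256, 256))
    a.dot(a)


# Hypothetical probe: use numpy in the parent first, then in a forked child.
#   numpy_work()
#   print(child_finishes(numpy_work))
```

If the probe reports `False` with the conda MKL numpy and `True` with the pip numpy, that would point at the numpy/MKL combination rather than at sockeye itself. If it reports `True` in both setups, the assumption above is probably wrong.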

[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/8532 ]