Thanks @pengzhao-intel, here is a minimal example to reproduce the issue:

```
conda install mkl
conda install numpy
pip install mxnet-mkl
git clone https://github.com/awslabs/sockeye.git
cd sockeye
python -m sockeye.train --num-layers 1 -s setup.py -t setup.py -vs setup.py -vt setup.py -o test_model --batch-size 5 --batch-type sentence --num-embed 8 --transformer-model-size 8 --overwrite-output --use-cpu --transformer-attention-heads 1 --checkpoint-frequency 100 --decode-and-evaluate 2
```

This trains a tiny model on the setup.py file. Once training reaches 100 updates, it spawns a CheckpointDecoder subprocess to decode 2 sentences of the validation data, and then it hangs. The last log line is:

```
[INFO:sockeye.training] Starting process: Decoder-1
```
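If it helps with checking this automatically, here is a rough sketch of a timeout wrapper for detecting the hang without watching the terminal. It is not part of Sockeye or MXNet: the helper name `hangs` and any timeout value you pass are my own assumptions, and a slow but healthy run would be misreported as a hang if the timeout is too short.

```python
import subprocess
import sys


def hangs(cmd, timeout_s):
    """Return True if `cmd` is still running after `timeout_s` seconds.

    Hypothetical helper for illustration only. Note that on timeout
    subprocess.run kills only the direct child process; subprocesses
    spawned by that child (such as a decoder process) may linger.
    """
    try:
        subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The command did not finish within the time limit.
        return True
    return False


# Example with a command that finishes immediately.
print(hangs([sys.executable, "-c", "pass"], 30))
```

To use it on the reproduction above, pass the `python -m sockeye.train ...` command as a list of arguments together with a timeout that is comfortably longer than a normal run on your machine.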
If you set `--decode-and-evaluate 0`, no decoder subprocess is started at each checkpoint, and training runs fine. If you run

```
conda uninstall mkl
conda uninstall numpy
pip install numpy
```

and then run the same training with `--decode-and-evaluate` set to a value greater than 0, it does not hang.

[ Full content available at: https://github.com/apache/incubator-mxnet/issues/8532 ]
