YutingZhang opened a new issue #13593: Low CPU usage of MXNet in subprocesses
URL: https://github.com/apache/incubator-mxnet/issues/13593

MXNet has low CPU usage when running CPU operations in multi-process scenarios. Specifically, when MXNet computation runs in a subprocess, MXNet uses only 1 or 2 CPUs to do its job. The behavior differs across MXNet variants (see below) and across machines ...

This issue is critical because it slows down the multiprocess object-detection data loading in gluoncv very significantly, making Faster-RCNN training in gluoncv unusable.

This is tested on the 20181207 version, and other versions (e.g., 1.3.1) show similar problems.

Code to reproduce the issue

Filename: `mxnet_cpu_test.py`

```python
import argparse
import sys
from concurrent import futures
import time
import numpy as np

mx = None


def run(need_import):
    # Declared before any binding of `mx`: Python 3.6+ raises a SyntaxError
    # if `global mx` appears after `import mxnet as mx` in the same function.
    global mx
    if need_import:
        import mxnet as mx  # late import, done inside the (sub)process
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    while True:
        A = mx.nd.dot(A, A)


def parse_args():
    parser = argparse.ArgumentParser("benchmark mxnet cpu")
    parser.add_argument('--num-workers', '-j', dest='num_workers', type=int, default=0)
    parser.add_argument('--late-import', action='store_true')
    return parser.parse_args()


def main(args):
    if args.num_workers == 0:
        print("Main process")
        try:
            run(need_import=args.late_import)
        except KeyboardInterrupt:
            pass
    else:
        print("Subprocesses")
        ex = futures.ProcessPoolExecutor(args.num_workers)
        for _ in range(args.num_workers):
            ex.submit(run, need_import=args.late_import)
        # Keep the parent alive until interrupted.
        while True:
            try:
                time.sleep(10000)
            except KeyboardInterrupt:
                ex.shutdown(wait=False)
                break
    print("Stopped")


if __name__ == "__main__":
    args = parse_args()
    if not args.late_import:
        import mxnet as mx  # early import in the parent process
    main(args)
```

Detailed experiments:

- Run in the main process:
  `python3 mxnet_cpu_test.py --num-workers=0`
  Working fine for all mxnet variants (GPU or CPU-only).
- Run in two subprocesses:
  - `mxnet-cu90` on p3.16x:
    `python3 mxnet_cpu_test.py --num-workers=2`
    It uses only 2 CPUs per subprocess.
  - `mxnet-mkl` on p3.16x:
    `python3 mxnet_cpu_test.py --num-workers=2`
    Same here. It uses only 2 CPUs per subprocess.
  - `mxnet-mkl` on **CPU-only machine c5.18x**:
    `python3 mxnet_cpu_test.py --num-workers=2`
    **Even worse.** It uses only 1.5 CPUs per subprocess.
  - However, for the vanilla CPU version `mxnet` on c5.18x:
    `python3 mxnet_cpu_test.py --num-workers=2`
    It works better. At least, it uses 5 CPUs per subprocess.
  - Weirdly, still the vanilla CPU version `mxnet` but on **GPU machine p3.16x**:
    `python3 mxnet_cpu_test.py --num-workers=2`
    It works worse, i.e., 2 CPUs per subprocess.
- This problem seems related to how MXNet manages threads in each subprocess. If I do not `import mxnet` in the main process and instead `import mxnet` in each subprocess:
  `python3 mxnet_cpu_test.py --num-workers=2 --late-import`
  Then everything works fine.
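
The "CPUs per subprocess" figures above can also be estimated from inside a worker rather than read off a process monitor. The following is a minimal sketch, not part of the original report: `effective_cpus` and `busy_loop` are hypothetical helper names, and `mxnet_dot` is a bounded variant of the reproducer's loop that assumes `mxnet` is installed. It divides the CPU time consumed by the process (summed over all its threads) by the elapsed wall-clock time, so a value near 2 would correspond to "2 CPUs per subprocess".

```python
import time


def effective_cpus(workload, *args):
    """Run workload(*args) and return CPU time / wall time for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of the whole process
    workload(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy_loop(seconds):
    """Hypothetical single-threaded stand-in workload (not from the issue)."""
    deadline = time.perf_counter() + seconds
    count = 0
    while time.perf_counter() < deadline:
        count += 1
    return count


def mxnet_dot(iterations):
    """Bounded variant of the reproducer's loop; assumes mxnet is installed."""
    import mxnet as mx  # late import, as with --late-import
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    for _ in range(iterations):
        A = mx.nd.dot(A, A)
    A.wait_to_read()  # MXNet runs asynchronously; block until the result exists


if __name__ == "__main__":
    # The single-threaded stand-in should report at most about one CPU.
    print("effective CPUs: %.2f" % effective_cpus(busy_loop, 0.2))
```

To compare the early-import and late-import cases, one could submit `effective_cpus` with `mxnet_dot` as the workload to the same `ProcessPoolExecutor` used in `mxnet_cpu_test.py` and read back each future's result. The iteration count would need to be large enough for the measurement to dominate import and startup time.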
