kohillyang opened a new issue #19159: URL: https://github.com/apache/incubator-mxnet/issues/19159
## Description

Hello, I'm using Flask with MXNet to write a server. Since it is a web app, we want GPU memory to be fully statically allocated. However, as the title says, GPU memory usage keeps increasing with MXNet 1.6.0post0 and 1.7.0, while with MXNet 1.5.1 everything is fine. Since Flask debug mode uses multi-threading, I think it may be caused by some calls that are not thread-safe.

## To Reproduce

This is a naive Flask server:

```python
import logging
import os

import flask
import mxnet as mx
import numpy as np
import tornado.httpserver
import tornado.ioloop
import tornado.wsgi
from flask_cors import CORS

os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"
os.environ["MXNET_GPU_MEM_POOL_TYPE"] = "Round"


class Predictor(object):
    def __init__(self):
        ctx = mx.gpu(0)
        net = mx.gluon.model_zoo.vision.resnet50_v1()
        net.initialize()
        net.collect_params().reset_ctx(ctx)
        net.hybridize(active=True)
        max_h = 768
        max_w = 768
        # Run one forward pass with the largest input size.
        _ = net(mx.nd.zeros(shape=(1, 3, max_h, max_w), ctx=ctx))
        self.ctx = ctx
        self.net = net

    def __call__(self, *args, **kwargs):
        max_h = 768
        max_w = 768
        # Each request uses a random input size below the maximum.
        x_h = np.random.randint(100, max_h)
        x_w = np.random.randint(100, max_w)
        xx = np.random.randn(1, 3, x_h, x_w)
        y = self.net(mx.nd.array(xx, ctx=self.ctx))
        return y.asnumpy().sum()


if __name__ == '__main__':
    DEBUG = True
    PORT = 21500
    app = flask.Flask(__name__)
    CORS(app, supports_credentials=True)
    predictor = Predictor()

    @app.route('/test', methods=['POST'])
    def net_forward():
        try:
            r = predictor()
            # Return the scalar as JSON; a Flask view must return a valid response.
            return flask.jsonify(float(r))
        except Exception as e:
            logging.exception(e)
            print("failed")
            return flask.jsonify(str(e)), 400

    print("starting webserver...")
    if DEBUG:
        app.run(debug=True, host='0.0.0.0', port=PORT)
    else:
        http_server = tornado.httpserver.HTTPServer(
            tornado.wsgi.WSGIContainer(app))
        http_server.listen(PORT, address="0.0.0.0")
        tornado.ioloop.IOLoop.instance().start()
```

Then run the
following code to send requests to the server:

```python
import json

import requests


def remote_call(url):
    register_data = {"Pic": "xx"}
    data = json.dumps(register_data)
    return requests.post(url, data)


if __name__ == '__main__':
    register_url = 'http://127.0.0.1:21500/test'
    # Send requests in an endless loop.
    while True:
        try:
            remote_call(register_url)
        except Exception as e:
            print(e)
```

## Environment

I'm using Flask 1.0.2 and Tornado 5.1, but I think the problem is independent of the Flask and Tornado versions.

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

```
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python
```

```
/data2/kohill/anaconda3/bin/python /data2/kohill/mx-detection/diagnose.py
----------Python Info----------
Version : 3.7.0
Compiler : GCC 7.2.0
Build : ('default', 'Jun 28 2018 13:15:42')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 20.2.2
Directory : /data2/kohill/anaconda3/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version : 1.7.0
Directory : /data2/kohill/anaconda3/lib/python3.7/site-packages/mxnet
Commit Hash : 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a 64f737cdd59fe88d2c5b479f25d011c5156b6a8a
Library : ['/data2/kohill/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so']
Build features:
? CUDA ? CUDNN ? NCCL ? CUDA_RTC ? TENSORRT ? CPU_SSE ?
CPU_SSE2 ? CPU_SSE3 ? CPU_SSE4_1 ? CPU_SSE4_2 ? CPU_SSE4A ? CPU_AVX ? CPU_AVX2 ? OPENMP ? SSE ? F16C ? JEMALLOC ? BLAS_OPEN ? BLAS_ATLAS ? BLAS_MKL ? BLAS_APPLE ? LAPACK ? MKLDNN ? OPENCV ? CAFFE ? PROFILER ? DIST_KVSTORE ? CXX14 ? INT64_TENSOR_SIZE ? SIGNAL_HANDLER ? DEBUG ? TVM_OP
----------System Info----------
Platform : Linux-4.15.0-117-generic-x86_64-with-debian-stretch-sid
system : Linux
node : ubuntu
release : 4.15.0-117-generic
version : #118~16.04.1-Ubuntu SMP Sat Sep 5 23:35:06 UTC 2020
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 1200.672
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 5002.04
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0060 sec, LOAD: 1.4688 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1272 sec, LOAD: 1.2150 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1045)>, DNS finished in 0.10556268692016602 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0053 sec, LOAD: 1.4548 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0048 sec, LOAD: 11.7945 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.005016326904296875 sec.
----------Environment----------
```
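The description suspects that some calls are not thread-safe. One way to probe that hypothesis might be to serialize every call to the predictor behind a lock and check whether GPU memory still grows. The sketch below is only an illustration under that assumption: `SerializedPredictor` is a hypothetical wrapper name that does not appear in the server script, and it uses nothing beyond the standard library.

```python
import threading


class SerializedPredictor(object):
    """Hypothetical wrapper: lets only one thread at a time run the wrapped callable."""

    def __init__(self, predictor):
        self._predictor = predictor
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Hold the lock for the entire call so that concurrent requests
        # execute the wrapped callable strictly one after another.
        with self._lock:
            return self._predictor(*args, **kwargs)
```

In the server script this would amount to `predictor = SerializedPredictor(Predictor())`. If memory usage stays flat with the lock in place, that would be consistent with a concurrency problem, although it would not by itself prove one.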
