houweidong commented on issue #14742: CPU memory leak when running train_yolov3.py
URL: https://github.com/apache/incubator-mxnet/issues/14742#issuecomment-485638529

ps aux --sort -rss | head -20
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 7551 436 13.3 134231992 13047996 pts/0 Sl+ 11:28 122:55 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8016 93.1 7.5 12621720 7394328 pts/0 Sl+ 11:29 25:22 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 7967 91.7 7.5 12631112 7379188 pts/0 Sl+ 11:29 25:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8034 91.9 7.5 12604824 7378900 pts/0 Sl+ 11:29 25:02 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 7988 92.0 7.5 12605612 7378328 pts/0 Sl+ 11:29 25:05 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 7997 91.9 7.5 12604048 7377076 pts/0 Rl+ 11:29 25:03 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8025 92.1 7.5 12573304 7345996 pts/0 Sl+ 11:29 25:06 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 7979 92.1 7.5 12570536 7344564 pts/0 Sl+ 11:29 25:06 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8007 92.4 7.5 12562104 7334816 pts/0 Sl+ 11:29 25:11 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8076 0.0 7.4 12646640 7315760 pts/0 Sl+ 11:29 0:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8085 0.0 7.4 12646640 7315760 pts/0 Sl+ 11:29 0:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8094 0.0 7.4 12646640 7315760 pts/0 Sl+ 11:29 0:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8103 0.0 7.4 12646640 7315760 pts/0 Sl+ 11:29 0:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8115 0.0 7.4 12646640 7315760 pts/0 Sl+ 11:29 0:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8058 0.0 7.4 12646640 7315756 pts/0 Sl+ 11:29 0:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8067 0.0 7.4 12646640 7315756 pts/0 Sl+ 11:29 0:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8049 0.0 7.4 12646640 7315752 pts/0 Sl+ 11:29 0:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 10965 2.5 1.5 12241692 1481400 pts/0 Sl+ 4月05 639:09 /root/pycharm/jre64/bin/java -classpath /root/pycharm/lib/bootstrap.jar:/root/pycharm/lib/extensions.jar:/root/pycharm/lib/util.jar:/root/pycharm/lib/jdom.jar:/root/pycharm/lib/log4j.jar:/root/pycharm/lib/trove4j.jar:/root/pycharm/lib/jna.jar -Xms128m -Xmx750m -XX:ReservedCodeCacheSize=240m -XX:+UseConcMarkSweepGC -XX:SoftRefLRUPolicyMSPerMB=50 -ea -Dsun.io.useCanonCaches=false -Djava.net.preferIPv4Stack=true -Djdk.http.auth.tunneling.disabledSchemes="" -XX:+HeapDumpOnOutOfMemoryError -XX:-OmitStackTraceInFastThrow -Dawt.useSystemAAFontSettings=lcd -Dsun.java2d.renderer=sun.java2d.marlin.MarlinRenderingEngine -XX:ErrorFile=/root/java_error_in_PYCHARM_%p.log -XX:HeapDumpPath=/root/java_error_in_PYCHARM.hprof -Didea.paths.selector=PyCharmCE2018.3 -Djb.vmOptionsFile=/root/pycharm/bin/pycharm64.vmoptions -Didea.platform.prefix=PyCharmCore com.intellij.idea.Main
erised 35435 5.0 0.9 3785860 974604 ? Sl 4月09 993:36 /opt/teamviewer/tv_bin/TeamViewer_Desktop
```
This is at epoch 5; CPU memory usage is about 30G.

..........................................................

ps aux --sort -rss | head -20
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 7551 438 13.5 129978316 13223692 pts/0 Sl+ 11:28 244:38 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 7997 90.6 7.5 12600784 7374952 pts/0 Sl+ 11:29 49:44 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8025 91.1 7.5 12590040 7364248 pts/0 Sl+ 11:29 50:00 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 7967 90.7 7.5 12638704 7356276 pts/0 Sl+ 11:29 49:47 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 7979 91.3 7.5 12573112 7346988 pts/0 Sl+ 11:29 50:06 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8034 90.7 7.5 12572472 7346384 pts/0 Sl+ 11:29 49:47 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 7988 90.8 7.5 12618464 7346056 pts/0 Sl+ 11:29 49:52 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8007 90.9 7.5 12571272 7345172 pts/0 Sl+ 11:29 49:54 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8016 91.3 7.5 12570272 7344144 pts/0 Sl+ 11:29 50:09 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8094 0.3 7.5 12646896 7325600 pts/0 Sl+ 11:29 0:10 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8103 0.3 7.5 12646896 7325600 pts/0 Sl+ 11:29 0:10 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8049 0.3 7.5 12646896 7325596 pts/0 Sl+ 11:29 0:10 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8058 0.3 7.5 12646896 7325596 pts/0 Sl+ 11:29 0:10 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8085 0.3 7.5 12646896 7325596 pts/0 Sl+ 11:29 0:10 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8115 0.3 7.5 12646896 7325596 pts/0 Sl+ 11:29 0:10 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8076 0.2 7.5 12646896 7325592 pts/0 Sl+ 11:29 0:09 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 8067 0.3 7.5 12646896 7325588 pts/0 Sl+ 11:29 0:10 /usr/bin/python3.5 /root/models/yolov3_origin/train_yolo3.py --gpus 1,2 --network darknet53 --syncbn --batch-size 16 -j 8 --val-interval 10
root 10965 2.5 1.5 12241692 1482248 pts/0 Sl+ 4月05 640:07 /root/pycharm/jre64/bin/java -classpath /root/pycharm/lib/bootstrap.jar:/root/pycharm/lib/extensions.jar:/root/pycharm/lib/util.jar:/root/pycharm/lib/jdom.jar:/root/pycharm/lib/log4j.jar:/root/pycharm/lib/trove4j.jar:/root/pycharm/lib/jna.jar -Xms128m -Xmx750m -XX:ReservedCodeCacheSize=240m -XX:+UseConcMarkSweepGC -XX:SoftRefLRUPolicyMSPerMB=50 -ea -Dsun.io.useCanonCaches=false -Djava.net.preferIPv4Stack=true -Djdk.http.auth.tunneling.disabledSchemes="" -XX:+HeapDumpOnOutOfMemoryError -XX:-OmitStackTraceInFastThrow -Dawt.useSystemAAFontSettings=lcd -Dsun.java2d.renderer=sun.java2d.marlin.MarlinRenderingEngine -XX:ErrorFile=/root/java_error_in_PYCHARM_%p.log -XX:HeapDumpPath=/root/java_error_in_PYCHARM.hprof -Didea.paths.selector=PyCharmCE2018.3 -Djb.vmOptionsFile=/root/pycharm/bin/pycharm64.vmoptions -Didea.platform.prefix=PyCharmCore com.intellij.idea.Main
erised 35435 5.0 1.0 3785860 990668 ? Sl 4月09 999:05 /opt/teamviewer/tv_bin/TeamViewer_Desktop
```
This is at epoch 10; CPU memory usage is about 40G.

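For comparing snapshots like the `ps aux` listings in this comment, a small parser can pull out the RSS column per process. The sketch below is only an illustration: the function names `parse_ps_aux` and `total_rss_gib` are hypothetical, and it assumes the default Linux `ps aux` column layout, where RSS is reported in KiB. Note that summing RSS across forked worker processes counts pages they share once per process, so the total is an upper bound and not the physical memory actually in use.

```python
def parse_ps_aux(ps_text):
    """Yield (pid, rss_kib, command) for each process row of `ps aux` output."""
    for line in ps_text.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        # Split at most 10 times so the command keeps its own spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header row and blank or malformed lines
        yield int(fields[1]), int(fields[5]), fields[10]


def total_rss_gib(ps_text, needle):
    """Sum RSS over rows whose command contains `needle`, converted to GiB.

    Shared pages are counted once per process, so treat this as an upper bound.
    """
    total_kib = sum(rss for _, rss, cmd in parse_ps_aux(ps_text) if needle in cmd)
    return total_kib / (1024 ** 2)
```

Calling `total_rss_gib(snapshot, "train_yolo3.py")` on the text of each snapshot gives one number per epoch to compare; per-PID deltas from `parse_ps_aux` show which process is growing.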
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
