ChaiBapchya commented on issue #16603: Significant slowdown in some DGL models
URL: https://github.com/apache/incubator-mxnet/issues/16603#issuecomment-547685567

Tried reproducing
Got a segfault

Build flags
```
[✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✖ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✖ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔ DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✔ SIGNAL_HANDLER, ✖ DEBUG]
```

DGL & MXNet versions
```
python3 -c "import mxnet;import dgl;print(mxnet.__version__);print(dgl.__version__)"
1.5.1
0.4
```

Log
```
/workspace/dgl_issue/dgl/apps/kg$ DGLBACKEND=mxnet python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 --batch_size_eval 16 --gpu 0 --valid --test -adv
Logs are being recorded at: ckpts/DistMult_FB15k_1/train.log
File not found. Downloading from https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/FB15k.zip
8it [00:00, 11.36it/s]
Download finished. Unzipping the file...
Unzip finished.
|Train|: 483142
|valid|: 50000
|test|: 59071
eval on 50000 edges
eval on 50000 edges
eval on 59071 edges
eval on 59071 edges
[Train](0/100000) average pos_loss: 0.6942696571350098
[Train](0/100000) average neg_loss: 0.7048686146736145
[Train](0/100000) average loss: 0.6995691061019897
[Train](0/100000) average regularization: 0.05789175629615784
0.23716044425964355
[Train](1000/100000) average pos_loss: 0.2992950127273798
[Train](1000/100000) average neg_loss: 0.571561429977417
[Train](1000/100000) average loss: 0.43542822140455245
[Train](1000/100000) average regularization: 0.12885531912744044
16.814379453659058
```
...
Trains for the rest of the steps
```
[Train](99000/100000) average pos_loss: 0.18922672630846502
[Train](99000/100000) average neg_loss: 0.3200621349364519
[Train](99000/100000) average loss: 0.2546444304138422
[Train](99000/100000) average regularization: 0.10238696759194135
13.827665090560913
training takes 1850.2152254581451 seconds
Test average MRR at [99999/100000]: 0.7777071453668755
Test average MR at [99999/100000]: 44.82855377427164
Test average HITS@1 at [99999/100000]: 0.7043134533019587
Test average HITS@3 at [99999/100000]: 0.8327182543041425
Test average HITS@10 at [99999/100000]: 0.8948722723502226

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2e6b140) [0x7f933e364140]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f9414ae14b0]
  [bt] (2) /usr/local/cuda/lib64/libcudart.so.10.0(+0x1d9fe) [0x7f94111fd9fe]
  [bt] (3) /usr/local/cuda/lib64/libcudart.so.10.0(+0x2296b) [0x7f941120296b]
  [bt] (4) /usr/local/cuda/lib64/libcudart.so.10.0(cudaSetDevice+0x47) [0x7f941122a087]
  [bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25bf40a) [0x7f933dab840a]
  [bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25cc21e) [0x7f933dac521e]
  [bt] (7) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b4d8e) [0x7f933daadd8e]
  [bt] (8) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b5cd4) [0x7f933daaecd4]
```
