ChaiBapchya commented on issue #16603: Significant slowdown in some DGL models
URL: https://github.com/apache/incubator-mxnet/issues/16603#issuecomment-547685567
 
 
   Tried reproducing this. Training ran to completion, but the process ended with a segfault (stack trace below).
   
   Build flags
   ```
   [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✖ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✖ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔ DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✔ SIGNAL_HANDLER, ✖ DEBUG]
   ```
   
   DGL & MXNet versions
   ```
   python3 -c "import mxnet;import dgl;print(mxnet.__version__);print(dgl.__version__)"
   1.5.1
   0.4
   ```
   
   Log
   ```
   /workspace/dgl_issue/dgl/apps/kg$ DGLBACKEND=mxnet python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 --batch_size_eval 16 --gpu 0 --valid --test -adv
   Logs are being recorded at: ckpts/DistMult_FB15k_1/train.log
   File not found. Downloading from https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/FB15k.zip
   8it [00:00, 11.36it/s]
   Download finished. Unzipping the file...
   Unzip finished.
   |Train|: 483142
   |valid|: 50000
   |test|: 59071
   eval on 50000 edges
   eval on 50000 edges
   eval on 59071 edges
   eval on 59071 edges
   [Train](0/100000) average pos_loss: 0.6942696571350098
   [Train](0/100000) average neg_loss: 0.7048686146736145
   [Train](0/100000) average loss: 0.6995691061019897
   [Train](0/100000) average regularization: 0.05789175629615784
   0.23716044425964355
   [Train](1000/100000) average pos_loss: 0.2992950127273798
   [Train](1000/100000) average neg_loss: 0.571561429977417
   [Train](1000/100000) average loss: 0.43542822140455245
   [Train](1000/100000) average regularization: 0.12885531912744044
   16.814379453659058
   ```
   ...
   Training continues through the remaining steps:
   ```
   [Train](99000/100000) average pos_loss: 0.18922672630846502
   [Train](99000/100000) average neg_loss: 0.3200621349364519
   [Train](99000/100000) average loss: 0.2546444304138422
   [Train](99000/100000) average regularization: 0.10238696759194135
   13.827665090560913
   training takes 1850.2152254581451 seconds
   Test average MRR at [99999/100000]: 0.7777071453668755
   Test average MR at [99999/100000]: 44.82855377427164
   Test average HITS@1 at [99999/100000]: 0.7043134533019587
   Test average HITS@3 at [99999/100000]: 0.8327182543041425
   Test average HITS@10 at [99999/100000]: 0.8948722723502226
   
   Segmentation fault: 11
   
   Stack trace:
     [bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2e6b140) [0x7f933e364140]
     [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f9414ae14b0]
     [bt] (2) /usr/local/cuda/lib64/libcudart.so.10.0(+0x1d9fe) [0x7f94111fd9fe]
     [bt] (3) /usr/local/cuda/lib64/libcudart.so.10.0(+0x2296b) [0x7f941120296b]
     [bt] (4) /usr/local/cuda/lib64/libcudart.so.10.0(cudaSetDevice+0x47) [0x7f941122a087]
     [bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25bf40a) [0x7f933dab840a]
     [bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25cc21e) [0x7f933dac521e]
     [bt] (7) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b4d8e) [0x7f933daadd8e]
     [bt] (8) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b5cd4) [0x7f933daaecd4]
   ```
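   The frames are mostly anonymous offsets into `libmxnet.so` and `libcudart.so`, and `cudaSetDevice` appearing after training/test output suggests the crash happens during process teardown. To symbolize the anonymous frames, one can feed each library/offset pair to `addr2line` — though file:line resolution needs a build with debug symbols, and the flags above show `✖ DEBUG`. A minimal sketch (not part of the report; the sample trace lines are copied from the log above) of extracting those pairs:

   ```python
   import re

   # One MXNet backtrace frame looks like:
   #   [bt] (4) /usr/local/cuda/lib64/libcudart.so.10.0(cudaSetDevice+0x47) [0x7f941122a087]
   FRAME_RE = re.compile(r"\[bt\] \((\d+)\) (\S+?)\(([^)]*)\)")

   def parse_frames(trace):
       """Return (frame index, library path, offset-or-symbol) tuples."""
       return [(int(n), lib, loc) for n, lib, loc in FRAME_RE.findall(trace)]

   trace = (
       "[bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2e6b140) [0x7f933e364140]\n"
       "[bt] (4) /usr/local/cuda/lib64/libcudart.so.10.0(cudaSetDevice+0x47) [0x7f941122a087]\n"
   )

   for n, lib, loc in parse_frames(trace):
       if loc.startswith("+0x"):
           # Anonymous offset: ask addr2line (gives file:line only on a debug build).
           print(f"addr2line -f -C -e {lib} {loc[1:]}")
       else:
           # Frame already carries a symbol name, e.g. cudaSetDevice+0x47.
           print(f"{lib}: {loc}")
   ```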

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
