[GitHub] [incubator-mxnet] mjsML commented on issue #17238: Error building on ubuntu 18.04 on x64 with intel XEP, CUDA10.0, NCCL and TensorRT

GitBox Wed, 08 Jan 2020 03:42:17 -0800

mjsML commented on issue #17238: Error building on ubuntu 18.04 on x64 with 
intel XEP, CUDA10.0, NCCL and TensorRT
URL: 
https://github.com/apache/incubator-mxnet/issues/17238#issuecomment-572012537
 
 
   @leezu 
   The pthread errors had to do with some missing libs. Following this 
[page](https://nextjournal.com/mpd/compiling-mxnet), specifically running apt 
with these packages:
   
   `apt-get install --no-install-recommends \
     software-properties-common apt-transport-https \
     build-essential cmake libjemalloc-dev \
     libatlas-base-dev liblapack-dev liblapacke-dev libopenblas-dev libopencv-dev \
     libcurl4-openssl-dev libzmq3-dev ninja-build libhdf5-dev libomp-dev`
   
   fixed those errors, but the assert.h error inside MKLDNN didn't go away. 
   
   After a few hours of banging my head against the wall, I got to the root 
cause: MKLDNN has a few ["header leaks"](https://jira.mongodb.org/browse/CXX-1423), 
which means I had to manually go into the MKLDNN source and add 
`#include <assert.h>` and the like. 
   That fixed the assert.h error, but then a whole bunch of other leaks 
sprang up because MKLDNN references its own internal headers improperly all 
across the codebase.
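
A rough way to find other candidates for this kind of missing include is to list the files that call `assert(...)` without including `<assert.h>` or `<cassert>` themselves. The sketch below is a text heuristic, not a C++ parser, and the function name `find_assert_leaks` is made up for illustration:

```shell
# Hypothetical helper: print source files under "$1" that use assert(...)
# but do not directly include <assert.h> or <cassert>.
# Text heuristic only: it also matches assert( inside comments or strings.
find_assert_leaks() {
    dir=$1
    find "$dir" -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \) |
    while IFS= read -r f; do
        # Match assert( only when it is not part of a longer name
        # such as static_assert(.
        if grep -qE '(^|[^_[:alnum:]])assert[[:space:]]*\(' "$f" &&
           ! grep -qE '^[[:space:]]*#[[:space:]]*include[[:space:]]*<(assert\.h|cassert)>' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling it as `find_assert_leaks path/to/mkldnn/src` (placeholder path) would print one file per line. A file that shows up is only a candidate, since it may still compile today through a transitive include.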
   
   I ended up ditching MKLDNN (by setting `-DUSE_MKLDNN=0`) because my main 
target is a fast GPU build that uses the Nvidia packages (mainly NCCL and 
TensorRT on x64 for my desktop training machine, and TensorRT on aarch64 for a 
Jetson Nano). When I have more time I'll pull the Intel repo and fix the header 
issues; however, I'm not sure how you sync the 3rd party folder with the 
source?
   Now the build passes but the tests fail with linking errors. I did 
specify that I wanted to use MKL for BLAS, but I'm getting this when linking 
the examples / tests: 
   `/usr/bin/ld: libmxnet.a(la_op.cc.o): undefined reference to symbol 'cblas_dtrsm'
   //usr/lib/x86_64-linux-gnu/libblas.so.3: error adding symbols: DSO missing from command line
   collect2: error: ld returned 1 exit status`
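
For context, "DSO missing from command line" generally means the linker found the symbol in a shared library that was not named explicitly on the link line. To check which library actually exports `cblas_dtrsm`, the dynamic symbol table can be inspected with `nm`. This is a minimal sketch that assumes binutils is installed; both helper names are made up:

```shell
# Hypothetical helpers for checking where a symbol is defined.

# Read `nm` output on stdin; succeed if it lists the symbol named in "$1".
nm_lists_symbol() {
    awk -v sym="$1" '
        { name = $NF; sub(/@.*/, "", name) }   # drop version suffixes like @@GLIBC_2.2.5
        name == sym { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Succeed if shared library "$1" exports symbol "$2" in its dynamic symbol table.
exports_symbol() {
    nm -D --defined-only "$1" 2>/dev/null | nm_lists_symbol "$2"
}
```

For example, `exports_symbol /usr/lib/x86_64-linux-gnu/libblas.so.3 cblas_dtrsm && echo defined` tests the library named in the error. Running the same check against the MKL libraries would show whether they could satisfy the reference instead.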
   I then built the latest OpenBLAS (0.3.8) and repointed the system BLAS and 
LAPACK libraries with update-alternatives, following these 
[instructions](https://github.com/xianyi/OpenBLAS/wiki/faq#debianlts):
   `sudo update-alternatives --install /usr/lib/libblas.so.3 libblas.so.3 /opt/OpenBLAS/lib/libopenblas.so.0 41 \
      --slave /usr/lib/liblapack.so.3 liblapack.so.3 /opt/OpenBLAS/lib/libopenblas.so.0`
   This was futile as well: I'm still stuck with the same linking error. 
I'm at a loss as to why we need to link OpenBLAS in the first place when I 
built with MKL as my BLAS of choice.
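
One thing that may be worth checking: the `update-alternatives` command registers `/usr/lib/libblas.so.3`, while the linker error names `/usr/lib/x86_64-linux-gnu/libblas.so.3`. These are different paths and may resolve to different libraries. The sketch below prints what each one points to; the helper name is made up, and `readlink -f` is assumed to be available (GNU coreutils):

```shell
# Hypothetical helper: print the real file a library path resolves to,
# or "(missing)" if the path does not exist.
resolve_lib() {
    if [ -e "$1" ]; then
        readlink -f "$1"
    else
        printf '(missing)\n'
    fi
}

# Compare the path registered with update-alternatives against the path
# that appears in the linker error.
for path in /usr/lib/libblas.so.3 /usr/lib/x86_64-linux-gnu/libblas.so.3; do
    printf '%s -> %s\n' "$path" "$(resolve_lib "$path")"
done
```

If the second path does not resolve to the OpenBLAS build, the alternatives change would not affect what the linker picks up.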
   
   Also, sure, I'd love to contribute to the build process or otherwise :) 
imho we actually need a few new build types. My suggestions are Training and 
Inference, per accelerator type. In this example I'm building a CUDA training 
build (CUDA, cuDNN, NCCL, and TensorRT), while an Inference build would have 
CUDA, cuDNN and TensorRT only (in my case for an edge accelerator on aarch64 
too :/ ). [Insert accelerator type here] would need the same. In my mind this 
is like the "Debug" and "Release" config in the ML world. Food for thought.
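
To make the suggestion concrete, here is a sketch of how such profiles could map to CMake flags. The profile names and the function are hypothetical, nothing like this exists in the build today. Only `-DUSE_MKLDNN=0` is the flag I actually used above; the other option names are MXNet CMake options as I understand them and should be verified against the CMakeLists.txt of your checkout:

```shell
# Hypothetical mapping from a build profile to CMake flags.
# Profile names are made up; verify the option names against your checkout.
flags_for_profile() {
    case "$1" in
        training)
            # CUDA, cuDNN, NCCL and TensorRT
            echo "-DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_NCCL=1 -DUSE_TENSORRT=1 -DUSE_MKLDNN=0" ;;
        inference)
            # CUDA, cuDNN and TensorRT only
            echo "-DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_NCCL=0 -DUSE_TENSORRT=1 -DUSE_MKLDNN=0" ;;
        *)
            echo "unknown profile: $1" >&2
            return 1 ;;
    esac
}

# Dry run: print the configure command instead of executing it.
for profile in training inference; do
    echo "cmake -GNinja $(flags_for_profile "$profile") .."
done
```

The loop only prints the commands, so it is safe to run anywhere. A real implementation would more likely live in the CMake files themselves than in a wrapper script.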


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
