mjsML commented on issue #17238: Error building on ubuntu 18.04 on x64 with intel XEP, CUDA10.0, NCCL and TensorRT
URL: https://github.com/apache/incubator-mxnet/issues/17238#issuecomment-572012537

@leezu The pthread errors had to do with some missing libs. Following this [page](https://nextjournal.com/mpd/compiling-mxnet), specifically running apt with these packages, fixed the errors related to those:

`apt-get install --no-install-recommends software-properties-common apt-transport-https build-essential cmake libjemalloc-dev libatlas-base-dev liblapack-dev liblapacke-dev libopenblas-dev libopencv-dev libcurl4-openssl-dev libzmq3-dev ninja-build libhdf5-dev libomp-dev`

But the assert.h error inside MKLDNN didn't go away. After a few hours banging my head against the wall, I got to the root cause: MKLDNN has a few ["header leaks"](https://jira.mongodb.org/browse/CXX-1423), which means I had to manually go into the MKLDNN src and add `#include <assert.h>` and the like. That fixed the assert.h error, but then a whole bunch of other leaks sprung up, because MKLDNN references its own internal headers improperly all over the place.

I ended up ditching MKLDNN (by setting `-DUSE_MKLDNN=0`) because my main target is a fast GPU build that uses the Nvidia packages (mainly NCCL and TensorRT on x64 for my desktop training machine, and TensorRT on aarch64 for the Jetson Nano). When I have more time I'll send a pull request to the Intel repo to fix the header issues; however, I'm not sure how you sync the 3rd party folder with the source?

Now the build passes, but the tests fail with linking errors ...
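For anyone who wants to find header leaks like the ones described above without waiting for a full build to fail, one common approach is to compile each header as the only include of an otherwise empty translation unit. The following is a minimal sketch, not something taken from the MXNet or MKLDNN build scripts: the function names `check_header` and `list_leaky_headers` are hypothetical, the compiler is assumed to be reachable as `c++` (or via `CXX`), and a real MKLDNN check would likely need more `-I` directories than the single one used here.

```shell
#!/bin/sh
# Sketch: report headers that do not compile on their own, i.e. headers that
# rely on some other file having already included <assert.h> and friends.
# Function names are hypothetical; the compiler is assumed to be ${CXX:-c++}.

# check_header INCLUDE_DIR HEADER
# Succeeds only if HEADER compiles as the sole include of a translation unit.
check_header() {
    printf '#include "%s"\n' "$2" |
        "${CXX:-c++}" -std=c++11 -fsyntax-only -I "$1" -x c++ - 2>/dev/null
}

# list_leaky_headers INCLUDE_DIR
# Prints, one per line, every *.h / *.hpp under INCLUDE_DIR that fails the check.
list_leaky_headers() {
    find "$1" -type f \( -name '*.h' -o -name '*.hpp' \) | sort |
    while read -r header_path; do
        header_rel=${header_path#"$1"/}
        check_header "$1" "$header_rel" || printf '%s\n' "$header_rel"
    done
}

# Example call (the path is an assumption, adjust it to your checkout):
# list_leaky_headers 3rdparty/mkldnn/include
```

A header that calls `assert()` without including `<assert.h>` or `<cassert>` itself would show up in the output, which matches the kind of fix described above.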
I did specify that I wanted to use MKL for BLAS, but I'm getting this when linking the examples / tests:

`/usr/bin/ld: libmxnet.a(la_op.cc.o): undefined reference to symbol 'cblas_dtrsm'`
`//usr/lib/x86_64-linux-gnu/libblas.so.3: error adding symbols: DSO missing from command line`
`collect2: error: ld returned 1 exit status`

I then built the latest OpenBLAS (0.3.8) and pointed the system BLAS and LAPACK at it through update-alternatives, following these [instructions](https://github.com/xianyi/OpenBLAS/wiki/faq#debianlts):

`sudo update-alternatives --install /usr/lib/libblas.so.3 libblas.so.3 /opt/OpenBLAS/lib/libopenblas.so.0 41 --slave /usr/lib/liblapack.so.3 liblapack.so.3 /opt/OpenBLAS/lib/libopenblas.so.0`

This was futile as well: I'm still stuck with the same linking error. I'm at a loss as to why we need to link OpenBLAS in the first place if I built with MKL as my BLAS of choice.

Also, sure, I'd love to contribute to the build process or otherwise :) IMHO we actually need a few new build types. My suggestion is Training and Inference, per accelerator type. In this example I'm building a CUDA training build (CUDA, cuDNN, NCCL and TensorRT), while a CUDA inference build would have CUDA, cuDNN and TensorRT only (in my case for an edge accelerator on aarch64 too :/ ). [Insert accelerator type here] would need the same pair. In my mind this is like the "Debug" and "Release" configs, but for the ML world. Food for thought.
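Going back to the linker error above: "DSO missing from command line" generally means ld found the symbol in a shared library that is only pulled in indirectly and is not named on the link command itself, so the usual remedy is to add the BLAS library explicitly to the link line. Before changing anything, it can help to see what the `libblas.so.3` paths actually resolve to and whether they export `cblas_dtrsm`. The sketch below is an assumption-laden diagnostic, not part of the MXNet build: `has_defined_symbol` is a hypothetical helper, and the two paths are simply the ones that appear in the comment (the linker reports the `x86_64-linux-gnu` one, while the update-alternatives command registered `/usr/lib/libblas.so.3`).

```shell
#!/bin/sh
# Sketch: check which file each libblas.so.3 path resolves to and whether it
# exports cblas_dtrsm. has_defined_symbol is a hypothetical helper name.

# has_defined_symbol SYMBOL
# Reads `nm -D --defined-only` output on stdin; succeeds if SYMBOL is listed.
has_defined_symbol() {
    awk -v sym="$1" '
        { name = $NF; sub(/@.*/, "", name) }   # drop any @VERSION suffix
        name == sym { found = 1 }
        END { exit !found }
    '
}

# Paths taken from the comment: the first is the one ld complains about,
# the second is the one the update-alternatives command installed.
for lib in /usr/lib/x86_64-linux-gnu/libblas.so.3 /usr/lib/libblas.so.3; do
    [ -e "$lib" ] || continue
    command -v nm >/dev/null 2>&1 || break
    printf '%s -> %s\n' "$lib" "$(readlink -f "$lib")"
    if nm -D --defined-only "$lib" | has_defined_symbol cblas_dtrsm; then
        echo "  exports cblas_dtrsm"
    else
        echo "  does not export cblas_dtrsm"
    fi
done
```

If the two paths resolve to different files, that may explain why switching the alternative had no visible effect on the error.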
