timmeh87 opened a new issue #20954:
URL: https://github.com/apache/incubator-mxnet/issues/20954


   **This bug affects 2.0-RC0, which I am using. As far as I understand, it 
also applies to all previous versions: as people update their Python version, 
builds of older MXNet versions will break as well.** 
   
   ## Description
   For about a week I have been pulling my hair out trying to get MXNet to 
build. There are actually several bugs with building on VS2019 that prevent the 
build from completing, and I had to make these modifications to the source code:
   
   1) replace floor() with floorf(), ceil() with ceilf(), round() with roundf() 
everywhere (as per a discussion on the Blender forums)
   https://devtalk.blender.org/t/cuda-compile-error-windows-10/17886/5
   2) instantiate 6 templates as per #17633
   3) sneak lapack.lib into the linker, because for some reason the CMake script 
assumes Windows does not have it and excludes it on purpose (I got my OpenBLAS 
and LAPACK from vcpkg, if that matters)
   
   I used the same flags as the Windows GPU CI script, except that I also 
enabled USE_CPP_PACKAGE because I require it. After those changes libmxnet.dll 
is produced, but the build fails in Python when generating the opwrapper.
   
   Three other things that are very important about my system: 
   1) Python 3.10
   2) I have a GTX 660 installed, a device for which the highest supported 
build is mxnet_30
   3) I have CUDA 11.2 installed, because I am trying to make builds for 
someone else. This machine is a powerful build machine, so we don't have to do 
builds at home on our laptops. I am not going to use MXNet on my 660; it just 
happens to be there on the build machine.
   
   OK, so when the build gets to Python, it fails in two different ways due to 
two different problems:
   1) Python >= 3.8 changed the DLL search path: it no longer searches PATH, so 
loading libmxnet.dll fails because the cudart DLL it depends on cannot be found. 
   
   The error message is 
   FileNotFoundError: Could not find module 
'F:\incubator-mxnet\build\Debug\libmxnet.dll' (or one of its dependencies). Try 
using the full path with constructor syntax.
   
   The fix: a call to "os.add_dll_directory" MUST be made on Python 
versions >= 3.8
   https://docs.python.org/3/library/os.html#os.add_dll_directory
   https://issueantenna.com/repo/academysoftwarefoundation/imath/issues/238
   
   2) after loading the CUDA DLLs, the code now crashes with a mysterious error 
about "writing 0x0000000000000000", when in fact it is trying to load a version 
of mxnet_xx.dll which does not exist as a file on this computer. While 
automatically determining the version to use, it arrives at a DLL which does 
not exist and tries to load it anyway.
   
   ### Error Message
   
   **first:**
   `FileNotFoundError: Could not find module 
'F:\incubator-mxnet\build\Debug\libmxnet.dll' (or one of its dependencies). Try 
using the full path with constructor syntax.`
   
   **after adding cudart_xxx to the search path (this message is from Python 
3.10, it might vary based on your Python version):**
   `OSError: exception: access violation writing 0x0000000000000000`
   
   ### Steps to reproduce
   
   - Be on Windows 10 
   - Build with CPP_PACKAGE and CUDA enabled
   - Have a Python version >= 3.8 (bug in MXNet)
     OR
   - Have a GPU installed that is so old the installed CUDA version does not 
support it (needs documentation in MXNet)
   
   ## My Solution
   The solution for cudart DLL:
   - if you just need it to work, copy the DLL from the CUDA folder to right 
beside libmxnet.dll
   - to fix this in the Python code, the Python version has to be detected and 
then "os.add_dll_directory" called with the CUDA folder
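
A minimal sketch of that Python-side fix, assuming the CUDA installer has set the `CUDA_PATH` environment variable and that the runtime DLLs live in its `bin` subfolder. The helper names are hypothetical and are not part of MXNet:

```python
import os
import sys


def default_cuda_bin_dir(environ=None):
    """Guess the CUDA bin folder from CUDA_PATH (assumption: the installer set it)."""
    environ = os.environ if environ is None else environ
    cuda_path = environ.get("CUDA_PATH")
    return os.path.join(cuda_path, "bin") if cuda_path else None


def add_cuda_dll_directory(cuda_bin_dir):
    """Register cuda_bin_dir for DLL resolution on Windows with Python >= 3.8.

    Returns the handle from os.add_dll_directory, or None when the call does
    not apply (other platform, older Python, or a missing directory).
    """
    if sys.platform != "win32" or sys.version_info < (3, 8):
        return None
    if not cuda_bin_dir or not os.path.isdir(cuda_bin_dir):
        return None
    # os.add_dll_directory expects an absolute path.
    return os.add_dll_directory(os.path.abspath(cuda_bin_dir))
```

Calling `add_cuda_dll_directory(default_cuda_bin_dir())` before libmxnet.dll is loaded should let the loader find the cudart DLL again; on other platforms or older Python versions the call does nothing.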
   
   The solution to proceed with old hardware:
   - go into "warp_dll.cpp" and change the code that detects the version of 
mxnet_xx.dll. Hard-code it to the version that you have built, for example 
        'wsprintfW(dll_name, L"mxnet_52.dll", version);'
   - re-run the build; it should rebuild libmxnet.dll, and the Python step 
should then run
   - undo the change to the code, and rebuild again to go back to a normal 
binary that will detect the correct version of mxnet_xx.dll based on hardware 
(as long as you HAVE the hardware 🙃)
   
   ## Better solutions?
   - the solution to find the cudart DLL from Python is already "better" and 
should be implemented ASAP; Python 3.7 is the oldest version available on the 
Windows Store and will soon be deprecated
   - I think that it should be documented right at the top of the main CPP 
build instructions that having supported hardware is a hard requirement. It 
might seem obvious to some people, but there are situations where people like me 
might want to build the code even with no GPU at all. What would happen then? I 
don't know; someone who knows should write that into the instructions
   - MXNet could try a little harder when it is searching for DLLs: after 
finding out that it is about to load a DLL that does not exist, maybe spit out 
an error message with the name of the DLL and a hint to get better hardware or 
rebuild
   - The build could pull the same screwy crap that I tried to do to enable 
building on any platform with a non-suitable GPU - but maybe that's complicated?
   - MXNet could look at some configuration file when deciding what core DLL to 
choose, and if there is an overriding value it just uses that instead of 
autodetect.
   
   - note: you might be wondering "why not just build mxnet_30?" The answer is 
that it was deprecated in CUDA 11.2, so when you try to do that, it fails and 
says "that is fully deprecated". I would have to uninstall CUDA 11.2 and 
install CUDA 10. 
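
To illustrate the existence check and the override suggested in the list above, here is a rough sketch of the selection logic in Python. It is not how warp_dll.cpp works today: the function names, the "highest built version not above the device" rule, and the `override` parameter are all assumptions made for illustration.

```python
import os
import re

# Matches file names such as mxnet_52.dll and captures the version digits.
_DLL_PATTERN = re.compile(r"^mxnet_(\d+)\.dll$", re.IGNORECASE)


def built_archs(dll_dir):
    """List the versions for which an mxnet_XX.dll actually exists in dll_dir."""
    archs = []
    for name in os.listdir(dll_dir):
        match = _DLL_PATTERN.match(name)
        if match:
            archs.append(int(match.group(1)))
    return sorted(archs)


def choose_arch(available, device_arch, override=None):
    """Pick which mxnet_XX.dll to load, or raise an error that names the problem.

    available:   versions that were built, e.g. [52, 61]
    device_arch: compute capability of the GPU as an integer, e.g. 30 for 3.0
    override:    optional value from a (hypothetical) configuration file
    """
    if override is not None:
        if override not in available:
            raise FileNotFoundError(
                f"mxnet_{override}.dll was requested by override but was not "
                f"built; available: {available}")
        return override
    candidates = [arch for arch in available if arch <= device_arch]
    if not candidates:
        raise FileNotFoundError(
            f"no mxnet_XX.dll supports compute capability {device_arch}; "
            f"available: {available}. Rebuild for this GPU or set an override.")
    return max(candidates)
```

With a check like this, a build machine with an unsupported GPU would get an error naming the missing DLL instead of an access violation, and an override would allow the opwrapper step to run against whichever version was built.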
   
   ## Environment
   
   <details>
   <summary>Environment Information</summary>
   
   ```
   ----------Python Info----------
   Version      : 3.10.2
   Compiler     : MSC v.1929 64 bit (AMD64)
   Build        : ('tags/v3.10.2:a58ebcc', 'Jan 17 2022 14:12:15')
   Arch         : ('64bit', 'WindowsPE')
   ```
   
   </details>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


