Hi Ben,

sorry for the long silence. It does not mean that everything went smoothly; it was caused by the unavailability of our systems. We had a security incident and all our systems went offline two weeks ago. Next week I can (hopefully) test mutateLibcuda again.
Thanks for your reply; I'll report my results soonish.

Best wishes,
Ilya

On 11.05.20 22:34, Benjamin Welton wrote:
>> Do I need to use a compute node?
>
> Yes, you will need to use a compute node to run the tool. It executes
> a small CUDA program to determine the location of the synchronization
> function in libcuda. Without a CUDA-capable graphics card, this test
> program will likely exit immediately and give the error you are
> seeing. I would try running it on a compute node first before doing
> any other debugging.
>
> I have submitted a bug report on this issue, because we should print
> a warning when the tool is run on a system without a CUDA-capable
> graphics card instead of failing with a random error
> (https://github.com/dyninst/tools/issues/15).
>
>> X86 with GCC 8.3.0
>
> This should be fine; there are no known issues with the tool or
> Dyninst under GCC 8.3. However, I have CC'd Tim Haines here in case
> there is some issue with Dyninst and GCC 8.3 that I am not aware of.
>
>> What else can go wrong here?
>
> There should be no issue. As mentioned, the kernel runtime limit was
> very unlikely to apply to your machine, but I figured it was worth
> mentioning in case the machine had some really strange setup.
>
> Ben
>
> On Mon, May 11, 2020 at 2:52 PM Ilya Zhukov <i.zhu...@fz-juelich.de> wrote:
>
>> Hello Ben and Nisarg,
>>
>> thank you for your help.
>>
>>> This test program is rewritten by the tool (using Dyninst) and
>>> executed. Was there a core file that was created for a program
>>> called hang_devsync?
>>
>> I do not have any core file for "hang_devsync".
>>
>>> In any case, there are three likely causes of this test program
>>> crashing: 1) injecting the wrong libcuda.so into the test program.
>>> This can occur if a parallel file system is in use and it contains
>>> a libcuda that differs from the driver version in use by a compute
>>> node (note: despite its name, libcuda is not part of the CUDA
>>> toolkit; it is part of the GPU driver package itself). Check to
>>> make sure the libcuda the tool is detecting and injecting into the
>>> program matches the libcuda version that applications run on the
>>> node actually use (the simplest way to check this is to manually
>>> run hang_devsync on the compute node under GDB and check, using
>>> "info shared", which libcuda was dlopen'd by libcudart; this path
>>> should match what was displayed by the tool in its log).
>>
>> In both cases I use the same library. My installation was done on
>> the login nodes, where I do not have GPUs. Do I need to use a
>> compute node?
>>
>>> 2) Dyninst instrumentation error. What platform (x86, PPC, etc.)
>>> are you using this tool on?
>>
>> x86. I use JUWELS [1].
>>
>>> What version of Dyninst are you using?
>>
>> v10.1.0-41-g194dda7
>>
>>> What version of GCC/Clang is being used for compilation of Dyninst?
>>
>> GCC 8.3.0 (cmake/make logs attached)
>>
>>> 3) (unlikely, given that you appear to be running on a cluster) as
>>> Nisarg mentioned, there is a timeout for CUDA kernels that run
>>> longer than 5 seconds on machines that are using the Nvidia card
>>> as a display adapter. This is a problem for the test program,
>>> which spin-locks in a single kernel for a long time. You can test
>>> whether this is an issue by directly launching hang_devsync and
>>> seeing if it exits (this program will never return if it is
>>> working correctly).
>>
>> "hang_devsync" exits immediately when I execute it. And our GPU
>> experts say that there is no such thing as a kernel runtime limit
>> on JUWELS. What else can go wrong here?
>>
>> Thanks,
>> Ilya
>>
>> [1] https://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUWELS/Configuration/Configuration_node.html
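The two checks discussed above (is a CUDA device visible at all, and which libcuda.so actually gets loaded) can be combined in a small standalone probe. This is a hypothetical sketch, not part of cuda_sync_analyzer; the file name probe.c and the build line are assumptions, and it requires the CUDA driver API header and library:

/* Hypothetical standalone probe (not the tool's code):
 * 1) checks that a CUDA-capable device is actually visible, and
 * 2) reports which libcuda.so the dynamic linker resolved, via
 *    dladdr on a driver-API symbol.
 * Assumed build line: gcc probe.c -o probe -lcuda -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda.h>

int main(void) {
    int count = 0;
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGetCount(&count) != CUDA_SUCCESS || count == 0) {
        /* On a login node without a GPU this branch is taken, the
         * same situation that makes hang_devsync exit immediately. */
        fprintf(stderr, "no CUDA-capable device visible\n");
        return 1;
    }
    printf("%d CUDA device(s) visible\n", count);

    Dl_info info;
    /* cuCtxSynchronize is exported by libcuda.so, so dladdr reports
     * the shared object that actually provides it in this process. */
    if (dladdr((void *)cuCtxSynchronize, &info) && info.dli_fname)
        printf("libcuda in use: %s\n", info.dli_fname);
    return 0;
}

Run from a compute node, the libcuda path it prints should match both the path in the tool's log and what "info shared" reports under GDB.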
>> On 11.05.20 16:33, Benjamin Welton wrote:
>>> Hello Ilya,
>>>
>>> As Nisarg mentioned, the likely issue here is that the test program
>>> that is launched to determine the location of the internal
>>> synchronization function (hang_devsync) did not complete (most
>>> likely it crashed).
>>>
>>> Ben
>>>
>>> On Mon, May 11, 2020, 12:21 AM NISARG SHAH <nisa...@cs.wisc.edu> wrote:
>>>
>>>> Thanks Ilya!
>>>>
>>>> It looks like the instrumentation that figures out the
>>>> synchronization function in CUDA did not run completely to the
>>>> end (it takes around 20-30 minutes to finish).
>>>>
>>>> Do you know if the segfault occurs immediately (within 4-5 s)
>>>> after the last line is printed to the screen ("Inserting signal
>>>> start instra in main")? If so, the cause of the error might be
>>>> CUDA's kernel runtime limit. You might need to increase it or
>>>> disable it altogether.
>>>>
>>>> Regards,
>>>> Nisarg
>>>>
>>>> ------------------------------------------------------------------------
>>>> From: Ilya Zhukov
>>>> Sent: Sunday, May 10, 2020 4:52 AM
>>>> To: NISARG SHAH; dyninst-api@cs.wisc.edu
>>>> Subject: Re: [DynInst_API:] mutateLibcuda segfaults
>>>>
>>>> Hi Nisarg,
>>>>
>>>> I do not have an "MS_outputids.bin" file, but I have 5 *.dot
>>>> files in the directory where I ran the program.
>>>>
>>>> Cheers,
>>>> Ilya
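Whether the watchdog Nisarg refers to is active can also be queried per device rather than inferred. A minimal sketch, assuming the CUDA runtime library is available; the file name timeout_check.cu and the build line ("nvcc timeout_check.cu -o timeout_check") are made up:

/* Minimal sketch (not from the thread): ask each device whether the
 * kernel-execution watchdog is enabled. A spin-waiting kernel like
 * the one hang_devsync launches is killed after a few seconds on any
 * device where this prints "enabled" (typically GPUs driving a
 * display); on dedicated compute GPUs it normally prints "disabled". */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA device visible\n");
        return 1;
    }
    for (int d = 0; d < n; ++d) {
        int timeout = 0;
        cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, d);
        printf("device %d: kernel runtime limit %s\n",
               d, timeout ? "enabled" : "disabled");
    }
    return 0;
}

If the JUWELS GPU experts are right, this prints "disabled" on the compute nodes there.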
>>>> On 09.05.20 00:15, NISARG SHAH wrote:
>>>>> Hi Ilya,
>>>>>
>>>>> From the backtrace, it looks like the error is due to the
>>>>> program not being able to read from a temporary file,
>>>>> "MS_outputids.bin", that it creates initially. Can you check if
>>>>> it exists in the directory from where you ran the program? Also,
>>>>> can you check if 5 *.dot files are present in the same directory?
>>>>>
>>>>> Thanks,
>>>>> Nisarg
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> From: Dyninst-api <dyninst-api-boun...@cs.wisc.edu> on behalf of
>>>>> Ilya Zhukov <i.zhu...@fz-juelich.de>
>>>>> Sent: Wednesday, May 6, 2020 7:16 AM
>>>>> To: dyninst-api@cs.wisc.edu
>>>>> Subject: [DynInst_API:] mutateLibcuda segfaults
>>>>>
>>>>> Dear Dyninst developers,
>>>>>
>>>>> I'm testing your cuda_sync_analyzer tool on our cluster with
>>>>> CUDA/10.1.105.
>>>>>
>>>>> I installed Dyninst and cuda_sync_analyzer successfully (cmake
>>>>> and make logs attached), but I get a segmentation fault when I
>>>>> create the fake CUDA library.
>>>>>
>>>>> Here is a backtrace:
>>>>>> #0  0x00002b0a9658c4bc in fseek () from /usr/lib64/libc.so.6
>>>>>> #1  0x00002b0a93b7eb29 in LaunchIdentifySync::PostProcessing (this=this@entry=0x7fff1af88af0, allFound=...) at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/LaunchIdentifySync.cpp:90
>>>>>> #2  0x00002b0a93b7c00f in CSA_FindSyncAddress(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/FindCudaSync.cpp:34
>>>>>> #3  0x00000000004021fb in main () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/main.cpp:15
>>>>>> #4  0x00002b0a96537505 in __libc_start_main () from /usr/lib64/libc.so.6
>>>>>> #5  0x000000000040253e in _start () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/main.cpp:38
>>>>>
>>>>> Any help would be appreciated. If you need anything else, let me
>>>>> know.
>>>>>
>>>>> Best wishes,
>>>>> Ilya
>>>>> --
>>>>> Ilya Zhukov
>>>>> Juelich Supercomputing Centre
>>>>> Institute for Advanced Simulation
>>>>> Forschungszentrum Juelich GmbH
>>>>> 52425 Juelich, Germany
>>>>>
>>>>> Phone: +49-2461-61-2054
>>>>> Fax: +49-2461-61-2810
>>>>> E-mail: i.zhu...@fz-juelich.de
>>>>> WWW: http://www.fz-juelich.de/jsc
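A closing note on the backtrace: frame #0 is fseek, reached from LaunchIdentifySync::PostProcessing, and Nisarg's diagnosis was that the temporary file MS_outputids.bin was never created. A plausible but unverified reading is an fopen whose NULL result went unchecked before fseek. A generic illustration of that crash pattern and its guard, not the tool's actual code:

/* Generic illustration of the suspected failure mode (my inference,
 * not confirmed against the tool's source): fseek on a FILE* from a
 * failed fopen dereferences NULL, crashing as in frame #0 above. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("MS_outputids.bin", "rb");
    if (f == NULL) {        /* missing file: report instead of crashing */
        perror("MS_outputids.bin");
        return 1;
    }
    if (fseek(f, 0, SEEK_END) == 0)
        printf("size: %ld bytes\n", ftell(f));
    fclose(f);
    return 0;
}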