Hi Ben,

sorry for the long silence. It doesn't mean everything went
successfully; it was caused by the unavailability of our systems. We had
a security incident and all our systems went offline two weeks ago. Next
week I can (hopefully) test mutateLibcuda again.

Thanks for your reply; I'll report my results soon.

Best wishes,
Ilya

On 11.05.20 22:34, Benjamin Welton wrote:
>> Do I need to use a compute node?
> 
> Yes, you will need to use a compute node to run the tool. It executes a
> small CUDA program to determine the location of the synchronization
> function in libcuda. Without a CUDA-capable graphics card, this test
> program will likely exit immediately, producing the error you are
> seeing. I would try running it on a compute node before doing any other
> debugging.
> 
> I have submitted a bug report on this issue, because we should print a
> warning when the tool is run on a system without a CUDA-capable graphics
> card instead of failing with an obscure error
> ( https://github.com/dyninst/tools/issues/15 ).
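As a quick sanity check of the point above, one can probe whether the node has a usable CUDA driver at all before running the tool. The following is an illustrative sketch (not part of the tool) that calls the CUDA driver API via ctypes; it returns None on a machine without libcuda, e.g. a login node:

```python
import ctypes

def cuda_device_count():
    """Return the number of CUDA devices, or None when the driver
    library (libcuda) is unavailable or reports an error."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None                        # no GPU driver on this node
    if libcuda.cuInit(0) != 0:             # CUDA_SUCCESS == 0
        return None
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value

print(cuda_device_count())
```

If this prints None or 0, hang_devsync (and therefore the tool) cannot work on that node.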
> 
>> X86 with GCC 8.3.0
> 
> This should be fine; there are no known issues with the tool or Dyninst
> under GCC 8.3. However, I have CC'd Tim Haines here in case there is
> some issue with Dyninst and GCC 8.3 that I am not aware of.
> 
>> What else can go wrong here?
> 
> There should be no issue. As mentioned, the kernel runtime limit was
> very unlikely to apply to your machine, but I figured it was worth
> mentioning in case the machine had some really unusual setup.
> 
> Ben
> 
> 
> 
> 
> On Mon, May 11, 2020 at 2:52 PM Ilya Zhukov <i.zhu...@fz-juelich.de> wrote:
> 
>     Hello Ben and Nisarg,
> 
>     thank you for your help.
> 
>     > This test program is rewritten by the tool (using dyninst) and
>     > executed. Was there a core file that was created for a program
>     > called hang_devsync?
>     I do not have any core file for "hang_devsync".
> 
>     > In any case there are three likely causes of this test program
>     > crashing: 1) injecting the wrong libcuda.so into the test program.
>     > This can occur if a parallel file system is in use and it contains
>     > a libcuda that differs from the driver version in use by a compute
>     > node (note: despite its name, libcuda is not part of the CUDA
>     > toolkit; it is part of the GPU driver package itself). Check to
>     > make sure the libcuda the tool is detecting and injecting into the
>     > program matches the libcuda version applications run on the node
>     > actually use (the simplest way to check this is to manually run
>     > hang_devsync on the compute node under GDB and check, using info
>     > shared, what libcuda was dlopen'd by libcudart; this path should
>     > match what was displayed by the tool in its log).
>     In both cases I use the same library. My installation was on the
>     login nodes, where I do not have GPUs. Do I need to use a compute
>     node?
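The `info shared` check described above can also be scripted by reading /proc/<pid>/maps, which lists every shared object a running process has mapped. A minimal sketch (demonstrated here on the current process with libc, since the technique is identical for checking which libcuda hang_devsync mapped; Linux only):

```python
import os

def mapped_objects(pid, name):
    """Paths of shared objects mapped by a process whose file name
    contains `name` (substring match), read from /proc/<pid>/maps."""
    paths = set()                          # dedupe per-segment entries
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            parts = line.split()
            if parts and parts[-1].startswith("/") and name in parts[-1]:
                paths.add(parts[-1])
    return sorted(paths)

# For hang_devsync you would pass its PID and "libcuda"; here we
# demonstrate on the current process with libc.
print(mapped_objects(os.getpid(), "libc"))
```

The printed path is what should match the libcuda reported in the tool's log.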
> 
>     > 2) Dyninst instrumentation error. What platform (x86, PPC, etc.)
>     > are you using this tool on?
>     x86. I use JUWELS [1].
>     > What version of Dyninst are you using?
>     v10.1.0-41-g194dda7
>     > What version of GCC/Clang is being used for compilation of Dyninst?
>     GCC 8.3.0
>     (cmake and make logs attached)
> 
>     > 3) (unlikely given that you appear to be running on a cluster) as
>     > Nisarg mentioned, there is a timeout for CUDA kernels that run
>     > longer than 5 seconds on machines that are using the Nvidia card
>     > as a display adapter. This is a problem for the test program,
>     > which spin locks on a single kernel for a long time. You can test
>     > whether this is an issue by directly launching hang_devsync and
>     > seeing if it exits (this program will never return if it is
>     > working correctly).
>     "hang_devsync" exits immediately when I execute it. And our GPU
>     experts say that there is no such thing as a kernel runtime limit
>     on JUWELS.
>     What else can go wrong here?
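For completeness, the kernel-runtime-limit question can be settled programmatically: the driver exposes it as device attribute `CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT` (value 17 in cuda.h). A sketch via ctypes, returning None when no CUDA driver or device is available:

```python
import ctypes

CU_KERNEL_EXEC_TIMEOUT = 17  # CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT

def kernel_timeout_enabled(device=0):
    """True/False if the kernel runtime limit is enabled on `device`,
    or None when the CUDA driver is unavailable."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None
    if libcuda.cuInit(0) != 0:             # CUDA_SUCCESS == 0
        return None
    value = ctypes.c_int(0)
    if libcuda.cuDeviceGetAttribute(ctypes.byref(value),
                                    CU_KERNEL_EXEC_TIMEOUT, device) != 0:
        return None
    return bool(value.value)

print(kernel_timeout_enabled())
```

On a headless compute node this should print False (no display adapter, so no timeout), consistent with what the JUWELS GPU experts report.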
> 
>     Thanks,
>     Ilya
> 
>     [1]
>     
> https://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUWELS/Configuration/Configuration_node.html
> 
>     On 11.05.20 16:33, Benjamin Welton wrote:
>     > Hello Ilya,
>     >
>     > As Nisarg mentioned, the likely issue here is that the test
>     > program that is launched to determine the location of the internal
>     > synchronization function (hang_devsync) did not complete (most
>     > likely it crashed).
>     >
>     > This test program is rewritten by the tool (using dyninst) and
>     > executed. Was there a core file that was created for a program
>     > called hang_devsync?
>     >
>     > In any case there are three likely causes of this test program
>     > crashing:
>     > 1) injecting the wrong libcuda.so into the test program. This can
>     > occur if a parallel file system is in use and it contains a
>     > libcuda that differs from the driver version in use by a compute
>     > node (note: despite its name, libcuda is not part of the CUDA
>     > toolkit; it is part of the GPU driver package itself). Check to
>     > make sure the libcuda the tool is detecting and injecting into the
>     > program matches the libcuda version applications run on the node
>     > actually use (the simplest way to check this is to manually run
>     > hang_devsync on the compute node under GDB and check, using info
>     > shared, what libcuda was dlopen'd by libcudart; this path should
>     > match what was displayed by the tool in its log).
>     >
>     > 2) Dyninst instrumentation error. What platform (x86, PPC, etc.)
>     > are you using this tool on? What version of Dyninst are you using?
>     > What version of GCC/Clang is being used for compilation of Dyninst?
>     >
>     > 3) (unlikely given that you appear to be running on a cluster) as
>     > Nisarg mentioned, there is a timeout for CUDA kernels that run
>     > longer than 5 seconds on machines that are using the Nvidia card
>     > as a display adapter. This is a problem for the test program,
>     > which spin locks on a single kernel for a long time. You can test
>     > whether this is an issue by directly launching hang_devsync and
>     > seeing if it exits (this program will never return if it is
>     > working correctly).
>     >
>     > Ben
>     >
>     > On Mon, May 11, 2020, 12:21 AM NISARG SHAH <nisa...@cs.wisc.edu> wrote:
>     >
>     >     Thanks Ilya!
>     >
>     >     It looks like the instrumentation that figures out the
>     >     synchronization function in CUDA did not run completely to the
>     >     end (it takes around 20-30 minutes to finish).
>     >
>     >     Do you know if the segfault occurs immediately (within 4-5s)
>     >     after the last line is printed to the screen ("Inserting signal
>     >     start instra in main")? If so, the cause might be CUDA's
>     >     kernel runtime limit. You might need to increase or disable it
>     >     altogether.
>     >
>     >
>     >     Regards
>     >     Nisarg
>     >
>     >     ------------------------------------------------------------------------
>     >     *From:* Ilya Zhukov
>     >     *Sent:* Sunday, May 10, 2020 4:52 AM
>     >     *To:* NISARG SHAH; dyninst-api@cs.wisc.edu
>     >     *Subject:* Re: [DynInst_API:] mutateLibcuda segfaults
>     >
>     >     Hi Nisarg,
>     >
>     >     I do not have an "MS_outputids.bin" file, but I have 5 *.dot
>     >     files in the directory where I ran the program.
>     >
>     >     Cheers,
>     >     Ilya
>     >
>     >     On 09.05.20 00:15, NISARG SHAH wrote:
>     >     > Hi Ilya,
>     >     >
>     >     > From the backtrace, it looks like the error is due to the
>     >     > program not being able to read from a temporary file
>     >     > "MS_outputids.bin" that it creates initially. Can you check
>     >     > if it exists in the directory from where you ran the program?
>     >     > Also, can you check if 5 *.dot files are present in the same
>     >     > directory?
>     >     >
>     >     > Thanks
>     >     > Nisarg
>     >     >
>     >     >
>     >     > ------------------------------------------------------------------------
>     >     > *From:* Dyninst-api <dyninst-api-boun...@cs.wisc.edu> on
>     >     > behalf of Ilya Zhukov <i.zhu...@fz-juelich.de>
>     >     > *Sent:* Wednesday, May 6, 2020 7:16 AM
>     >     > *To:* dyninst-api@cs.wisc.edu
>     >     > *Subject:* [DynInst_API:] mutateLibcuda segfaults
>     >     >  
>     >     > Dear Dyninst developers,
>     >     >
>     >     > I'm testing your cuda_sync_analyzer tool on our cluster for
>     >     > CUDA/10.1.105.
>     >     >
>     >     > I installed Dyninst and cuda_sync_analyzer successfully
>     >     > (cmake and make logs attached), but I get a segmentation
>     >     > fault when I create the fake CUDA library.
>     >     >
>     >     > Here is a backtrace:
>     >     >> #0  0x00002b0a9658c4bc in fseek () from /usr/lib64/libc.so.6
>     >     >> #1  0x00002b0a93b7eb29 in LaunchIdentifySync::PostProcessing (this=this@entry=0x7fff1af88af0, allFound=...) at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/LaunchIdentifySync.cpp:90
>     >     >> #2  0x00002b0a93b7c00f in CSA_FindSyncAddress(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/FindCudaSync.cpp:34
>     >     >> #3  0x00000000004021fb in main () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/main.cpp:15
>     >     >> #4  0x00002b0a96537505 in __libc_start_main () from /usr/lib64/libc.so.6
>     >     >> #5  0x000000000040253e in _start () at /p/project/cslts/zhukov1/work/tools/dyninst/tools/cuda_sync_analyzer/src/main.cpp:38
>     >     >
>     >     > Any help would be appreciated. If you need anything else,
>     >     > let me know.
>     >     >
>     >     > Best wishes,
>     >     > Ilya
>     >     > --
>     >     > Ilya Zhukov
>     >     > Juelich Supercomputing Centre
>     >     > Institute for Advanced Simulation
>     >     > Forschungszentrum Juelich GmbH
>     >     > 52425 Juelich, Germany
>     >     >
>     >     > Phone: +49-2461-61-2054
>     >     > Fax: +49-2461-61-2810
>     >     > E-mail: i.zhu...@fz-juelich.de
>     >     > WWW: http://www.fz-juelich.de/jsc
>     >
> 
