Hi all. If you are working on CPU/GPU performance on CUDA, this might be of interest.
Stay well.

--bart

--------------------------------------------------------------------------

SUMMARY

I wanted to share with you a library that can help you accurately profile CUDA applications, giving a complete picture of the time spent in each CUDA library function and, more importantly and interestingly, the amount of synchronization waiting time spent in each function.

Most of you are probably familiar with the current limitations of profiling tools for Nvidia GPUs. These tools rely on Nvidia's CUPTI performance data collection framework, which does not report CPU/GPU synchronization for most functions in the CUDA library (libcuda). Among approximately 450 CUDA API functions, CUPTI generates synchronization timing information for only two: cuStreamSynchronize and cuCtxSynchronize. Due to these limitations, existing tools provide incomplete synchronization times to the user.

DETAILS:

Our group has developed a tool that overcomes these limitations. The tool produces an instrumented version of your CUDA library, which in turn produces a profile of your application. You can use this library by itself or, if you are a tool developer, use it to enhance your data collection. The instrumentation is done directly on the library binary code.

The instrumented library profiles a CUDA application and produces a list of the CUDA API functions called by the application, along with each function's execution time and the time it spent in synchronization. The library also supports a callback mechanism to enable tracing at the granularity of a single CUDA function call on a per-thread basis. The library reports data in a way similar to CUPTI.

The output is in CSV format, which can either be consumed by another application for further analysis or be viewed in a human-friendly way using a script that we provide.

This is a new pre-release evaluation version. Please contact us if you have any questions, either at my email address or at dyninst-api@cs.wisc.edu.
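As a rough illustration of consuming the CSV output in another application, here is a minimal Python sketch that ranks functions by synchronization time. It is not the script shipped with the tool. The column names (function, total_time_ns, sync_time_ns), the time units, and the sample rows are all assumptions made up for this example; please check the user manual for the real layout before adapting it.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: these column names are hypothetical placeholders, not the
# tool's documented header. Adjust them to match the real CSV output.
FUNC_COL = "function"
TOTAL_COL = "total_time_ns"
SYNC_COL = "sync_time_ns"


def summarize(csv_text):
    """Sum calls, total time, and sync time per function; sort by sync time."""
    totals = defaultdict(lambda: [0, 0, 0])  # [calls, total time, sync time]
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row[FUNC_COL]]
        entry[0] += 1
        entry[1] += int(row[TOTAL_COL])
        entry[2] += int(row[SYNC_COL])
    ranked = sorted(totals.items(), key=lambda kv: kv[1][2], reverse=True)
    return [(name, calls, total, sync) for name, (calls, total, sync) in ranked]


def format_report(rows):
    """Render the summary as a fixed-width text table."""
    header = f"{'function':<24}{'calls':>6}{'total_ns':>12}{'sync_ns':>12}{'sync%':>8}"
    lines = [header]
    for name, calls, total, sync in rows:
        pct = 100.0 * sync / total if total else 0.0
        lines.append(f"{name:<24}{calls:>6}{total:>12}{sync:>12}{pct:>8.1f}")
    return "\n".join(lines)


# Made-up sample rows, for illustration only (not real measurements).
SAMPLE = """function,total_time_ns,sync_time_ns
cuMemcpyDtoH_v2,1200,900
cuMemFree_v2,400,350
cuLaunchKernel,50,0
cuMemcpyDtoH_v2,800,600
"""

if __name__ == "__main__":
    print(format_report(summarize(SAMPLE)))
```

To use it on a real profile, read the file contents and pass them to summarize() in place of SAMPLE. Sorting by synchronization time rather than total time puts the functions that spend the most time waiting at the top of the report.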
We'd love to have your feedback and suggestions.

INSTALLATION AND BUILD:

The tool depends on the Dyninst binary instrumentation framework (which can be installed by following the instructions at https://github.com/dyninst/dyninst/wiki/Building-Dyninst), Boost C++ libraries version 1.61 and (of course) a supported Nvidia GPU (4xx series GPU driver versions have been tested).

Once Dyninst is installed, the tool can be built and installed using cmake 3.1 or later as follows:

  $ export LD_LIBRARY_PATH=<DYNINST_INSTALL_PREFIX>/lib/:<BOOST_INSTALL_PREFIX>/install/lib/:/usr/lib/x86_64-linux-gnu/
  $ export DYNINSTAPI_RT_LIB=<DYNINST_INSTALL_PREFIX>/lib/libdyninstAPI_RT.so
  $ git clone https://github.com/dyninst/tools.git
  $ cd tools/cuda_sync_analyzer
  $ mkdir build && cd build
  $ cmake .. \
      -DDYNINST_ROOT=<DYNINST_INSTALL_PREFIX> \
      -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
      -DBOOST_LIBRARYDIR=<BOOST_INSTALL_PREFIX>/install/lib \
      -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo \
      -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
      -DCMAKE_INSTALL_PREFIX=<INSTALL_PREFIX>
  $ make && make install

RELEVANT LINKS:

Github repository - https://github.com/dyninst/tools/tree/master/cuda_sync_analyzer
User manual - https://docs.google.com/document/d/1h12Uq-cQyNSRuajZQo9bhcpFFPZVmL1g-ztCRifze5s

PAPERS THAT DESCRIBE THIS WORK:

Benjamin Welton and Barton P. Miller, "Exposing Hidden Performance Opportunities in High Performance GPU Applications", 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Washington, DC, May 2018. Best paper award.
ftp://ftp.cs.wisc.edu/paradyn/papers/welton-unobvious.pdf

Benjamin Welton and Barton P. Miller, "Diogenes: Looking For An Honest CPU/GPU Performance Measurement Tool", Supercomputing 2019 (SC2019), Denver, November 2019.
ftp://ftp.cs.wisc.edu/paradyn/technical_papers/diogenes-sc2019.pdf

Benjamin Welton and Barton P. Miller, "Identifying and (Automatically) Remedying Performance Problems in CPU/GPU Applications", International Conference on Supercomputing (ICS), Barcelona, Spain, June 2020.
ftp://ftp.cs.wisc.edu/paradyn/technical_papers/welton_autocorrect.pdf

_______________________________________________
Dyninst-api mailing list
Dyninst-api@cs.wisc.edu
https://lists.cs.wisc.edu/mailman/listinfo/dyninst-api