> On Mar 21, 2016, at 9:50 PM, Satish Balay <[email protected]> wrote:
>
> BTW: perils of using 'gitcommit=origin/master'
> http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2016/03/21/master.html
>
> Perhaps we should switch superlu_dist to use a working snapshot?
> self.gitcommit = '35c3b21630d93b3f8392a68e607467c247b5e053'
>
> balay@asterix /home/balay/petsc (master=)
> $ git grep origin/master config
> config/BuildSystem/config/packages/Chombo.py:       self.gitcommit = 'origin/master'
> config/BuildSystem/config/packages/SuperLU_DIST.py: self.gitcommit = 'origin/master'
> config/BuildSystem/config/packages/amanzi.py:       self.gitcommit = 'origin/master'
> config/BuildSystem/config/packages/saws.py:         self.gitcommit = 'origin/master'
Satish and Hong,

  Sherry has changed SuperLU_dist so that it no longer has name conflicts with SuperLU; this means we need to update the SuperLU_dist interface to fix these problems. Once things have settled down we can use a release commit instead of master.

  Barry

>
> Satish
>
> On Mon, 21 Mar 2016, Satish Balay wrote:
>
>> Hm - get a gtx 950 [2GB] and replace? [or gtx 970 4GB?]
>>
>> http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-950/specifications
>> http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
>>
>> There is a different machine with 2 M2090 cards - so I'll switch the
>> builds to that [es.mcs.anl.gov]. I was previously avoiding builds on
>> it - as it's used as a general-use machine [and perhaps occasionally
>> for benchmark runs]
>>
>> Satish
>>
>> On Mon, 21 Mar 2016, Dominic Meiser wrote:
>>
>>> Hi Jiri,
>>>
>>> Thanks very much for the fast response. That's very useful
>>> information. I had no idea the memory footprint of the contexts
>>> was this large.
>>>
>>> Satish, Barry, is there any chance we can upgrade the GPU in the
>>> test machine to at least the Fermi generation? That way I can help
>>> much more easily because I'd be able to reproduce your setup
>>> locally.
>>>
>>> Cheers,
>>> Dominic
>>>
>>> On Mon, Mar 21, 2016 at 08:01:10PM +0000, Jiri Kraus wrote:
>>>> Hi Dominic,
>>>>
>>>> I think the error messages you get are pretty descriptive regarding the
>>>> root cause. You are probably running out of GPU memory. Since you are
>>>> running on a GTX 285 you can't use MPS [1], so each MPI process has
>>>> its own context on the GPU. Each context needs to initialize some data on
>>>> the GPU (used for local variables and so on). The amount required for
>>>> this depends on the size of the GPU (it essentially correlates with the
>>>> maximum number of concurrently active threads). This can easily be
>>>> 50-100MB. So with only 1GB of GPU memory you are probably using all of the
>>>> GPU's memory for context data and nothing is available for your application.
>>>> Unfortunately there is no good way to debug this with GeForce. On Tesla,
>>>> nvidia-smi does show you all processes that have a context on a GPU
>>>> together with their memory consumption.
>>>>
>>>> Hope this helps
>>>>
>>>> Jiri
>>>>
>>>>
>>>> [1] https://docs.nvidia.com/deploy/mps/index.html
>>>>
>>>>> -----Original Message-----
>>>>> From: Dominic Meiser [mailto:[email protected]]
>>>>> Sent: Monday, March 21, 2016 19:17
>>>>> To: Jiri Kraus <[email protected]>
>>>>> Cc: Karl Rupp <[email protected]>; Barry Smith <[email protected]>;
>>>>> [email protected]
>>>>> Subject: mpi/cuda issue
>>>>>
>>>>> Hi Jiri,
>>>>>
>>>>> Hope things are going well. We are trying to understand an
>>>>> mpi+cuda issue in the tests of the PETSc library and I was
>>>>> wondering if you could help us out.
>>>>>
>>>>> The behavior we're seeing is that some of the tests fail intermittently
>>>>> with "out of memory" errors, e.g.
>>>>>
>>>>> terminate called after throwing an instance of
>>>>> 'thrust::system::detail::bad_alloc'
>>>>>   what():  std::bad_alloc: out of memory
>>>>>
>>>>> Other tests hang when we oversubscribe the GPU with a largish number of
>>>>> MPI processes (32 in one case). Satish obtained info on the GPU
>>>>> configuration using nvidia-smi below.
>>>>>
>>>>> Could you remind us what the requirements for MPI+cuda are, especially
>>>>> regarding oversubscription?
>>>>>
>>>>> Are there any other tools we can use to debug this problem? Any
>>>>> suggestions on what we should look at next?
>>>>>
>>>>> Thanks very much in advance.
>>>>> Cheers,
>>>>> Dominic
>>>>>
>>>>> On Mon, Mar 21, 2016 at 01:09:14PM -0500, Satish Balay wrote:
>>>>>> balay@frog ~ $ nvidia-smi
>>>>>> Mon Mar 21 13:07:36 2016
>>>>>> +------------------------------------------------------+
>>>>>> | NVIDIA-SMI 340.93     Driver Version: 340.93         |
>>>>>> |-------------------------------+----------------------+----------------------+
>>>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>>>> |===============================+======================+======================|
>>>>>> |   0  GeForce GTX 285     Off  | 0000:03:00.0     N/A |                  N/A |
>>>>>> | 40%   66C    P0    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
>>>>>> +-------------------------------+----------------------+----------------------+
>>>>>>
>>>>>> +-----------------------------------------------------------------------------+
>>>>>> | Compute processes:                                               GPU Memory |
>>>>>> |  GPU       PID  Process name                                     Usage      |
>>>>>> |=============================================================================|
>>>>>> |    0            Not Supported                                               |
>>>>>> +-----------------------------------------------------------------------------+
>>>>>>
>>>>>>
>>>>>> balay@frog ~/soft/NVIDIA_CUDA-5.5_Samples/bin/x86_64/linux/release $ ./deviceQuery
>>>>>> ./deviceQuery Starting...
>>>>>>
>>>>>>  CUDA Device Query (Runtime API) version (CUDART static linking)
>>>>>>
>>>>>> Detected 1 CUDA Capable device(s)
>>>>>>
>>>>>> Device 0: "GeForce GTX 285"
>>>>>>   CUDA Driver Version / Runtime Version          6.5 / 5.5
>>>>>>   CUDA Capability Major/Minor version number:    1.3
>>>>>>   Total amount of global memory:                 1024 MBytes (1073414144 bytes)
>>>>>>   (30) Multiprocessors, (  8) CUDA Cores/MP:     240 CUDA Cores
>>>>>>   GPU Clock rate:                                1476 MHz (1.48 GHz)
>>>>>>   Memory Clock rate:                             1242 Mhz
>>>>>>   Memory Bus Width:                              512-bit
>>>>>>   Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
>>>>>>   Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
>>>>>>   Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
>>>>>>   Total amount of constant memory:               65536 bytes
>>>>>>   Total amount of shared memory per block:       16384 bytes
>>>>>>   Total number of registers available per block: 16384
>>>>>>   Warp size:                                     32
>>>>>>   Maximum number of threads per multiprocessor:  1024
>>>>>>   Maximum number of threads per block:           512
>>>>>>   Max dimension size of a thread block (x,y,z):  (512, 512, 64)
>>>>>>   Max dimension size of a grid size (x,y,z):     (65535, 65535, 1)
>>>>>>   Maximum memory pitch:                          2147483647 bytes
>>>>>>   Texture alignment:                             256 bytes
>>>>>>   Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
>>>>>>   Run time limit on kernels:                     No
>>>>>>   Integrated GPU sharing Host Memory:            No
>>>>>>   Support host page-locked memory mapping:       Yes
>>>>>>   Alignment requirement for Surfaces:            Yes
>>>>>>   Device has ECC support:                        Disabled
>>>>>>   Device supports Unified Addressing (UVA):      No
>>>>>>   Device PCI Bus ID / PCI location ID:           3 / 0
>>>>>>   Compute Mode:
>>>>>>      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>>>>>>
>>>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce GTX 285
>>>>>> Result = PASS
>>>>>> balay@frog ~/soft/NVIDIA_CUDA-5.5_Samples/bin/x86_64/linux/release $
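A minimal sketch of the per-rank check that Jiri's explanation points to: each MPI rank forces creation of its CUDA context and then reports free vs. total device memory with cudaMemGetInfo, so the 50-100MB per-context overhead on the 1023MiB card shown above can be measured directly. This is illustrative only; the file name and build line are assumptions, not anything taken from PETSc or the logs above.

/* ctx_overhead.c - rough check of per-process CUDA context overhead.
 * Illustrative sketch; assumes MPI and the CUDA runtime are installed.
 * Build (flags are placeholders):  mpicc ctx_overhead.c -lcudart -o ctx_overhead
 * Run:                             mpiexec -n 8 ./ctx_overhead
 */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
  int         rank, size;
  size_t      freeb = 0, totalb = 0;
  cudaError_t err;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Force creation of this rank's context on device 0 */
  err = cudaSetDevice(0);
  if (err == cudaSuccess) err = cudaFree(0);

  /* Wait until every rank has (tried to) create its context */
  MPI_Barrier(MPI_COMM_WORLD);

  if (err != cudaSuccess) {
    printf("[%d/%d] context creation failed: %s\n", rank, size, cudaGetErrorString(err));
  } else if (cudaMemGetInfo(&freeb, &totalb) == cudaSuccess) {
    /* With all contexts resident, 'free' shows what is left of the card
       after the per-context overhead Jiri describes */
    printf("[%d/%d] free %zu MiB of %zu MiB\n", rank, size,
           freeb/(1024*1024), totalb/(1024*1024));
  } else {
    printf("[%d/%d] cudaMemGetInfo failed\n", rank, size);
  }

  MPI_Finalize();
  return 0;
}

Running this with increasing -n would show at which process count the free memory collapses, which is the regime where the thrust bad_alloc failures described above are to be expected.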
>>>>>>
>>>>>> On Mon, 21 Mar 2016, Dominic Meiser wrote:
>>>>>>
>>>>>>> I have used oversubscription of GPUs fairly routinely but it
>>>>>>> requires driver support (and I think at some point it also required
>>>>>>> a patched mpich, but that requirement is gone AFAIK). I don't
>>>>>>> remember what driver version is needed. Can you get the driver
>>>>>>> version on the test machine with nvidia-smi?
>>>>>>>
>>>>>>> Also, oversubscription by such a large factor could be an issue.
>>>>>>> But given that the example doesn't actually use GPUs one would hope
>>>>>>> that it shouldn't matter ...
>>>>>>>
>>>>>>> Karl, have you been able to reproduce this issue on a different
>>>>>>> machine? Or any idea what's needed to reproduce the failures?
>>>>>>> I can try and hunt down an sm_13 GPU but if there's an easier way to
>>>>>>> reproduce that would be great.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Dominic
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 21, 2016 at 12:11:08PM -0500, Satish Balay wrote:
>>>>>>>> I attempted to manually run the tests after the reboot - and then
>>>>>>>> they crashed/hung at:
>>>>>>>>
>>>>>>>> [14]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>>>>>> [14]PETSC ERROR: Error in external library
>>>>>>>> [14]PETSC ERROR: CUBLAS error 1
>>>>>>>> [14]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>>>>> [14]PETSC ERROR: Petsc Development GIT revision: pre-tsfc-2225-g6da9565  GIT Date: 2016-03-20 23:47:14 -0500
>>>>>>>> [14]PETSC ERROR: ./ex36 on a arch-cuda-double named frog by balay Mon Mar 21 10:49:24 2016
>>>>>>>> [14]PETSC ERROR: Configure options --with-cuda=1 --with-cusp=1
>>>>>>>> -with-cusp-dir=/home/balay/soft/cusplibrary-0.4.0 --with-thrust=1
>>>>>>>> --with-precision=double --with-clanguage=c --with-cuda-arch=sm_13
>>>>>>>> --with-no-output -PETSC_ARCH=arch-cuda-double
>>>>>>>> -PETSC_DIR=/home/balay/petsc.clone
>>>>>>>> [14]PETSC ERROR: #1 PetscInitialize() line 922 in /home/balay/petsc.clone/src/sys/objects/pinit.c
>>>>>>>>
>>>>>>>>
>>>>>>>> This one does: 'mpiexec -n 32 ./ex36'
>>>>>>>>
>>>>>>>> Is such oversubscription of the GPU supposed to work? BTW: I don't
>>>>>>>> think this example is using cuda [but there is still cublas
>>>>>>>> initialization?]
>>>>>>>>
>>>>>>>> I've rebooted the machine again - and the 'day' builds have just
>>>>>>>> started..
>>>>>>>>
>>>>>>>> Satish
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 21 Mar 2016, Karl Rupp wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> the reboot may help, yes. I've observed such weird test failures
>>>>>>>>> twice over the years. In both cases they were gone after
>>>>>>>>> powering the machine off and powering it on again (at least in
>>>>>>>>> one case it was not sufficient to reboot).
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Karli
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 03/21/2016 04:38 PM, Satish Balay wrote:
>>>>>>>>>> The test mode [on this machine] didn't change in the past few months..
>>>>>>>>>>
>>>>>>>>>> I've rebooted the box now..
>>>>>>>>>>
>>>>>>>>>> Satish
>>>>>>>>>>
>>>>>>>>>> On Mon, 21 Mar 2016, Dominic Meiser wrote:
>>>>>>>>>>
>>>>>>>>>>> Really odd that these out-of-memory errors are occurring now.
>>>>>>>>>>> AFAIK nothing related to this has changed in the code. Are
>>>>>>>>>>> the tests run any differently? Perhaps more tests in
>>>>>>>>>>> parallel? Is it possible to reset the driver or to reboot the test machine?
>>>>>>>>>>>
>>>>>>>>>>> Dominic
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 20, 2016 at 09:12:36PM -0500, Barry Smith wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> ftp://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2016/03/20/examples_master_arch-cuda-double_frog.log
>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Dominic Meiser
>>>>> Tech-X Corporation - 5621 Arapahoe Avenue - Boulder, CO 80303
>>>>
>>>> NVIDIA GmbH, Wuerselen, Germany, Amtsgericht Aachen, HRB 8361
>>>> Managing Director: Karen Theresa Burns
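Relatedly, the "CUBLAS error 1" that Satish hit inside PetscInitialize() with 'mpiexec -n 32 ./ex36' can be exercised outside PETSc. The sketch below is illustrative only (the file name and build line are assumptions, and this is not PETSc code): every rank attempts cublasCreate() on the shared card, which involves the same implicit context creation that fails when the GeForce board is oversubscribed.

/* cublas_init_test.c - stand-alone check of per-rank cuBLAS handle creation.
 * Illustrative sketch; build line is a placeholder:
 *   mpicc cublas_init_test.c -lcublas -lcudart -o cublas_init_test
 *   mpiexec -n 32 ./cublas_init_test
 */
#include <stdio.h>
#include <mpi.h>
#include <cublas_v2.h>

int main(int argc, char **argv)
{
  int            rank, size;
  cublasHandle_t handle;
  cublasStatus_t stat;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* cublasCreate() implicitly creates this rank's CUDA context; with many
     ranks sharing a 1GB GeForce card the later ranks should fail here,
     mirroring the CUBLAS error reported inside PetscInitialize() above */
  stat = cublasCreate(&handle);
  if (stat != CUBLAS_STATUS_SUCCESS) {
    printf("[%d/%d] cublasCreate failed with status %d\n", rank, size, (int)stat);
  } else {
    printf("[%d/%d] cublasCreate succeeded\n", rank, size);
    cublasDestroy(handle);
  }

  MPI_Finalize();
  return 0;
}

If this stand-alone test fails at the same process counts as ex36, the problem is the per-context GPU memory on the 1GB card rather than anything PETSc-specific.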
