> The memory overhead (for both CPU and GPU) of PyTorch is getting worse and
> worse as it evolves. A conjecture is that the CUDA kernels in the library are
> responsible for this. But the overhead for TensorFlow 2 is just around 300MB
> (compared to 1.5GB for PyTorch).

I read through the thread and the TL;DR is that CUDA loads all device
symbols/code when the CUDA runtime is initialized. Importantly, this behavior
is triggered not just when CUDA itself initializes, but also when any of the
derivative libraries (cublas, cusolver, etc.) are loaded.

This could really be shooting us in the foot here. Not only do we initialize
both cublas and cusolver, but we also pull in a __ton__ of thrust, whose
device-code footprint I can't imagine is negligible.
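
For anyone who wants to poke at this outside of PETSc, here is a rough,
untested sketch of how one might measure it. The library sonames
(libcudart.so/libcublas.so) and the cudaFree(0) trick to force context
creation are my assumptions; the NVML bindings are the same nvidia_smi ones
used in the test script further down the thread.

import ctypes
import nvidia_smi

nvidia_smi.nvmlInit()
nvml = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

def used_gb():
    return nvidia_smi.nvmlDeviceGetMemoryInfo(nvml).used / 1e9

print('baseline               %.3f GB' % used_gb())

# Force CUDA runtime/context creation only (no cublas/cusolver yet)
cudart = ctypes.CDLL('libcudart.so')           # assumed soname
cudart.cudaFree(ctypes.c_void_p(0))
print('after cudaFree(0)      %.3f GB' % used_gb())

# Now add a cublas handle on top of the bare context
cublas = ctypes.CDLL('libcublas.so')           # assumed soname
blas_handle = ctypes.c_void_p()
cublas.cublasCreate_v2(ctypes.byref(blas_handle))
print('after cublasCreate_v2  %.3f GB' % used_gb())

The difference between the last two numbers would roughly isolate what the
cublas module load costs on top of plain runtime initialization.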

Not sure how to square this, or even if we can resolve this on our side. Good
news is that this finally explains the random CI failures due to running out
of memory!

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Jan 8, 2022, at 13:36, Zhang, Hong <hongzh...@anl.gov> wrote:
> 
> Here is an interesting thread discussing the memory issue for PyTorch (which 
> I think is also relevant to PETSc):
> 
> https://github.com/pytorch/pytorch/issues/12873
> 
> The memory overhead (for both CPU and GPU) of PyTorch is getting worse and
> worse as it evolves. A conjecture is that the CUDA kernels in the library are
> responsible for this. But the overhead for TensorFlow 2 is just around 300MB
> (compared to 1.5GB for PyTorch).
> 
> According to the discussion, there is not yet a good way to reduce the memory
> overhead of PyTorch. Someone noticed that "removing half of the CUDA kernels
> can reduce the memory usage by half."
> 
> Hong
> 
>> On Jan 7, 2022, at 9:23 PM, Barry Smith <bsm...@petsc.dev> wrote:
>> 
>> 
>>   Doesn't Nvidia supply a "valgrind"-like tool that allows tracking memory
>> usage? I'm pretty sure I've seen one; it should be able to show memory usage
>> as a function of time so you can see where the memory is being allocated.
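>> 
>>   Lacking that, a quick hack along these lines (untested sketch; it just
>> polls NVML from a background thread using the same nvidia_smi bindings as
>> Hong's script, and the helper name and sampling interval are arbitrary)
>> would at least give memory as a function of time:
>> 
>> import time, threading
>> import nvidia_smi
>> 
>> def track_gpu_memory(samples, stop, interval=0.1, device=0):
>>     nvidia_smi.nvmlInit()
>>     handle = nvidia_smi.nvmlDeviceGetHandleByIndex(device)
>>     t0 = time.time()
>>     while not stop.is_set():
>>         used = nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used
>>         samples.append((time.time() - t0, used / 1e9))  # (seconds, GB)
>>         time.sleep(interval)
>> 
>> samples, stop = [], threading.Event()
>> t = threading.Thread(target=track_gpu_memory, args=(samples, stop), daemon=True)
>> t.start()
>> # ... run the code under investigation here, e.g. petsc4py.init(sys.argv) ...
>> stop.set(); t.join()
>> for sec, gb in samples:
>>     print('%7.2fs  %.3f GB' % (sec, gb))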
>>   
>>   Barry
>> 
>> 
>>> On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>> 
>>>> it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists 
>>>> across the entire running time of an application. cupm_initialize 
>>>> contributes 0.36GB out of 0.73GB.
>>> 
>>> If I had to guess, this may be the latent overhead of CUDA streams and
>>> events, but even then 360 MB seems ludicrous. CUDA maintains a persistent 
>>> pool of streams that is not freed until cudaDeviceReset() is called. Maybe 
>>> they initialize this pool immediately on start-up of the context? AFAIK 
>>> there is no way to disable or modify this behavior.
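>>> 
>>> One way to poke at that conjecture (untested sketch; the soname and the
>>> stream count are arbitrary) would be to create a pile of streams through
>>> ctypes and watch device memory before/after destroying them and after a
>>> cudaDeviceReset():
>>> 
>>> import ctypes
>>> import nvidia_smi
>>> 
>>> nvidia_smi.nvmlInit()
>>> nvml = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>> used_gb = lambda: nvidia_smi.nvmlDeviceGetMemoryInfo(nvml).used / 1e9
>>> 
>>> cudart = ctypes.CDLL('libcudart.so')  # assumed soname
>>> streams = []
>>> for _ in range(128):
>>>     s = ctypes.c_void_p()
>>>     cudart.cudaStreamCreate(ctypes.byref(s))
>>>     streams.append(s)
>>> print('with 128 streams    %.3f GB' % used_gb())
>>> 
>>> for s in streams:
>>>     cudart.cudaStreamDestroy(s)
>>> print('after destroy       %.3f GB' % used_gb())
>>> 
>>> cudart.cudaDeviceReset()
>>> print('after device reset  %.3f GB' % used_gb())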
>>> 
>>> Best regards,
>>> 
>>> Jacob Faibussowitsch
>>> (Jacob Fai - booss - oh - vitch)
>>> 
>>>> On Jan 7, 2022, at 13:23, Zhang, Hong <hongzh...@anl.gov> wrote:
>>>> 
>>>> Apart from the 1.2GB caused by importing torch, it seems that PETSc 
>>>> consumes 0.73GB CUDA memory and this overhead persists across the entire 
>>>> running time of an application. cupm_initialize contributes 0.36GB out of 
>>>> 0.73GB. It is still unclear what takes the remaining 0.37GB.
>>>> 
>>>> The torch issue is really a mystery. If I only import torch and do some
>>>> tensor operations on the GPU, it consumes only 0.004GB of CUDA memory.
>>>> 
>>>> 
>>>>> On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>>> 
>>>>> 
>>>>> 1. Commenting out  ierr = 
>>>>> __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr); in 
>>>>> device/impls/cupm/cupmcontext.hpp:L199
>>>>> 
>>>>> CUDA memory: 1.575GB
>>>>> CUDA memory without importing torch:  0.370GB
>>>>> 
>>>>> This has the same effect as commenting out L437-L440 in 
>>>>> interface/device.cxx 
>>>>> 
>>>>> 2. Comment out these two: 
>>>>> . src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
>>>>> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>>>> . src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
>>>>> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>>>> 
>>>>> CUDA memory: 1.936GB
>>>>> CUDA memory without importing torch:   0.730GB
>>>>> 
>>>>>> On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>>> 
>>>>>>> They had no influence on the memory usage.
>>>>>> ???????????????????????????????????????????????????????????????????????
>>>>>> 
>>>>>> Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 
>>>>>> 360 in cupmdevice.cxx as well.
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Jacob Faibussowitsch
>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>> 
>>>>>>> On Jan 7, 2022, at 12:18, Zhang, Hong <hongzh...@anl.gov> wrote:
>>>>>>> 
>>>>>>> I have tried all of these. They had no influence on the memory usage.
>>>>>>> 
>>>>>>>> On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Initializing cublas and cusolver does not affect the memory usage. I
>>>>>>>>> did the following to turn them off:
>>>>>>>> 
>>>>>>>> Ok next things to try out in order:
>>>>>>>> 
>>>>>>>> 1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 
>>>>>>>> [PetscFunctionBegin;] 
>>>>>>>> Put a PetscFunctionReturn(0); right after this
>>>>>>>> 
>>>>>>>> 2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
>>>>>>>> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>>>>>>> Comment this out
>>>>>>>> 
>>>>>>>> 3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
>>>>>>>> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>>>>>>> Comment this out
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> 
>>>>>>>> Jacob Faibussowitsch
>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>> 
>>>>>>>>> On Jan 7, 2022, at 12:02, Zhang, Hong <hongzh...@anl.gov> wrote:
>>>>>>>>> 
>>>>>>>>> Initializing cublas and cusolver does not affect the memory usage. I
>>>>>>>>> did the following to turn them off:
>>>>>>>>> 
>>>>>>>>> diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp 
>>>>>>>>> b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>>>> index 51fed809e4d..9a5f068323a 100644
>>>>>>>>> --- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>>>> +++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>>>> @@ -199,7 +199,7 @@ inline PetscErrorCode 
>>>>>>>>> CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept
>>>>>>>>>  #if PetscDefined(USE_DEBUG)
>>>>>>>>>    dci->timerInUse = PETSC_FALSE;
>>>>>>>>>  #endif
>>>>>>>>> -  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>>>>> +  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>>>>>    PetscFunctionReturn(0);
>>>>>>>>>  }
>>>>>>>>> 
>>>>>>>>>> On Jan 7, 2022, at 10:53 AM, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>   I don't think this is right. We want the device initialized by
>>>>>>>>>> PETSc; we just don't want the cublas and cusolver stuff initialized,
>>>>>>>>>> in order to see how much memory initializing the blas and solvers
>>>>>>>>>> takes.
>>>>>>>>>> 
>>>>>>>>>>   So I think you need to comment out things in cupminterface.hpp like
>>>>>>>>>> cublasCreate and cusolverDnCreate.
>>>>>>>>>> 
>>>>>>>>>>   Urgh, I hate C++ where huge chunks of real code are in header 
>>>>>>>>>> files.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hit send too early…
>>>>>>>>>>> 
>>>>>>>>>>> If you don't want to comment things out, you can also run with the
>>>>>>>>>>> "-device_enable lazy" option. Normally this is the default behavior,
>>>>>>>>>>> but if -log_view or -log_summary is provided it defaults to
>>>>>>>>>>> "-device_enable eager". See
>>>>>>>>>>> src/sys/objects/device/interface/device.cxx:398
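>>>>>>>>>>> 
>>>>>>>>>>> For example (illustrative only, reusing the test script from later
>>>>>>>>>>> in the thread), the option can be appended to the args passed to
>>>>>>>>>>> petsc4py:
>>>>>>>>>>> 
>>>>>>>>>>> # keep -log_view but force lazy device initialization
>>>>>>>>>>> petsc4py.init(sys.argv + ['-device_enable', 'lazy'])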
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> 
>>>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 7, 2022, at 11:29, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> You need to go into the PetscInitialize() routine, find where it
>>>>>>>>>>>>> loads cublas and cusolver, comment out those lines, and then run
>>>>>>>>>>>>> with -log_view
>>>>>>>>>>>> 
>>>>>>>>>>>> Comment out
>>>>>>>>>>>> 
>>>>>>>>>>>> #if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || PetscDefined(HAVE_SYCL))
>>>>>>>>>>>>   ierr = PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>>>>> #endif
>>>>>>>>>>>> 
>>>>>>>>>>>> At src/sys/objects/pinit.c:956
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> 
>>>>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jan 7, 2022, at 11:24, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Without -log_view it does not load any cuBLAS/cuSolver immediately;
>>>>>>>>>>>>> with -log_view it loads all that stuff at startup. You need to go
>>>>>>>>>>>>> into the PetscInitialize() routine, find where it loads cublas and
>>>>>>>>>>>>> cusolver, comment out those lines, and then run with -log_view
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> When PETSc is initialized, it takes about 2GB of CUDA memory. This
>>>>>>>>>>>>>> is way too much for doing nothing. A test script is attached to
>>>>>>>>>>>>>> reproduce the issue. If I remove the first line “import torch”,
>>>>>>>>>>>>>> PETSc consumes about 0.73GB, which is still significant. Does
>>>>>>>>>>>>>> anyone have any idea about this behavior?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Hong
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples
>>>>>>>>>>>>>>  (caidao22/update-examples)$ python3 test.py
>>>>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>>>>> CUDA memory after PETSc 0.004GB
>>>>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples
>>>>>>>>>>>>>>  (caidao22/update-examples)$ python3 test.py -log_view :0.txt
>>>>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>>>>> CUDA memory after PETSc 1.936GB
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> import torch
>>>>>>>>>>>>>> import sys
>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> import nvidia_smi
>>>>>>>>>>>>>> nvidia_smi.nvmlInit()
>>>>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>>>>> print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> petsc4py_path = os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
>>>>>>>>>>>>>> sys.path.append(petsc4py_path)
>>>>>>>>>>>>>> import petsc4py
>>>>>>>>>>>>>> petsc4py.init(sys.argv)
>>>>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>>>>> print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
