> it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists 
> across the entire running time of an application. cupm_initialize contributes 
> 0.36GB out of 0.73GB.

If I had to guess, this may be the latent overhead of CUDA streams and events, 
but even then 360 MB seems ludicrous. CUDA maintains a persistent pool of 
streams that is not freed until cudaDeviceReset() is called. Maybe they 
initialize this pool immediately on start-up of the context? AFAIK there is no 
way to disable or modify this behavior.
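
For anyone who wants to reproduce the breakdown, here is a minimal sketch of attributing the overhead stage by stage, in the style of the test script further down the thread. The helper name attribute_memory and the injected reader are hypothetical (they are not part of PETSc or petsc4py). The replayed readings are the no-torch figures already reported in this thread (0.370GB with __initialize commented out, 0.730GB in total), not new measurements. On a real machine the reader would presumably wrap nvmlDeviceGetMemoryInfo(handle).used, as the test script does.

```python
from typing import Callable, Dict, List, Tuple

GB = 1e9  # same unit convention as the test script (info.used/1e9)


def attribute_memory(
    read_used_bytes: Callable[[], float],
    stages: List[Tuple[str, Callable[[], None]]],
) -> Dict[str, float]:
    """Run each stage in order and return the growth in used memory (GB) per stage."""
    deltas: Dict[str, float] = {}
    before = read_used_bytes()
    for name, run in stages:
        run()
        after = read_used_bytes()
        deltas[name] = (after - before) / GB
        before = after
    return deltas


# Hypothetical stand-in for the NVML query: replays figures quoted in this thread.
_readings = iter([0.000 * GB, 0.370 * GB, 0.730 * GB])


def fake_reader() -> float:
    return next(_readings)


def no_op() -> None:
    """Placeholder for a real stage such as petsc4py.init(sys.argv)."""


deltas = attribute_memory(
    fake_reader,
    [("everything except cupm_initialize", no_op), ("cupm_initialize", no_op)],
)
for name, gb in deltas.items():
    print("%s: %.3fGB" % (name, gb))
```

With a real reader, each stage would be a callable that does the actual work (importing torch, calling petsc4py.init, and so on), so the per-stage growth is read off one run instead of being inferred from several.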

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Jan 7, 2022, at 13:23, Zhang, Hong <hongzh...@anl.gov> wrote:
> 
> Apart from the 1.2GB caused by importing torch, it seems that PETSc consumes 
> 0.73GB CUDA memory and this overhead persists across the entire running time 
> of an application. cupm_initialize contributes 0.36GB out of 0.73GB. It is 
> still unclear what takes the remaining 0.37GB.
> 
> The torch issue is really a mystery. If I import torch only and do some 
> tensor operations on GPU, it consumes only 0.004GB CUDA memory.    
> 
> 
>> On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev 
>> <petsc-dev@mcs.anl.gov <mailto:petsc-dev@mcs.anl.gov>> wrote:
>> 
>> 
>> 1. Commenting out
>>    ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>    in device/impls/cupm/cupmcontext.hpp:L199
>> 
>> CUDA memory: 1.575GB
>> CUDA memory without importing torch:  0.370GB
>> 
>> This has the same effect as commenting out L437-L440 in interface/device.cxx 
>> 
>> 2. Commenting out these two:
>> . src/sys/objects/device/impls/cupm/cupmdevice.cxx:327
>>   [ierr = _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>> . src/sys/objects/device/impls/cupm/cupmdevice.cxx:326
>>   [ierr = _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>> 
>> CUDA memory: 1.936GB
>> CUDA memory without importing torch:   0.730GB
>> 
>>> On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch <jacob....@gmail.com 
>>> <mailto:jacob....@gmail.com>> wrote:
>>> 
>>>> They had no influence on the memory usage. 
>>> ???????????????????????????????????????????????????????????????????????
>>> 
>>> Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 
>>> 360 in cupmdevice.cxx as well.
>>> 
>>> Best regards,
>>> 
>>> Jacob Faibussowitsch
>>> (Jacob Fai - booss - oh - vitch)
>>> 
>>>> On Jan 7, 2022, at 12:18, Zhang, Hong <hongzh...@anl.gov 
>>>> <mailto:hongzh...@anl.gov>> wrote:
>>>> 
>>>> I have tried all of these. They had no influence on the memory usage. 
>>>> 
>>>>> On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch <jacob....@gmail.com 
>>>>> <mailto:jacob....@gmail.com>> wrote:
>>>>> 
>>>>>> Initializing cutlass and cusolver does not affect the memory usage. I 
>>>>>> did the following to turn them off:
>>>>> 
>>>>> Ok next things to try out in order:
>>>>> 
>>>>> 1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 
>>>>> [PetscFunctionBegin;] 
>>>>> Put a PetscFunctionReturn(0); right after this
>>>>> 
>>>>> 2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
>>>>> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>>>> Comment this out
>>>>> 
>>>>> 3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
>>>>> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>>>> Comment this out
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> Jacob Faibussowitsch
>>>>> (Jacob Fai - booss - oh - vitch)
>>>>> 
>>>>>> On Jan 7, 2022, at 12:02, Zhang, Hong <hongzh...@anl.gov 
>>>>>> <mailto:hongzh...@anl.gov>> wrote:
>>>>>> 
>>>>>> Initializing cutlass and cusolver does not affect the memory usage. I 
>>>>>> did the following to turn them off:
>>>>>> 
>>>>>> diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>> index 51fed809e4d..9a5f068323a 100644
>>>>>> --- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>> +++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>> @@ -199,7 +199,7 @@ inline PetscErrorCode CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept
>>>>>>  #if PetscDefined(USE_DEBUG)
>>>>>>    dci->timerInUse = PETSC_FALSE;
>>>>>>  #endif
>>>>>> -  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>> +  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>>    PetscFunctionReturn(0);
>>>>>>  }
>>>>>> 
>>>>>>> On Jan 7, 2022, at 10:53 AM, Barry Smith <bsm...@petsc.dev 
>>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>   I don't think this is right. We want the device initialized by PETSc; 
>>>>>>> we just don't want the cublas and cusolve stuff initialized, in order 
>>>>>>> to see how much memory initializing the blas and solvers takes.
>>>>>>> 
>>>>>>>   So I think you need to comment out things in cupminterface.hpp like 
>>>>>>> cublasCreate and cusolverDnCreate.
>>>>>>> 
>>>>>>>   Urgh, I hate C++ where huge chunks of real code are in header files.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch <jacob....@gmail.com 
>>>>>>>> <mailto:jacob....@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Hit send too early…
>>>>>>>> 
>>>>>>>> If you don’t want to comment out, you can also run with the 
>>>>>>>> "-device_enable lazy" option. Normally this is the default behavior, 
>>>>>>>> but if -log_view or -log_summary is provided it defaults to 
>>>>>>>> "-device_enable eager". See 
>>>>>>>> src/sys/objects/device/interface/device.cxx:398
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> 
>>>>>>>> Jacob Faibussowitsch
>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>> 
>>>>>>>>> On Jan 7, 2022, at 11:29, Jacob Faibussowitsch <jacob....@gmail.com 
>>>>>>>>> <mailto:jacob....@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>>> You need to go into the PetscInitialize() routine, find where it 
>>>>>>>>>> loads the cublas and cusolve, and comment out those lines, then run 
>>>>>>>>>> with -log_view
>>>>>>>>> 
>>>>>>>>> Comment out
>>>>>>>>> 
>>>>>>>>> #if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || PetscDefined(HAVE_SYCL))
>>>>>>>>>   ierr = PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>> #endif
>>>>>>>>> 
>>>>>>>>> At src/sys/objects/pinit.c:956
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> 
>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>> 
>>>>>>>>>> On Jan 7, 2022, at 11:24, Barry Smith <bsm...@petsc.dev 
>>>>>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Without -log_view it does not load any cuBLAS/cuSolve immediately; 
>>>>>>>>>> with -log_view it loads all that stuff at startup. You need to go 
>>>>>>>>>> into the PetscInitialize() routine, find where it loads the cublas 
>>>>>>>>>> and cusolve, and comment out those lines, then run with -log_view.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev 
>>>>>>>>>>> <petsc-dev@mcs.anl.gov <mailto:petsc-dev@mcs.anl.gov>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> When PETSc is initialized, it takes about 2GB CUDA memory. This is 
>>>>>>>>>>> way too much for doing nothing. A test script is attached to 
>>>>>>>>>>> reproduce the issue. If I remove the first line "import torch", 
>>>>>>>>>>> PETSc consumes about 0.73GB, which is still significant. Does 
>>>>>>>>>>> anyone have any idea about this behavior?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Hong
>>>>>>>>>>> 
>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples
>>>>>>>>>>>  (caidao22/update-examples)$ python3 test.py
>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>> CUDA memory after PETSc 0.004GB
>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples
>>>>>>>>>>>  (caidao22/update-examples)$ python3 test.py -log_view :0.txt
>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>> CUDA memory after PETSc 1.936GB
>>>>>>>>>>> 
>>>>>>>>>>> import torch
>>>>>>>>>>> import sys
>>>>>>>>>>> import os
>>>>>>>>>>> 
>>>>>>>>>>> import nvidia_smi
>>>>>>>>>>> nvidia_smi.nvmlInit()
>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>> print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))
>>>>>>>>>>> 
>>>>>>>>>>> petsc4py_path = os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
>>>>>>>>>>> sys.path.append(petsc4py_path)
>>>>>>>>>>> import petsc4py
>>>>>>>>>>> petsc4py.init(sys.argv)
>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>> print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
