Doesn't Nvidia supply a "valgrind"-like tool that allows tracking memory 
usage? I'm pretty sure I've seen one; it should be able to show memory usage as 
a function of time so you can see where the memory is being allocated.
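
In the meantime, a crude sketch (untested) for watching usage over time, using 
the same nvidia_smi bindings as Hong's test script below and run in a second 
process alongside the application:

import time
import nvidia_smi

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
t0 = time.time()
while True:
    # report elapsed time and used device memory in GB twice per second
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    print('%8.2f s  %.3f GB' % (time.time() - t0, info.used/1e9))
    time.sleep(0.5)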
  
  Barry


> On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
> 
>> it seems that PETSc consumes 0.73GB of CUDA memory, and this overhead 
>> persists across the entire running time of an application. cupm_initialize 
>> contributes 0.36GB out of 0.73GB.
> 
> If I had to guess, this may be the latent overhead of CUDA streams and events, 
> but even then 360 MB seems ludicrous. CUDA maintains a persistent pool of 
> streams that is not freed until cudaDeviceReset() is called. Maybe they 
> initialize this pool immediately on start-up of the context? AFAIK there is 
> no way to disable or modify this behavior.
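> 
> A quick way to sanity-check that guess (a rough sketch, untested; it assumes 
> torch is available since Hong's script already imports it, and that torch 
> streams/events map onto plain cudaStreamCreate/cudaEventCreate underneath) 
> would be to watch the device-memory delta from creating a batch of streams 
> and events on an otherwise idle context:
> 
> import torch
> import nvidia_smi
> 
> nvidia_smi.nvmlInit()
> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
> 
> def used_gb():
>     return nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used/1e9
> 
> torch.cuda.init()                                   # create the CUDA context only
> print('after context creation      %.3f GB' % used_gb())
> 
> streams = [torch.cuda.Stream() for _ in range(64)]  # creates real CUDA streams
> events  = [torch.cuda.Event()  for _ in range(64)]
> for e, s in zip(events, streams):
>     e.record(s)                                     # events are created lazily; force them
> torch.cuda.synchronize()
> print('after 64 streams and events %.3f GB' % used_gb())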
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> 
>> On Jan 7, 2022, at 13:23, Zhang, Hong <hongzh...@anl.gov 
>> <mailto:hongzh...@anl.gov>> wrote:
>> 
>> Apart from the 1.2GB caused by importing torch, it seems that PETSc consumes 
>> 0.73GB of CUDA memory, and this overhead persists across the entire running 
>> time of an application. cupm_initialize contributes 0.36GB out of 0.73GB. It 
>> is still unclear what takes the remaining 0.37GB.
>> 
>> The torch issue is really a mystery. If I import torch only and do some 
>> tensor operations on the GPU, it consumes only 0.004GB of CUDA memory.
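>> 
>> A sketch of that torch-only check (the tensor size is arbitrary):
>> 
>> import torch
>> import nvidia_smi
>> 
>> nvidia_smi.nvmlInit()
>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>> 
>> a = torch.randn(1024, 1024, device='cuda')  # first GPU tensor forces context creation
>> b = a @ a                                   # a simple tensor operation on the GPU
>> torch.cuda.synchronize()
>> 
>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>> print('CUDA memory with torch only %.3fGB' % (info.used/1e9))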
>> 
>> 
>>> On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev 
>>> <petsc-dev@mcs.anl.gov <mailto:petsc-dev@mcs.anl.gov>> wrote:
>>> 
>>> 
>>> 1. Commenting out ierr = 
>>> __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr); in 
>>> device/impls/cupm/cupmcontext.hpp:L199
>>> 
>>> CUDA memory: 1.575GB
>>> CUDA memory without importing torch: 0.370GB
>>> 
>>> This has the same effect as commenting out L437-L440 in 
>>> interface/device.cxx.
>>> 
>>> 2. Commenting out these two:
>>> - src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
>>> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>> - src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
>>> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>> 
>>> CUDA memory: 1.936GB
>>> CUDA memory without importing torch: 0.730GB
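>>> 
>>> (That is, commenting out __initialize alone saves 1.936 - 1.575 = 0.730 - 
>>> 0.370 ≈ 0.36GB, while commenting out initialize()/configure() in 
>>> cupmdevice.cxx leaves the numbers at the unmodified baseline of 
>>> 1.936GB/0.730GB.)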
>>> 
>>>> On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch <jacob....@gmail.com 
>>>> <mailto:jacob....@gmail.com>> wrote:
>>>> 
>>>>> They had no influence on the memory usage.
>>>> ???????????????????????????????????????????????????????????????????????
>>>> 
>>>> Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 
>>>> 360 in cupmdevice.cxx as well.
>>>> 
>>>> Best regards,
>>>> 
>>>> Jacob Faibussowitsch
>>>> (Jacob Fai - booss - oh - vitch)
>>>> 
>>>>> On Jan 7, 2022, at 12:18, Zhang, Hong <hongzh...@anl.gov 
>>>>> <mailto:hongzh...@anl.gov>> wrote:
>>>>> 
>>>>> I have tried all of these. They had no influence on the memory usage.
>>>>> 
>>>>>> On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch <jacob....@gmail.com 
>>>>>> <mailto:jacob....@gmail.com>> wrote:
>>>>>> 
>>>>>>> Initializing cublas and cusolver does not affect the memory usage. I 
>>>>>>> did the following to turn them off:
>>>>>> 
>>>>>> OK, next things to try out, in order:
>>>>>> 
>>>>>> 1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 
>>>>>> [PetscFunctionBegin;] 
>>>>>> Put a PetscFunctionReturn(0); right after this
>>>>>> 
>>>>>> 2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
>>>>>> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>>>>> Comment this out
>>>>>> 
>>>>>> 3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
>>>>>> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>>>>> Comment this out
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Jacob Faibussowitsch
>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>> 
>>>>>>> On Jan 7, 2022, at 12:02, Zhang, Hong <hongzh...@anl.gov 
>>>>>>> <mailto:hongzh...@anl.gov>> wrote:
>>>>>>> 
>>>>>>> Initializing cublas and cusolver does not affect the memory usage. I 
>>>>>>> did the following to turn them off:
>>>>>>> 
>>>>>>> diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp 
>>>>>>> b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>> index 51fed809e4d..9a5f068323a 100644
>>>>>>> --- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>> +++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>> @@ -199,7 +199,7 @@ inline PetscErrorCode 
>>>>>>> CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept
>>>>>>>  #if PetscDefined(USE_DEBUG)
>>>>>>>    dci->timerInUse = PETSC_FALSE;
>>>>>>>  #endif
>>>>>>> -  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>>> +  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>>>    PetscFunctionReturn(0);
>>>>>>>  }
>>>>>>> 
>>>>>>>> On Jan 7, 2022, at 10:53 AM, Barry Smith <bsm...@petsc.dev 
>>>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>   I don't think this is right. We want the device initialized by 
>>>>>>>> PETSc; we just don't want the cublas and cusolver stuff initialized, 
>>>>>>>> in order to see how much memory initializing the BLAS and solvers 
>>>>>>>> takes.
>>>>>>>> 
>>>>>>>>   So I think you need to comment out things in cupminterface.hpp like 
>>>>>>>> cublasCreate and cusolverDnCreate.
>>>>>>>> 
>>>>>>>>   Urgh, I hate C++ where huge chunks of real code are in header files.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch 
>>>>>>>>> <jacob....@gmail.com <mailto:jacob....@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hit send too early…
>>>>>>>>> 
>>>>>>>>> If you don’t want to comment things out, you can also run with the 
>>>>>>>>> "-device_enable lazy" option. Normally lazy is the default behavior, 
>>>>>>>>> but if -log_view or -log_summary is provided the default becomes 
>>>>>>>>> “-device_enable eager”. See 
>>>>>>>>> src/sys/objects/device/interface/device.cxx:398.
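>>>>>>>>> 
>>>>>>>>> For example (untested), Hong’s test script below could be rerun as
>>>>>>>>> 
>>>>>>>>> python3 test.py -log_view :0.txt -device_enable lazy
>>>>>>>>> 
>>>>>>>>> to check whether lazy initialization avoids the jump even when 
>>>>>>>>> -log_view is requested.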
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> 
>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>> 
>>>>>>>>>> On Jan 7, 2022, at 11:29, Jacob Faibussowitsch <jacob....@gmail.com 
>>>>>>>>>> <mailto:jacob....@gmail.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> You need to go into the PetscInitialize() routine, find where it 
>>>>>>>>>>> loads cublas and cusolver, comment out those lines, and then run 
>>>>>>>>>>> with -log_view.
>>>>>>>>>> 
>>>>>>>>>> Comment out
>>>>>>>>>> 
>>>>>>>>>> #if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || PetscDefined(HAVE_SYCL))
>>>>>>>>>>   ierr = PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>>> #endif
>>>>>>>>>> 
>>>>>>>>>> At src/sys/objects/pinit.c:956
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> 
>>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>>> 
>>>>>>>>>>> On Jan 7, 2022, at 11:24, Barry Smith <bsm...@petsc.dev 
>>>>>>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Without -log_view it does not load any cuBLAS/cuSOLVER immediately; 
>>>>>>>>>>> with -log_view it loads all that stuff at startup. You need to go 
>>>>>>>>>>> into the PetscInitialize() routine, find where it loads cublas and 
>>>>>>>>>>> cusolver, comment out those lines, and then run with -log_view.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev 
>>>>>>>>>>>> <petsc-dev@mcs.anl.gov <mailto:petsc-dev@mcs.anl.gov>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> When PETSc is initialized, it takes about 2GB of CUDA memory. This 
>>>>>>>>>>>> is way too much for doing nothing. A test script is attached to 
>>>>>>>>>>>> reproduce the issue. If I remove the first line "import torch", 
>>>>>>>>>>>> PETSc consumes about 0.73GB, which is still significant. Does 
>>>>>>>>>>>> anyone have any idea about this behavior?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Hong
>>>>>>>>>>>> 
>>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples
>>>>>>>>>>>>  (caidao22/update-examples)$ python3 test.py
>>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>>> CUDA memory after PETSc 0.004GB
>>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples
>>>>>>>>>>>>  (caidao22/update-examples)$ python3 test.py -log_view :0.txt
>>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>>> CUDA memory after PETSc 1.936GB
>>>>>>>>>>>> 
>>>>>>>>>>>> import torch
>>>>>>>>>>>> import sys
>>>>>>>>>>>> import os
>>>>>>>>>>>> 
>>>>>>>>>>>> import nvidia_smi
>>>>>>>>>>>> nvidia_smi.nvmlInit()
>>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>>> print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))
>>>>>>>>>>>> 
>>>>>>>>>>>> petsc4py_path = os.path.join(os.environ['PETSC_DIR'], os.environ['PETSC_ARCH'], 'lib')
>>>>>>>>>>>> sys.path.append(petsc4py_path)
>>>>>>>>>>>> import petsc4py
>>>>>>>>>>>> petsc4py.init(sys.argv)
>>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>>> print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
