> it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists
> across the entire running time of an application. cupm_initialize contributes
> 0.36GB out of 0.73GB.
If I had to guess, this may be the latent overhead of CUDA streams and events, but even then 360 MB seems ludicrous. CUDA maintains a persistent pool of streams that is not freed until cudaDeviceReset() is called. Maybe they initialize this pool immediately on start-up of the context? AFAIK there is no way to disable or modify this behavior.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Jan 7, 2022, at 13:23, Zhang, Hong <hongzh...@anl.gov> wrote:
>
> Apart from the 1.2GB caused by importing torch, it seems that PETSc consumes
> 0.73GB CUDA memory and this overhead persists across the entire running time
> of an application. cupm_initialize contributes 0.36GB out of 0.73GB. It is
> still unclear what takes the remaining 0.37GB.
>
> The torch issue is really a mystery. If I import torch only and do some
> tensor operations on GPU, it consumes only 0.004GB CUDA memory.
>
>> On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev
>> <petsc-dev@mcs.anl.gov <mailto:petsc-dev@mcs.anl.gov>> wrote:
>>
>> 1. Commenting out ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>    in device/impls/cupm/cupmcontext.hpp:L199
>>
>> CUDA memory: 1.575GB
>> CUDA memory without importing torch: 0.370GB
>>
>> This has the same effect as commenting out L437-L440 in interface/device.cxx
>>
>> 2. Commenting out these two:
>>    . src/sys/objects/device/impls/cupm/cupmdevice.cxx:327
>>      [ierr = _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>    . src/sys/objects/device/impls/cupm/cupmdevice.cxx:326
>>      [ierr = _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>
>> CUDA memory: 1.936GB
>> CUDA memory without importing torch: 0.730GB
>>
>>> On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch <jacob....@gmail.com
>>> <mailto:jacob....@gmail.com>> wrote:
>>>
>>>> They had no influence on the memory usage.
>>> ???????????????????????????????????????????????????????????????????????
>>>
>>> Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line
>>> 360 in cupmdevice.cxx as well.
>>>
>>> Best regards,
>>>
>>> Jacob Faibussowitsch
>>> (Jacob Fai - booss - oh - vitch)
>>>
>>>> On Jan 7, 2022, at 12:18, Zhang, Hong <hongzh...@anl.gov
>>>> <mailto:hongzh...@anl.gov>> wrote:
>>>>
>>>> I have tried all of these. They had no influence on the memory usage.
>>>>
>>>>> On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch <jacob....@gmail.com
>>>>> <mailto:jacob....@gmail.com>> wrote:
>>>>>
>>>>>> Initializing cublas and cusolver does not affect the memory usage. I
>>>>>> did the following to turn them off:
>>>>>
>>>>> Ok, next things to try out, in order:
>>>>>
>>>>> 1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;]
>>>>>    Put a PetscFunctionReturn(0); right after this
>>>>>
>>>>> 2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327
>>>>>    [ierr = _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>>>>    Comment this out
>>>>>
>>>>> 3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326
>>>>>    [ierr = _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>>>>    Comment this out
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Jacob Faibussowitsch
>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>
>>>>>> On Jan 7, 2022, at 12:02, Zhang, Hong <hongzh...@anl.gov
>>>>>> <mailto:hongzh...@anl.gov>> wrote:
>>>>>>
>>>>>> Initializing cublas and cusolver does not affect the memory usage.
>>>>>> I did the following to turn them off:
>>>>>>
>>>>>> diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>> b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>> index 51fed809e4d..9a5f068323a 100644
>>>>>> --- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>> +++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>> @@ -199,7 +199,7 @@ inline PetscErrorCode CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept
>>>>>>  #if PetscDefined(USE_DEBUG)
>>>>>>    dci->timerInUse = PETSC_FALSE;
>>>>>>  #endif
>>>>>> -  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>> +  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>>    PetscFunctionReturn(0);
>>>>>>  }
>>>>>>
>>>>>>> On Jan 7, 2022, at 10:53 AM, Barry Smith <bsm...@petsc.dev
>>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>>
>>>>>>> I don't think this is right. We want the device initialized by PETSc;
>>>>>>> we just don't want the cublas and cusolver stuff initialized, in order
>>>>>>> to see how much memory initializing the blas and solvers takes.
>>>>>>>
>>>>>>> So I think you need to comment out things in cupminterface.hpp like
>>>>>>> cublasCreate and cusolverDnCreate.
>>>>>>>
>>>>>>> Urgh, I hate C++, where huge chunks of real code are in header files.
>>>>>>>
>>>>>>>> On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch <jacob....@gmail.com
>>>>>>>> <mailto:jacob....@gmail.com>> wrote:
>>>>>>>>
>>>>>>>> Hit send too early…
>>>>>>>>
>>>>>>>> If you don’t want to comment out, you can also run with the
>>>>>>>> "-device_enable lazy" option. Normally this is the default behavior,
>>>>>>>> but if -log_view or -log_summary is provided this defaults to
>>>>>>>> "-device_enable eager".
>>>>>>>> See src/sys/objects/device/interface/device.cxx:398
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Jacob Faibussowitsch
>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>
>>>>>>>>> On Jan 7, 2022, at 11:29, Jacob Faibussowitsch <jacob....@gmail.com
>>>>>>>>> <mailto:jacob....@gmail.com>> wrote:
>>>>>>>>>
>>>>>>>>>> You need to go into the PetscInitialize() routine, find where it
>>>>>>>>>> loads the cublas and cusolver, and comment out those lines, then
>>>>>>>>>> run with -log_view
>>>>>>>>>
>>>>>>>>> Comment out
>>>>>>>>>
>>>>>>>>> #if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || PetscDefined(HAVE_SYCL))
>>>>>>>>>   ierr = PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>> #endif
>>>>>>>>>
>>>>>>>>> at src/sys/objects/pinit.c:956
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>>
>>>>>>>>>> On Jan 7, 2022, at 11:24, Barry Smith <bsm...@petsc.dev
>>>>>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>>>>>
>>>>>>>>>> Without -log_view it does not load any cuBLAS/cuSolver immediately;
>>>>>>>>>> with -log_view it loads all that stuff at startup. You need to go
>>>>>>>>>> into the PetscInitialize() routine, find where it loads the cublas
>>>>>>>>>> and cusolver, and comment out those lines, then run with -log_view.
>>>>>>>>>>
>>>>>>>>>>> On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev
>>>>>>>>>>> <petsc-dev@mcs.anl.gov <mailto:petsc-dev@mcs.anl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> When PETSc is initialized, it takes about 2GB CUDA memory. This is
>>>>>>>>>>> way too much for doing nothing. A test script is attached to
>>>>>>>>>>> reproduce the issue. If I remove the first line "import torch",
>>>>>>>>>>> PETSc consumes about 0.73GB, which is still significant. Does
>>>>>>>>>>> anyone have any idea about this behavior?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Hong
>>>>>>>>>>>
>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py
>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>> CUDA memory after PETSc 0.004GB
>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py -log_view :0.txt
>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>> CUDA memory after PETSc 1.936GB
>>>>>>>>>>>
>>>>>>>>>>> import torch
>>>>>>>>>>> import sys
>>>>>>>>>>> import os
>>>>>>>>>>>
>>>>>>>>>>> import nvidia_smi
>>>>>>>>>>> nvidia_smi.nvmlInit()
>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>> print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))
>>>>>>>>>>>
>>>>>>>>>>> petsc4py_path = os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
>>>>>>>>>>> sys.path.append(petsc4py_path)
>>>>>>>>>>> import petsc4py
>>>>>>>>>>> petsc4py.init(sys.argv)
>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>> print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))
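[Editor's note] The before/after readings in Hong's script can be factored into a small reusable harness, so each suspected call (petsc4py.init, the torch import, a commented-out __initialize, etc.) gets its own attributed memory delta. This is only a sketch: the names `CudaMemDelta` and `nvml_probe` are made up here, and the NVML probe assumes the same `nvidia_smi` bindings as the script above. The probe is injected as a callable returning bytes in use, so the harness itself can be exercised without a GPU.

```python
class CudaMemDelta:
    """Context manager that attributes device-memory growth to one block of code."""

    def __init__(self, probe, label=""):
        self.probe = probe      # callable returning device memory in use, in bytes
        self.label = label
        self.delta_gb = None    # filled in on exit

    def __enter__(self):
        self._before = self.probe()
        return self

    def __exit__(self, *exc):
        # Growth across the block, converted to GB as in the test script.
        self.delta_gb = (self.probe() - self._before) / 1e9
        print('%s +%.3fGB' % (self.label, self.delta_gb))
        return False


def nvml_probe(index=0):
    """Probe backed by nvidia_smi, matching the test script (requires a GPU)."""
    import nvidia_smi
    nvidia_smi.nvmlInit()
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(index)
    return nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used
```

With a GPU present, wrapping the suspect call as `with CudaMemDelta(nvml_probe, 'petsc4py.init'): petsc4py.init(sys.argv)` would print the growth attributable to that single call, making it easier to compare runs with and without -log_view.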