Doesn't Nvidia supply a "valgrind"-like tool that allows tracking memory usage? I'm pretty sure I've seen one; it should be able to show memory usage as a function of time so you can see where the memory is being allocated.

  Barry
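[Lacking such a tool, memory usage as a function of time can be approximated by polling NVML from a background thread. A minimal sketch, reusing the same nvidia_smi bindings as Hong's reproduction script at the bottom of this thread; the sample_gpu_memory helper, the 0.1 s interval, and the device index are illustrative, not from the thread:

import time
import threading
import nvidia_smi

def sample_gpu_memory(samples, stop_event, interval=0.1, device_index=0):
    # Append (elapsed seconds, used GB) tuples to `samples` until stopped.
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(device_index)
    t0 = time.time()
    while not stop_event.is_set():
        info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
        samples.append((time.time() - t0, info.used / 1e9))
        time.sleep(interval)

nvidia_smi.nvmlInit()
samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_gpu_memory, args=(samples, stop))
sampler.start()

# ... run the code under investigation here, e.g. petsc4py.init(sys.argv) ...

stop.set()
sampler.join()
for elapsed, used in samples:
    print('%6.2fs  %.3fGB' % (elapsed, used))

Correlating the timestamps with markers around the import and init calls would at least separate the torch-related jump from the PETSc one.]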
> On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>
>> it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists
>> across the entire running time of an application. cupm_initialize
>> contributes 0.36GB out of 0.73GB.
>
> If I had to guess, this may be the latent overhead of CUDA streams and
> events, but even then 360 MB seems ludicrous. CUDA maintains a persistent
> pool of streams that is not freed until cudaDeviceReset() is called. Maybe
> they initialize this pool immediately on start-up of the context? AFAIK
> there is no way to disable or modify this behavior.
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
>> On Jan 7, 2022, at 13:23, Zhang, Hong <hongzh...@anl.gov> wrote:
>>
>> Apart from the 1.2GB caused by importing torch, it seems that PETSc
>> consumes 0.73GB CUDA memory and this overhead persists across the entire
>> running time of an application. cupm_initialize contributes 0.36GB out of
>> 0.73GB. It is still unclear what takes the remaining 0.37GB.
>>
>> The torch issue is really a mystery. If I import torch only and do some
>> tensor operations on the GPU, it consumes only 0.004GB CUDA memory.
>>
>>> On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>
>>> 1. Commenting out ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>    in device/impls/cupm/cupmcontext.hpp:L199
>>>
>>>    CUDA memory: 1.575GB
>>>    CUDA memory without importing torch: 0.370GB
>>>
>>>    This has the same effect as commenting out L437-L440 in interface/device.cxx.
>>>
>>> 2. Commenting out these two:
>>>    src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>>    src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>>
>>>    CUDA memory: 1.936GB
>>>    CUDA memory without importing torch: 0.730GB
>>>
>>>> On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>
>>>>> They had no influence on the memory usage.
>>>>
>>>> ???
>>>>
>>>> Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line
>>>> 360 in cupmdevice.cxx as well.
>>>>
>>>> Best regards,
>>>>
>>>> Jacob Faibussowitsch
>>>> (Jacob Fai - booss - oh - vitch)
>>>>
>>>>> On Jan 7, 2022, at 12:18, Zhang, Hong <hongzh...@anl.gov> wrote:
>>>>>
>>>>> I have tried all of these. They had no influence on the memory usage.
>>>>>
>>>>>> On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>>>
>>>>>>> Initializing cublas and cusolver does not affect the memory usage.
>>>>>>> I did the following to turn them off:
>>>>>>
>>>>>> OK, next things to try out, in order:
>>>>>>
>>>>>> 1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;]
>>>>>>    Put a PetscFunctionReturn(0); right after this.
>>>>>>
>>>>>> 2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
>>>>>>    Comment this out.
>>>>>>
>>>>>> 3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>>>>>>    Comment this out.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Jacob Faibussowitsch
>>>>>> (Jacob Fai - booss - oh - vitch)
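[One way to probe the torch mystery Hong describes above: NVML reports driver-level usage, which includes the CUDA context itself, while torch's caching allocator only counts tensor allocations. A minimal sketch comparing the two numbers; it is a hypothetical illustration, not from the thread, and assumes torch plus the nvidia_smi bindings used elsewhere in this discussion:

import torch
import nvidia_smi

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

# Force creation of the CUDA context with a small tensor operation.
x = torch.ones(1024, 1024, device='cuda')
y = x @ x

info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('allocator-visible: %.3fGB' % (torch.cuda.memory_allocated() / 1e9))
print('driver-reported:   %.3fGB' % (info.used / 1e9))

If the driver-reported number sits far above the allocator-visible one, the gap is context/runtime overhead of the kind Jacob suspects, not live tensors.]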
>>>>>>> On Jan 7, 2022, at 12:02, Zhang, Hong <hongzh...@anl.gov> wrote:
>>>>>>>
>>>>>>> Initializing cublas and cusolver does not affect the memory usage.
>>>>>>> I did the following to turn them off:
>>>>>>>
>>>>>>> diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>> index 51fed809e4d..9a5f068323a 100644
>>>>>>> --- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>> +++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
>>>>>>> @@ -199,7 +199,7 @@ inline PetscErrorCode CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept
>>>>>>>  #if PetscDefined(USE_DEBUG)
>>>>>>>    dci->timerInUse = PETSC_FALSE;
>>>>>>>  #endif
>>>>>>> -  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>>> +  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>>>>>>>    PetscFunctionReturn(0);
>>>>>>>  }
>>>>>>>
>>>>>>>> On Jan 7, 2022, at 10:53 AM, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>
>>>>>>>> I don't think this is right. We want the device initialized by PETSc;
>>>>>>>> we just don't want the cublas and cusolver stuff initialized, in order
>>>>>>>> to see how much memory initializing the blas and solvers takes.
>>>>>>>>
>>>>>>>> So I think you need to comment out things in cupminterface.hpp like
>>>>>>>> cublasCreate and cusolverDnCreate.
>>>>>>>>
>>>>>>>> Urgh, I hate C++ where huge chunks of real code are in header files.
>>>>>>>>
>>>>>>>>> On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hit send too early…
>>>>>>>>>
>>>>>>>>> If you don't want to comment out, you can also run with the
>>>>>>>>> "-device_enable lazy" option. Normally this is the default behavior,
>>>>>>>>> but if -log_view or -log_summary is provided this defaults to
>>>>>>>>> "-device_enable eager". See src/sys/objects/device/interface/device.cxx:398.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>> (Jacob Fai - booss - oh - vitch)
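[For the petsc4py reproducer quoted at the end of this thread, the lazy option can be passed through petsc4py.init(). A minimal sketch; the option string is from Jacob's message above, while appending it after sys.argv is merely illustrative:

import sys
import petsc4py

# Keep device initialization lazy even when -log_view is also on the
# command line; PETSc options can be supplied as a list of strings.
petsc4py.init(sys.argv + ['-device_enable', 'lazy'])

Measuring NVML-reported memory before and after this call, as in Hong's script, should then show whether eager device initialization accounts for the jump.]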
>>>>>>>>>> On Jan 7, 2022, at 11:29, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> You need to go into the PetscInitialize() routine, find where it
>>>>>>>>>>> loads the cublas and cusolver, comment out those lines, and then
>>>>>>>>>>> run with -log_view
>>>>>>>>>>
>>>>>>>>>> Comment out
>>>>>>>>>>
>>>>>>>>>> #if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || PetscDefined(HAVE_SYCL))
>>>>>>>>>>   ierr = PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>>>>>> #endif
>>>>>>>>>>
>>>>>>>>>> at src/sys/objects/pinit.c:956.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>>
>>>>>>>>>> Jacob Faibussowitsch
>>>>>>>>>> (Jacob Fai - booss - oh - vitch)
>>>>>>>>>>
>>>>>>>>>>> On Jan 7, 2022, at 11:24, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Without -log_view it does not load any cuBLAS/cuSolver immediately;
>>>>>>>>>>> with -log_view it loads all that stuff at startup. You need to go
>>>>>>>>>>> into the PetscInitialize() routine, find where it loads the cublas
>>>>>>>>>>> and cusolver, comment out those lines, and then run with -log_view.
>>>>>>>>>>>
>>>>>>>>>>>> On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> When PETSc is initialized, it takes about 2GB of CUDA memory. This
>>>>>>>>>>>> is way too much for doing nothing. A test script is attached to
>>>>>>>>>>>> reproduce the issue. If I remove the first line, "import torch",
>>>>>>>>>>>> PETSc consumes about 0.73GB, which is still significant. Does
>>>>>>>>>>>> anyone have any idea about this behavior?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Hong
>>>>>>>>>>>>
>>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py
>>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>>> CUDA memory after PETSc 0.004GB
>>>>>>>>>>>> hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py -log_view :0.txt
>>>>>>>>>>>> CUDA memory before PETSc 0.000GB
>>>>>>>>>>>> CUDA memory after PETSc 1.936GB
>>>>>>>>>>>>
>>>>>>>>>>>> import torch
>>>>>>>>>>>> import sys
>>>>>>>>>>>> import os
>>>>>>>>>>>>
>>>>>>>>>>>> import nvidia_smi
>>>>>>>>>>>> nvidia_smi.nvmlInit()
>>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>>> print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))
>>>>>>>>>>>>
>>>>>>>>>>>> petsc4py_path = os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
>>>>>>>>>>>> sys.path.append(petsc4py_path)
>>>>>>>>>>>> import petsc4py
>>>>>>>>>>>> petsc4py.init(sys.argv)
>>>>>>>>>>>>
>>>>>>>>>>>> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
>>>>>>>>>>>> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
>>>>>>>>>>>> print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))