> On Feb 13, 2020, at 7:39 AM, Smith, Barry F. <[email protected]> wrote:
>
>
>  How are the two being compiled and linked? The same way, one with the PETSc
> library in the path and the other without? Or does the PETSc one have lots of
> flags and stuff while the non-PETSc one is just simple by hand?

PETSc was built into a static lib. Then both examples were built with the static lib.

Hong

>
>   Barry
>
>
>> On Feb 12, 2020, at 7:29 PM, Zhang, Hong <[email protected]> wrote:
>>
>>
>>
>>> On Feb 12, 2020, at 5:11 PM, Smith, Barry F. <[email protected]> wrote:
>>>
>>>
>>>  ldd -o on the petsc program (static) and the non petsc program (static),
>>> what are the differences?
>>
>> There is no difference in the outputs.
>>
>>>
>>>  nm -o both executables | grep cudaFree()
>>
>> Non petsc program:
>>
>> [[email protected] tests]$ nm ex_simple | grep cudaFree
>> 0000000010000ae0 t 00000017.plt_call.cudaFree@@libcudart.so.10.1
>>                  U cudaFree@@libcudart.so.10.1
>>
>> Petsc program:
>>
>> [[email protected] tests]$ nm ex_simple_petsc | grep cudaFree
>> 0000000010016550 t 00000017.plt_call.cudaFree@@libcudart.so.10.1
>> 0000000010017010 t 00000017.plt_call.cudaFreeHost@@libcudart.so.10.1
>> 00000000124c3f48 V _ZGVZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEEEPT_vE8resource
>> 00000000124c3f50 V _ZGVZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEPT_vE8resource
>> 0000000010726788 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEE11do_allocateEmm
>> 00000000107267e8 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEE13do_deallocateENS_10device_ptrIvEEmm
>> 0000000010726878 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEED0Ev
>> 0000000010726848 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEED1Ev
>> 0000000010729f78 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE11do_allocateEmm
>> 000000001072a218 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE13do_deallocateES6_mm
>> 000000001072a388 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED0Ev
>> 000000001072a358 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED1Ev
>> 0000000012122300 V _ZTIN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
>> 0000000012122370 V _ZTIN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
>> 0000000012122410 V _ZTSN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
>> 00000000121225f0 V _ZTSN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
>> 0000000012120630 V _ZTVN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
>> 00000000121205b0 V _ZTVN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
>> 00000000124c3f30 V _ZZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEEEPT_vE8resource
>> 00000000124c3f20 V _ZZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEPT_vE8resource
>>                  U cudaFree@@libcudart.so.10.1
>>                  U cudaFreeHost@@libcudart.so.10.1
>>
>> Hong
>>
>>>
>>>
>>>> On Feb 12, 2020, at 1:51 PM, Munson, Todd via petsc-dev
>>>> <[email protected]> wrote:
>>>>
>>>>
>>>> There are some side effects when loading shared libraries, such as
>>>> initializations of static variables, etc. Is something like that happening?
>>>>
>>>> Another place is the initial runtime library that gets linked (libcrt0
>>>> maybe?). I think some MPI compilers insert their own version.
>>>>
>>>> Todd.
>>>>
>>>>> On Feb 12, 2020, at 11:38 AM, Zhang, Hong via petsc-dev
>>>>> <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Feb 12, 2020, at 11:09 AM, Matthew Knepley <[email protected]> wrote:
>>>>>>
>>>>>> On Wed, Feb 12, 2020 at 11:06 AM Zhang, Hong via petsc-dev
>>>>>> <[email protected]> wrote:
>>>>>> Sorry for the long post. Here are replies I have got from OLCF so far.
>>>>>> We still don’t know how to solve the problem.
>>>>>>
>>>>>> One interesting thing that Tom noticed is PetscInitialize() may have
>>>>>> called cudaFree(0) 32 times as NVPROF shows, and they all run very fast.
>>>>>> These calls may be triggered by some other libraries like cublas. But if
>>>>>> PETSc calls cudaFree() explicitly, it is always very slow.
>>>>>>
>>>>>> It sounds really painful, but I would start removing lines from
>>>>>> PetscInitialize() until it runs fast.
>>>>>
>>>>> It may be more painful than it sounds. The problem is not really related
>>>>> to PetscInitialize(). In the following simple example, we do not call any
>>>>> PETSc function. But if we link it to the PETSc shared library,
>>>>> cudaFree(0) would be very slow. CUDA is a black box. There is not much we
>>>>> can debug with this simple example.
>>>>>
>>>>> bash-4.2$ cat ex_simple.c
>>>>> #include <time.h>
>>>>> #include <cuda_runtime.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(int argc,char **args)
>>>>> {
>>>>>   clock_t start,s1,s2,s3;
>>>>>   double  cputime;
>>>>>   double  *init,tmp[100] = {0};
>>>>>
>>>>>   start = clock();
>>>>>   cudaFree(0);
>>>>>   s1 = clock();
>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>   s2 = clock();
>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>   s3 = clock();
>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",
>>>>>          ((double) (s1 - start)) / CLOCKS_PER_SEC,
>>>>>          ((double) (s2 - s1)) / CLOCKS_PER_SEC,
>>>>>          ((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>>   return 0;
>>>>> }
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>   Matt
>>>>>>
>>>>>> Hong
>>>>>>
>>>>>>
>>>>>> On Wed Feb 12 09:51:33 2020, tpapathe wrote:
>>>>>>
>>>>>> Something else I noticed from the nvprof output (see my previous post) is
>>>>>> that the runs with PETSc initialized have 33 calls to cudaFree, whereas the
>>>>>> non-PETSc versions only have the 1 call to cudaFree. I'm not sure what is
>>>>>> happening in the PETSc initialize/finalize, but it appears to be doing a
>>>>>> lot under the hood. You can also see there are many additional CUDA calls
>>>>>> that are not shown in the profiler output from the non-PETSc runs (e.g.,
>>>>>> additional cudaMalloc and cudaMemcpy calls, cudaDeviceSynchronize, etc.).
>>>>>> Which other systems have you tested this on? Which CUDA Toolkits and CUDA
>>>>>> drivers were installed on those systems? Please let me know if there is any
>>>>>> additional information you can share with me about this.
>>>>>>
>>>>>> -Tom
>>>>>> On Wed Feb 12 09:25:23 2020, tpapathe wrote:
>>>>>>
>>>>>> Ok. Thanks for the additional info, Hong. I'll ask around to see if any
>>>>>> local (PETSc or CUDA) experts have experienced this behavior. In the
>>>>>> meantime, is this impacting your work or something you're just curious
>>>>>> about? A 5-7 second initialization time is indeed unusual, but is it
>>>>>> negligible relative to the overall walltime of your jobs, or is it
>>>>>> somehow affecting your productivity?
>>>>>>
>>>>>> -Tom
>>>>>> On Tue Feb 11 17:04:25 2020, [email protected] wrote:
>>>>>>
>>>>>> We know it happens with PETSc. But note that the slowdown occurs on the
>>>>>> first CUDA function call. In the example I sent to you, if we simply
>>>>>> link it to the PETSc shared library and don’t call any PETSc function,
>>>>>> the slowdown still happens on cudaFree(0). We have never seen this
>>>>>> behavior on other GPU systems.
>>>>>>
>>>>>> On Feb 11, 2020, at 3:31 PM, Thomas Papatheodore via RT <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Thanks for the update. I have now reproduced the behavior you described with
>>>>>> PETSc + CUDA using your example code:
>>>>>>
>>>>>> [tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1 -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>>>
>>>>>> ==16991== NVPROF is profiling process 16991, command: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>>>
>>>>>> ==16991== Profiling application: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>>>
>>>>>> free time =4.730000 malloc time =0.000000 copy time =0.000000
>>>>>>
>>>>>> ==16991== Profiling result:
>>>>>>
>>>>>>             Type  Time(%)      Time  Calls       Avg       Min       Max  Name
>>>>>>  GPU activities:  100.00%  9.3760us      6  1.5620us  1.3440us  1.7920us  [CUDA memcpy HtoD]
>>>>>>       API calls:   99.78%  5.99333s     33  181.62ms     883ns  4.71976s  cudaFree
>>>>>>                     0.11%  6.3603ms    379  16.781us     233ns  693.40us  cuDeviceGetAttribute
>>>>>>                     0.07%  4.1453ms      4  1.0363ms  1.0186ms  1.0623ms  cuDeviceTotalMem
>>>>>>                     0.02%  1.0046ms      4  251.15us  131.45us  449.32us  cuDeviceGetName
>>>>>>                     0.01%  808.21us     16  50.513us  6.7080us  621.54us  cudaMalloc
>>>>>>                     0.01%  452.06us    450  1.0040us     830ns  6.4430us  cudaFuncSetAttribute
>>>>>>                     0.00%  104.89us      6  17.481us  13.419us  21.338us  cudaMemcpy
>>>>>>                     0.00%  102.26us     15  6.8170us  6.1900us  10.072us  cudaDeviceSynchronize
>>>>>>                     0.00%  93.635us     80  1.1700us  1.0190us  2.1990us  cudaEventCreateWithFlags
>>>>>>                     0.00%  92.168us     83  1.1100us     951ns  2.3550us  cudaEventDestroy
>>>>>>                     0.00%  52.277us     74     706ns     592ns  1.5640us  cudaDeviceGetAttribute
>>>>>>                     0.00%  34.558us      3  11.519us  9.5410us  15.129us  cudaStreamDestroy
>>>>>>                     0.00%  27.778us      3  9.2590us  4.9120us  17.632us  cudaStreamCreateWithFlags
>>>>>>                     0.00%  11.955us      1  11.955us  11.955us  11.955us  cudaSetDevice
>>>>>>                     0.00%  10.361us      7  1.4800us     809ns  3.6580us  cudaGetDevice
>>>>>>                     0.00%  5.4310us      3  1.8100us  1.6420us  1.9980us  cudaEventCreate
>>>>>>                     0.00%  3.8040us      6     634ns     391ns  1.5350us  cuDeviceGetCount
>>>>>>                     0.00%  3.5350us      1  3.5350us  3.5350us  3.5350us  cuDeviceGetPCIBusId
>>>>>>                     0.00%  3.2210us      3  1.0730us     949ns  1.1640us  cuInit
>>>>>>                     0.00%  2.6780us      5     535ns     369ns  1.0210us  cuDeviceGet
>>>>>>                     0.00%  2.5080us      1  2.5080us  2.5080us  2.5080us  cudaSetDeviceFlags
>>>>>>                     0.00%  1.6800us      4     420ns     392ns     488ns  cuDeviceGetUuid
>>>>>>                     0.00%  1.5720us      3     524ns     398ns     590ns  cuDriverGetVersion
>>>>>>
>>>>>>
>>>>>> If I remove all mention of PETSc from the code, compile manually and run, I get
>>>>>> the expected behavior:
>>>>>>
>>>>>> [tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ pgc++ -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple.c -o ex_simple
>>>>>>
>>>>>>
>>>>>> [tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1 -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>>>
>>>>>> ==17248== NVPROF is profiling process 17248, command: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>>>
>>>>>> ==17248== Profiling application: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>>>
>>>>>> free time =0.340000 malloc time =0.000000 copy time =0.000000
>>>>>>
>>>>>> ==17248== Profiling result:
>>>>>>
>>>>>>             Type  Time(%)      Time  Calls       Avg       Min       Max  Name
>>>>>>  GPU activities:  100.00%  1.7600us      1  1.7600us  1.7600us  1.7600us  [CUDA memcpy HtoD]
>>>>>>       API calls:   98.56%  231.76ms      1  231.76ms  231.76ms  231.76ms  cudaFree
>>>>>>                     0.67%  1.5764ms     97  16.251us     234ns  652.65us  cuDeviceGetAttribute
>>>>>>                     0.46%  1.0727ms      1  1.0727ms  1.0727ms  1.0727ms  cuDeviceTotalMem
>>>>>>                     0.23%  537.38us      1  537.38us  537.38us  537.38us  cudaMalloc
>>>>>>                     0.07%  172.80us      1  172.80us  172.80us  172.80us  cuDeviceGetName
>>>>>>                     0.01%  21.648us      1  21.648us  21.648us  21.648us  cudaMemcpy
>>>>>>                     0.00%  3.3470us      1  3.3470us  3.3470us  3.3470us  cuDeviceGetPCIBusId
>>>>>>                     0.00%  2.5310us      3     843ns     464ns  1.3700us  cuDeviceGetCount
>>>>>>                     0.00%  1.7260us      2     863ns     490ns  1.2360us  cuDeviceGet
>>>>>>                     0.00%     377ns      1     377ns     377ns     377ns  cuDeviceGetUuid
>>>>>>
>>>>>>
>>>>>> I also get the expected behavior if I add an MPI_Init and MPI_Finalize to the
>>>>>> code instead of PETSc initialization:
>>>>>>
>>>>>> [tpapathe@login1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ mpicc -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple_mpi.c -o ex_simple_mpi
>>>>>>
>>>>>>
>>>>>> [tpapathe@batch1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1 -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>>>
>>>>>> ==35166== NVPROF is profiling process 35166, command: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>>>
>>>>>> ==35166== Profiling application: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>>>
>>>>>> free time =0.340000 malloc time =0.000000 copy time =0.000000
>>>>>>
>>>>>> ==35166== Profiling result:
>>>>>>
>>>>>>             Type  Time(%)      Time  Calls       Avg       Min       Max  Name
>>>>>>  GPU activities:  100.00%  1.7600us      1  1.7600us  1.7600us  1.7600us  [CUDA memcpy HtoD]
>>>>>>       API calls:   98.57%  235.61ms      1  235.61ms  235.61ms  235.61ms  cudaFree
>>>>>>                     0.66%  1.5802ms     97  16.290us     239ns  650.72us  cuDeviceGetAttribute
>>>>>>                     0.45%  1.0825ms      1  1.0825ms  1.0825ms  1.0825ms  cuDeviceTotalMem
>>>>>>                     0.23%  542.73us      1  542.73us  542.73us  542.73us  cudaMalloc
>>>>>>                     0.07%  174.77us      1  174.77us  174.77us  174.77us  cuDeviceGetName
>>>>>>                     0.01%  26.431us      1  26.431us  26.431us  26.431us  cudaMemcpy
>>>>>>                     0.00%  4.0330us      1  4.0330us  4.0330us  4.0330us  cuDeviceGetPCIBusId
>>>>>>                     0.00%  2.8560us      3     952ns     528ns  1.6150us  cuDeviceGetCount
>>>>>>                     0.00%  1.6190us      2     809ns     576ns  1.0430us  cuDeviceGet
>>>>>>                     0.00%     341ns      1     341ns     341ns     341ns  cuDeviceGetUuid
>>>>>>
>>>>>>
>>>>>> So this appears to be something specific happening within PETSc itself - not
>>>>>> necessarily an OLCF issue. I would suggest asking this question within the
>>>>>> PETSc community to understand what's happening. Please let me know if you have
>>>>>> any additional questions.
>>>>>>
>>>>>> -Tom
>>>>>>
>>>>>>> On Feb 10, 2020, at 11:14 AM, Smith, Barry F. <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> gprof or some similar tool?
>>>>>>>
>>>>>>>
>>>>>>>> On Feb 10, 2020, at 11:18 AM, Zhang, Hong via petsc-dev
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> -cuda_initialize 0 does not make any difference. Actually this issue
>>>>>>>> has nothing to do with PetscInitialize(). I tried to call cudaFree(0)
>>>>>>>> before PetscInitialize(), and it still took 7.5 seconds.
>>>>>>>>
>>>>>>>> Hong
>>>>>>>>
>>>>>>>>> On Feb 10, 2020, at 10:44 AM, Zhang, Junchao <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> As I mentioned, have you tried -cuda_initialize 0? Also,
>>>>>>>>> PetscCUDAInitialize contains
>>>>>>>>>   ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);
>>>>>>>>>   ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);
>>>>>>>>> Have you tried commenting them out and testing again?
>>>>>>>>> --Junchao Zhang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Feb 8, 2020, at 5:03 PM, Matthew Knepley <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> I did some further investigation. The overhead persists for both the
>>>>>>>>>> PETSc shared library and the static library. The previous example
>>>>>>>>>> does not call any PETSc function, yet the first CUDA function
>>>>>>>>>> becomes very slow when it is linked to the petsc so. This
>>>>>>>>>> indicates that the slowdown occurs if the symbol (cudaFree) is
>>>>>>>>>> searched through the petsc so, but does not occur if the symbol is
>>>>>>>>>> found directly in the cuda runtime lib.
>>>>>>>>>>
>>>>>>>>>> So the issue has nothing to do with the dynamic linker. The
>>>>>>>>>> following example can be used to easily reproduce the problem
>>>>>>>>>> (cudaFree(0) always takes ~7.5 seconds).
>>>>>>>>>>
>>>>>>>>>> 1) This should go to OLCF admin as Jeff suggests
>>>>>>>>>
>>>>>>>>> I had sent this to OLCF admin before the discussion was started here.
>>>>>>>>> Thomas Papatheodore has followed up. I am trying to help him
>>>>>>>>> reproduce the problem on summit.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2) Just to make sure I understand, a static executable with this
>>>>>>>>>> code is still slow on the cudaFree(), since CUDA is a shared library
>>>>>>>>>> by default.
>>>>>>>>>
>>>>>>>>> I prepared the code as a minimal example to reproduce the problem. It
>>>>>>>>> would be fair to say any code using PETSc (with CUDA enabled, built
>>>>>>>>> statically or dynamically) on summit suffers a 7.5-second overhead on
>>>>>>>>> the first CUDA function call (either in the user code or inside
>>>>>>>>> PETSc).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Hong
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think we should try:
>>>>>>>>>>
>>>>>>>>>>   a) Forcing a full static link, if possible
>>>>>>>>>>
>>>>>>>>>>   b) Asking OLCF about link resolution order
>>>>>>>>>>
>>>>>>>>>> It sounds like a similar thing I have seen in the past where link
>>>>>>>>>> resolution order can exponentially increase load time.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>>   Matt
>>>>>>>>>>
>>>>>>>>>> bash-4.2$ cat ex_simple_petsc.c
>>>>>>>>>> #include <time.h>
>>>>>>>>>> #include <cuda_runtime.h>
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> #include <petscmat.h>
>>>>>>>>>>
>>>>>>>>>> int main(int argc,char **args)
>>>>>>>>>> {
>>>>>>>>>>   clock_t start,s1,s2,s3;
>>>>>>>>>>   double  cputime;
>>>>>>>>>>   double  *init,tmp[100] = {0};
>>>>>>>>>>   PetscErrorCode ierr=0;
>>>>>>>>>>
>>>>>>>>>>   ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
>>>>>>>>>>   start = clock();
>>>>>>>>>>   cudaFree(0);
>>>>>>>>>>   s1 = clock();
>>>>>>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>>>>>>   s2 = clock();
>>>>>>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>>>>>>   s3 = clock();
>>>>>>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",
>>>>>>>>>>          ((double) (s1 - start)) / CLOCKS_PER_SEC,
>>>>>>>>>>          ((double) (s2 - s1)) / CLOCKS_PER_SEC,
>>>>>>>>>>          ((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>>>>>>>   ierr = PetscFinalize();
>>>>>>>>>>   return ierr;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Hong
>>>>>>>>>>
>>>>>>>>>>> On Feb 7, 2020, at 3:09 PM, Zhang, Hong <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Note that the overhead was triggered by the first call to a CUDA
>>>>>>>>>>> function.
So it seems that the first CUDA function triggered >>>>>>>>>>> loading petsc so (if petsc so is linked), which is slow on the >>>>>>>>>>> summit file system. >>>>>>>>>>> >>>>>>>>>>> Hong >>>>>>>>>>> >>>>>>>>>>>> On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev >>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Linking any other shared library does not slow down the execution. >>>>>>>>>>>> The PETSc shared library is the only one causing trouble. >>>>>>>>>>>> >>>>>>>>>>>> Here are the ldd output for two different versions. For the first >>>>>>>>>>>> version, I removed -lpetsc and it ran very fast. The second (slow) >>>>>>>>>>>> version was linked to petsc so. >>>>>>>>>>>> >>>>>>>>>>>> bash-4.2$ ldd ex_simple >>>>>>>>>>>> linux-vdso64.so.1 => (0x0000200000050000) >>>>>>>>>>>> liblapack.so.0 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 >>>>>>>>>>>> (0x0000200000070000) >>>>>>>>>>>> libblas.so.0 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 >>>>>>>>>>>> (0x00002000009b0000) >>>>>>>>>>>> libhdf5hl_fortran.so.100 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 >>>>>>>>>>>> (0x0000200000e80000) >>>>>>>>>>>> libhdf5_fortran.so.100 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 >>>>>>>>>>>> (0x0000200000ed0000) >>>>>>>>>>>> libhdf5_hl.so.100 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 
>>>>>>>>>>>> (0x0000200000f50000) >>>>>>>>>>>> libhdf5.so.103 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 >>>>>>>>>>>> (0x0000200000fb0000) >>>>>>>>>>>> libX11.so.6 => /usr/lib64/libX11.so.6 (0x00002000015e0000) >>>>>>>>>>>> libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 >>>>>>>>>>>> (0x0000200001770000) >>>>>>>>>>>> libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 >>>>>>>>>>>> (0x0000200009b00000) >>>>>>>>>>>> libcudart.so.10.1 => >>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 >>>>>>>>>>>> (0x000020000d950000) >>>>>>>>>>>> libcusparse.so.10 => >>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 >>>>>>>>>>>> (0x000020000d9f0000) >>>>>>>>>>>> libcusolver.so.10 => >>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 >>>>>>>>>>>> (0x0000200012f50000) >>>>>>>>>>>> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000020001dc40000) >>>>>>>>>>>> libdl.so.2 => /usr/lib64/libdl.so.2 (0x000020001ddd0000) >>>>>>>>>>>> libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000020001de00000) >>>>>>>>>>>> libmpiprofilesupport.so.3 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 >>>>>>>>>>>> (0x000020001de40000) >>>>>>>>>>>> libmpi_ibm_usempi.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so >>>>>>>>>>>> (0x000020001de70000) >>>>>>>>>>>> libmpi_ibm_mpifh.so.3 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 >>>>>>>>>>>> (0x000020001dea0000) 
>>>>>>>>>>>> libmpi_ibm.so.3 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 >>>>>>>>>>>> (0x000020001df40000) >>>>>>>>>>>> libpgf90rtl.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so >>>>>>>>>>>> (0x000020001e0b0000) >>>>>>>>>>>> libpgf90.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so >>>>>>>>>>>> (0x000020001e0f0000) >>>>>>>>>>>> libpgf90_rpm1.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so >>>>>>>>>>>> (0x000020001e6a0000) >>>>>>>>>>>> libpgf902.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so >>>>>>>>>>>> (0x000020001e6d0000) >>>>>>>>>>>> libpgftnrtl.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so >>>>>>>>>>>> (0x000020001e700000) >>>>>>>>>>>> libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000020001e730000) >>>>>>>>>>>> libpgkomp.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so >>>>>>>>>>>> (0x000020001e760000) >>>>>>>>>>>> libomp.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so 
>>>>>>>>>>>> (0x000020001e790000) >>>>>>>>>>>> libomptarget.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so >>>>>>>>>>>> (0x000020001e880000) >>>>>>>>>>>> libpgmath.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so >>>>>>>>>>>> (0x000020001e8b0000) >>>>>>>>>>>> libpgc.so => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so >>>>>>>>>>>> (0x000020001e9d0000) >>>>>>>>>>>> librt.so.1 => /usr/lib64/librt.so.1 (0x000020001eb40000) >>>>>>>>>>>> libm.so.6 => /usr/lib64/libm.so.6 (0x000020001eb70000) >>>>>>>>>>>> libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x000020001ec60000) >>>>>>>>>>>> libc.so.6 => /usr/lib64/libc.so.6 (0x000020001eca0000) >>>>>>>>>>>> libz.so.1 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 >>>>>>>>>>>> (0x000020001ee90000) >>>>>>>>>>>> libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x000020001eef0000) >>>>>>>>>>>> /lib64/ld64.so.2 (0x0000200000000000) >>>>>>>>>>>> libcublasLt.so.10 => >>>>>>>>>>>> /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 >>>>>>>>>>>> (0x000020001ef40000) >>>>>>>>>>>> libutil.so.1 => /usr/lib64/libutil.so.1 (0x0000200020e50000) >>>>>>>>>>>> libhwloc_ompi.so.15 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 >>>>>>>>>>>> (0x0000200020e80000) >>>>>>>>>>>> libevent-2.1.so.6 => >>>>>>>>>>>> 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 >>>>>>>>>>>> (0x0000200020ef0000) >>>>>>>>>>>> libevent_pthreads-2.1.so.6 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 >>>>>>>>>>>> (0x0000200020f70000) >>>>>>>>>>>> libopen-rte.so.3 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 >>>>>>>>>>>> (0x0000200020fa0000) >>>>>>>>>>>> libopen-pal.so.3 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 >>>>>>>>>>>> (0x00002000210b0000) >>>>>>>>>>>> libXau.so.6 => /usr/lib64/libXau.so.6 (0x00002000211a0000) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> bash-4.2$ ldd ex_simple_slow >>>>>>>>>>>> linux-vdso64.so.1 => (0x0000200000050000) >>>>>>>>>>>> libpetsc.so.3.012 => >>>>>>>>>>>> /autofs/nccs-svm1_home1/hongzh/Projects/petsc/arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012 >>>>>>>>>>>> (0x0000200000070000) >>>>>>>>>>>> liblapack.so.0 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 >>>>>>>>>>>> (0x0000200002be0000) >>>>>>>>>>>> libblas.so.0 => >>>>>>>>>>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 >>>>>>>>>>>> (0x0000200003520000) >>>>>>>>>>>> libhdf5hl_fortran.so.100 => >>>>>>>>>>>> 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x00002000039f0000)
>>>>>>>>>>>> libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200003a40000)
>>>>>>>>>>>> libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200003ac0000)
>>>>>>>>>>>> libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200003b20000)
>>>>>>>>>>>> libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000200004150000)
>>>>>>>>>>>> libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x00002000042e0000)
>>>>>>>>>>>> libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x000020000c670000)
>>>>>>>>>>>> libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x00002000104c0000)
>>>>>>>>>>>> libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x0000200010560000)
>>>>>>>>>>>> libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200015ac0000)
>>>>>>>>>>>> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002000207b0000)
>>>>>>>>>>>> libdl.so.2 => /usr/lib64/libdl.so.2 (0x0000200020940000)
>>>>>>>>>>>> libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x0000200020970000)
>>>>>>>>>>>> libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x00002000209b0000)
>>>>>>>>>>>> libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x00002000209e0000)
>>>>>>>>>>>> libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x0000200020a10000)
>>>>>>>>>>>> libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x0000200020ab0000)
>>>>>>>>>>>> libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x0000200020c20000)
>>>>>>>>>>>> libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x0000200020c60000)
>>>>>>>>>>>> libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x0000200021210000)
>>>>>>>>>>>> libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x0000200021240000)
>>>>>>>>>>>> libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x0000200021270000)
>>>>>>>>>>>> libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00002000212a0000)
>>>>>>>>>>>> libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x00002000212d0000)
>>>>>>>>>>>> libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x0000200021300000)
>>>>>>>>>>>> libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x00002000213f0000)
>>>>>>>>>>>> libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x0000200021420000)
>>>>>>>>>>>> libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x0000200021540000)
>>>>>>>>>>>> librt.so.1 => /usr/lib64/librt.so.1 (0x00002000216b0000)
>>>>>>>>>>>> libm.so.6 => /usr/lib64/libm.so.6 (0x00002000216e0000)
>>>>>>>>>>>> libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00002000217d0000)
>>>>>>>>>>>> libc.so.6 => /usr/lib64/libc.so.6 (0x0000200021810000)
>>>>>>>>>>>> libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x0000200021a10000)
>>>>>>>>>>>> libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000200021a60000)
>>>>>>>>>>>> /lib64/ld64.so.2 (0x0000200000000000)
>>>>>>>>>>>> libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x0000200021ab0000)
>>>>>>>>>>>> libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002000239c0000)
>>>>>>>>>>>> libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x00002000239f0000)
>>>>>>>>>>>> libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200023a60000)
>>>>>>>>>>>> libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200023ae0000)
>>>>>>>>>>>> libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200023b10000)
>>>>>>>>>>>> libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x0000200023c20000)
>>>>>>>>>>>> libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000200023d10000)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Feb 7, 2020, at 2:31 PM, Smith, Barry F. <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ldd -o on the executable of both linkings of your code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My guess is that without PETSc it is linking the static version
>>>>>>>>>>>>> of the needed libraries and with PETSc the shared. And, in
>>>>>>>>>>>>> typical fashion, the shared libraries are off on some super slow
>>>>>>>>>>>>> file system so they take a long time to be loaded and linked in
>>>>>>>>>>>>> on demand.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Still a performance bug in Summit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Previously I noticed that the first call to a CUDA function
>>>>>>>>>>>>>> such as cudaMalloc or cudaFree in PETSc takes a long time (7.5
>>>>>>>>>>>>>> seconds) on Summit. I then prepared a simple example, attached,
>>>>>>>>>>>>>> to help OLCF reproduce the problem. It turned out that the
>>>>>>>>>>>>>> problem was caused by PETSc: the 7.5-second overhead is
>>>>>>>>>>>>>> observed only when the PETSc library is linked. If I do not
>>>>>>>>>>>>>> link PETSc, it runs normally. Does anyone have any idea why
>>>>>>>>>>>>>> this happens and how to fix it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hong (Mr.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> bash-4.2$ cat ex_simple.c
>>>>>>>>>>>>>> #include <time.h>
>>>>>>>>>>>>>> #include <cuda_runtime.h>
>>>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> int main(int argc,char **args)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   clock_t start,s1,s2,s3;
>>>>>>>>>>>>>>   double cputime;
>>>>>>>>>>>>>>   double *init,tmp[100] = {0};
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   start = clock();
>>>>>>>>>>>>>>   cudaFree(0);
>>>>>>>>>>>>>>   s1 = clock();
>>>>>>>>>>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>>>>>>>>>>   s2 = clock();
>>>>>>>>>>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>>>>>>>>>>   s3 = clock();
>>>>>>>>>>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   return 0;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>>>>> their experiments lead.
>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>
>>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
