Hi Hong,

have you tried running the code through gprof and looking at the output (e.g. with kcachegrind)?
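
For what it's worth, the mechanics would be roughly the following (only a sketch; it assumes the compiler in use supports -pg-style instrumentation, and the PETSc/CUDA include and library paths here are placeholders):

  mpicc -pg ex_simple_petsc.c -o ex_simple_petsc_prof -I$PETSC_DIR/include -I$PETSC_DIR/$PETSC_ARCH/include -L$PETSC_DIR/$PETSC_ARCH/lib -lpetsc -L$OLCF_CUDA_ROOT/lib64 -lcudart
  ./ex_simple_petsc_prof                        # writes gmon.out in the working directory
  gprof ./ex_simple_petsc_prof gmon.out > profile.txt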

(apologies if this has been suggested already)

Best regards,
Karli



On 2/12/20 7:29 PM, Zhang, Hong via petsc-dev wrote:


On Feb 12, 2020, at 5:11 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:


  ldd -o on the petsc program (static) and the non petsc program (static), what are the differences?

There is no difference in the outputs.


  nm -o both executables | grep cudaFree()

Non petsc program:

[hongzh@login3.summit tests]$ nm ex_simple | grep cudaFree
0000000010000ae0 t 00000017.plt_call.cudaFree@@libcudart.so.10.1
                  U cudaFree@@libcudart.so.10.1

Petsc program:

[hongzh@login3.summit tests]$ nm ex_simple_petsc | grep cudaFree
0000000010016550 t 00000017.plt_call.cudaFree@@libcudart.so.10.1
0000000010017010 t 00000017.plt_call.cudaFreeHost@@libcudart.so.10.1
00000000124c3f48 V _ZGVZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEEEPT_vE8resource
00000000124c3f50 V _ZGVZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEPT_vE8resource
0000000010726788 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEE11do_allocateEmm
00000000107267e8 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEE13do_deallocateENS_10device_ptrIvEEmm
0000000010726878 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEED0Ev
0000000010726848 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEED1Ev
0000000010729f78 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE11do_allocateEmm
000000001072a218 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE13do_deallocateES6_mm
000000001072a388 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED0Ev
000000001072a358 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED1Ev
0000000012122300 V _ZTIN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
0000000012122370 V _ZTIN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
0000000012122410 V _ZTSN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
00000000121225f0 V _ZTSN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
0000000012120630 V _ZTVN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
00000000121205b0 V _ZTVN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
00000000124c3f30 V _ZZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEEEPT_vE8resource
00000000124c3f20 V _ZZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEPT_vE8resource
                  U cudaFree@@libcudart.so.10.1
                  U cudaFreeHost@@libcudart.so.10.1

Hong






On Feb 12, 2020, at 1:51 PM, Munson, Todd via petsc-dev <petsc-dev@mcs.anl.gov> 
wrote:


There are some side effects when loading shared libraries, such as initialization of static variables, etc. Is something like that happening?

Another place to look is the initial runtime library that gets linked (libcrt0 maybe?). I think some MPI compilers insert their own version.

Todd.

On Feb 12, 2020, at 11:38 AM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> 
wrote:



On Feb 12, 2020, at 11:09 AM, Matthew Knepley <knep...@gmail.com> wrote:

On Wed, Feb 12, 2020 at 11:06 AM Zhang, Hong via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:
Sorry for the long post. Here are the replies I have received from OLCF so far. We still don't know how to solve the problem.

One interesting thing Tom noticed is that PetscInitialize() may have called cudaFree(0) 32 times, as nvprof shows, and those calls all run very fast. They may be triggered by other libraries such as cuBLAS. But when PETSc calls cudaFree() explicitly, it is always very slow.

It sounds really painful, but I would start removing lines from 
PetscInitialize() until it runs fast.

It may be more painful than it sounds. The problem is not really related to PetscInitialize(). In the following simple example, we do not call any PETSc function, yet if we link it against the PETSc shared library, cudaFree(0) becomes very slow. CUDA is a black box, so there is not much we can debug with this simple example.

bash-4.2$ cat ex_simple.c
#include <time.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double  *init,tmp[100] = {0};

  start = clock();
  cudaFree(0);  /* first CUDA call; forces creation of the CUDA context */
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);
  return 0;
}
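
For reference, the fast and slow variants come from the same source; only the link line differs (a sketch; the first compiler invocation is copied from the commands quoted later in this thread, while the PETSc library path via PETSC_DIR/PETSC_ARCH is an assumption):

  pgc++ -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple.c -o ex_simple
  pgc++ -L$PETSC_DIR/$PETSC_ARCH/lib -lpetsc -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple.c -o ex_simple_slow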



Thanks,

    Matt

Hong


On Wed Feb 12 09:51:33 2020, tpapathe wrote:

Something else I noticed from the nvprof output (see my previous post) is
that the runs with PETSc initialized have 33 calls to cudaFree, whereas the
non-PETSc versions only have the 1 call to cudaFree. I'm not sure what is
happening in the PETSc initialize/finalize, but it appears to be doing a
lot under the hood. You can also see there are many additional CUDA calls
that are not shown in the profiler output from the non-PETSc runs (e.g.,
additional cudaMalloc and cudaMemcpy calls, cudaDeviceSynchronize, etc.).
Which other systems have you tested this on? Which CUDA Toolkits and CUDA
drivers were installed on those systems? Please let me know if there is any
additional information you can share with me about this.

-Tom
On Wed Feb 12 09:25:23 2020, tpapathe wrote:

  Ok. Thanks for the additional info, Hong. I'll ask around to see if any
  local (PETSc or CUDA) experts have experienced this behavior. In the
  meantime, is this impacting your work or something you're just curious
  about? A 5-7 second initialization time is indeed unusual, but is it
  negligible relative to the overall walltime of your jobs, or is it
  somehow affecting your productivity?

  -Tom
  On Tue Feb 11 17:04:25 2020, hongzh...@anl.gov wrote:

    We know it happens with PETSc. But note that the slowdown occurs on the first CUDA function call. In the example I sent to you, if we simply link it against the PETSc shared library and don't call any PETSc function, the slowdown still happens on cudaFree(0). We have never seen this behavior on other GPU systems.

On Feb 11, 2020, at 3:31 PM, Thomas Papatheodore via RT <h...@nccs.gov> wrote:

Thanks for the update. I have now reproduced the behavior you described with
PETSc + CUDA using your example code:

[tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1
-a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc

==16991== NVPROF is profiling process 16991, command:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc

==16991== Profiling application:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc

free time =4.730000 malloc time =0.000000 copy time =0.000000

==16991== Profiling result:

Type Time(%) Time Calls Avg Min Max Name

GPU activities: 100.00% 9.3760us 6 1.5620us 1.3440us 1.7920us [CUDA memcpy
HtoD]

API calls: 99.78% 5.99333s 33 181.62ms 883ns 4.71976s cudaFree

0.11% 6.3603ms 379 16.781us 233ns 693.40us cuDeviceGetAttribute

0.07% 4.1453ms 4 1.0363ms 1.0186ms 1.0623ms cuDeviceTotalMem

0.02% 1.0046ms 4 251.15us 131.45us 449.32us cuDeviceGetName

0.01% 808.21us 16 50.513us 6.7080us 621.54us cudaMalloc

0.01% 452.06us 450 1.0040us 830ns 6.4430us cudaFuncSetAttribute

0.00% 104.89us 6 17.481us 13.419us 21.338us cudaMemcpy

0.00% 102.26us 15 6.8170us 6.1900us 10.072us cudaDeviceSynchronize

0.00% 93.635us 80 1.1700us 1.0190us 2.1990us cudaEventCreateWithFlags

0.00% 92.168us 83 1.1100us 951ns 2.3550us cudaEventDestroy

0.00% 52.277us 74 706ns 592ns 1.5640us cudaDeviceGetAttribute

0.00% 34.558us 3 11.519us 9.5410us 15.129us cudaStreamDestroy

0.00% 27.778us 3 9.2590us 4.9120us 17.632us cudaStreamCreateWithFlags

0.00% 11.955us 1 11.955us 11.955us 11.955us cudaSetDevice

0.00% 10.361us 7 1.4800us 809ns 3.6580us cudaGetDevice

0.00% 5.4310us 3 1.8100us 1.6420us 1.9980us cudaEventCreate

0.00% 3.8040us 6 634ns 391ns 1.5350us cuDeviceGetCount

0.00% 3.5350us 1 3.5350us 3.5350us 3.5350us cuDeviceGetPCIBusId

0.00% 3.2210us 3 1.0730us 949ns 1.1640us cuInit

0.00% 2.6780us 5 535ns 369ns 1.0210us cuDeviceGet

0.00% 2.5080us 1 2.5080us 2.5080us 2.5080us cudaSetDeviceFlags

0.00% 1.6800us 4 420ns 392ns 488ns cuDeviceGetUuid

0.00% 1.5720us 3 524ns 398ns 590ns cuDriverGetVersion



If I remove all mention of PETSc from the code, compile manually and run, I get
the expected behavior:

[tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ pgc++
-L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple.c -o ex_simple


[tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1
-a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple

==17248== NVPROF is profiling process 17248, command:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple

==17248== Profiling application:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple

free time =0.340000 malloc time =0.000000 copy time =0.000000

==17248== Profiling result:

Type Time(%) Time Calls Avg Min Max Name

GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy
HtoD]

API calls: 98.56% 231.76ms 1 231.76ms 231.76ms 231.76ms cudaFree

0.67% 1.5764ms 97 16.251us 234ns 652.65us cuDeviceGetAttribute

0.46% 1.0727ms 1 1.0727ms 1.0727ms 1.0727ms cuDeviceTotalMem

0.23% 537.38us 1 537.38us 537.38us 537.38us cudaMalloc

0.07% 172.80us 1 172.80us 172.80us 172.80us cuDeviceGetName

0.01% 21.648us 1 21.648us 21.648us 21.648us cudaMemcpy

0.00% 3.3470us 1 3.3470us 3.3470us 3.3470us cuDeviceGetPCIBusId

0.00% 2.5310us 3 843ns 464ns 1.3700us cuDeviceGetCount

0.00% 1.7260us 2 863ns 490ns 1.2360us cuDeviceGet

0.00% 377ns 1 377ns 377ns 377ns cuDeviceGetUuid



I also get the expected behavior if I add an MPI_Init and MPI_Finalize to the
code instead of PETSc initialization:

[tpapathe@login1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ mpicc
-L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple_mpi.c -o ex_simple_mpi


[tpapathe@batch1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1
-a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi

==35166== NVPROF is profiling process 35166, command:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi

==35166== Profiling application:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi

free time =0.340000 malloc time =0.000000 copy time =0.000000

==35166== Profiling result:

Type Time(%) Time Calls Avg Min Max Name

GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy
HtoD]

API calls: 98.57% 235.61ms 1 235.61ms 235.61ms 235.61ms cudaFree

0.66% 1.5802ms 97 16.290us 239ns 650.72us cuDeviceGetAttribute

0.45% 1.0825ms 1 1.0825ms 1.0825ms 1.0825ms cuDeviceTotalMem

0.23% 542.73us 1 542.73us 542.73us 542.73us cudaMalloc

0.07% 174.77us 1 174.77us 174.77us 174.77us cuDeviceGetName

0.01% 26.431us 1 26.431us 26.431us 26.431us cudaMemcpy

0.00% 4.0330us 1 4.0330us 4.0330us 4.0330us cuDeviceGetPCIBusId

0.00% 2.8560us 3 952ns 528ns 1.6150us cuDeviceGetCount

0.00% 1.6190us 2 809ns 576ns 1.0430us cuDeviceGet

0.00% 341ns 1 341ns 341ns 341ns cuDeviceGetUuid


So this appears to be something specific happening within PETSc itself - not
necessarily an OLCF issue. I would suggest asking this question within the
PETSc community to understand what's happening. Please let me know if you have
any additional questions.

-Tom

On Feb 10, 2020, at 11:14 AM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:


gprof or some similar tool?


On Feb 10, 2020, at 11:18 AM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> 
wrote:

-cuda_initialize 0 does not make any difference. Actually this issue has 
nothing to do with PetscInitialize(). I tried to call cudaFree(0) before 
PetscInitialize(), and it still took 7.5 seconds.

Hong

On Feb 10, 2020, at 10:44 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:

As I mentioned, have you tried -cuda_initialize 0? Also, PetscCUDAInitialize contains
ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);
ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);
Have you tried commenting these out and testing again?
--Junchao Zhang


On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:


On Feb 8, 2020, at 5:03 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:
I did some further investigation. The overhead persists for both the PETSc shared library and the static library. The previous example does not call any PETSc function, yet the first CUDA call becomes very slow once the executable is linked against the PETSc shared object. This indicates that the slowdown occurs when the symbol (cudaFree) is resolved by searching through the PETSc shared object, but not when the symbol is found directly in the CUDA runtime library.

So the issue has nothing to do with the dynamic linker. The following example can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 seconds).
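
One thing that could be tried to probe the symbol-resolution path directly is the glibc dynamic loader's own tracing (a sketch; it assumes the loader on Summit honors LD_DEBUG/LD_BIND_NOW and uses the executable name from the example below):

  LD_DEBUG=statistics ./ex_simple_petsc            # relocation/lookup statistics, printed to stderr
  LD_DEBUG=bindings ./ex_simple_petsc 2> bindings.log
  grep cudaFree bindings.log                       # shows which object cudaFree is bound from
  LD_BIND_NOW=1 ./ex_simple_petsc                  # force all symbol resolution at load time rather than at first call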

1) This should go to OLCF admin as Jeff suggests

I had sent this to OLCF admin before the discussion was started here. Thomas 
Papatheodore has followed up. I am trying to help him reproduce the problem on 
summit.


2) Just to make sure I understand, a static executable with this code is still 
slow on the cudaFree(), since CUDA is a shared library by default.

I prepared the code as a minimal example to reproduce the problem. It would be fair to say that any code using PETSc (with CUDA enabled, built statically or dynamically) on Summit suffers a 7.5-second overhead on the first CUDA function call (either in the user code or inside PETSc).

Thanks,
Hong


I think we should try:

a) Forcing a full static link, if possible

b) Asking OLCF about link resolution order

It sounds similar to something I have seen in the past, where link resolution order can exponentially increase load time.
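
A rough way to gauge how much extra symbol-lookup work libpetsc.so adds to the search is to count the dynamic symbols it exports versus those of the CUDA runtime (a sketch; the libpetsc.so path via PETSC_DIR/PETSC_ARCH is an assumption, and the libcudart path is taken from the ldd output below):

  nm -D --defined-only $PETSC_DIR/$PETSC_ARCH/lib/libpetsc.so | wc -l
  nm -D --defined-only /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 | wc -l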

Thanks,

   Matt

bash-4.2$ cat ex_simple_petsc.c
#include <time.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <petscmat.h>

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double  *init,tmp[100] = {0};
  PetscErrorCode ierr=0;

  ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
  start = clock();
  cudaFree(0);
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);
  ierr = PetscFinalize();
  return ierr;
}

Hong

On Feb 7, 2020, at 3:09 PM, Zhang, Hong <hongzh...@anl.gov> wrote:

Note that the overhead is triggered by the first call to a CUDA function. So it seems that the first CUDA call triggers loading of the PETSc shared object (when it is linked), which is slow on the Summit file system.

Hong

On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> 
wrote:

Linking any other shared library does not slow down the execution. The PETSc 
shared library is the only one causing trouble.

Here is the ldd output for the two different versions. For the first version, I removed -lpetsc and it ran very fast. The second (slow) version was linked against the PETSc shared object.

bash-4.2$ ldd ex_simple
      linux-vdso64.so.1 =>  (0x0000200000050000)
      liblapack.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
 (0x0000200000070000)
      libblas.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
 (0x00002000009b0000)
      libhdf5hl_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
 (0x0000200000e80000)
      libhdf5_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
 (0x0000200000ed0000)
      libhdf5_hl.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
 (0x0000200000f50000)
      libhdf5.so.103 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
 (0x0000200000fb0000)
      libX11.so.6 => /usr/lib64/libX11.so.6 (0x00002000015e0000)
      libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 
(0x0000200001770000)
      libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 
(0x0000200009b00000)
      libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 
(0x000020000d950000)
      libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 
(0x000020000d9f0000)
      libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 
(0x0000200012f50000)
      libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000020001dc40000)
      libdl.so.2 => /usr/lib64/libdl.so.2 (0x000020001ddd0000)
      libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000020001de00000)
      libmpiprofilesupport.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
 (0x000020001de40000)
      libmpi_ibm_usempi.so => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
 (0x000020001de70000)
      libmpi_ibm_mpifh.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
 (0x000020001dea0000)
      libmpi_ibm.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3
 (0x000020001df40000)
      libpgf90rtl.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so
 (0x000020001e0b0000)
      libpgf90.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so
 (0x000020001e0f0000)
      libpgf90_rpm1.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so
 (0x000020001e6a0000)
      libpgf902.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so
 (0x000020001e6d0000)
      libpgftnrtl.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so
 (0x000020001e700000)
      libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000020001e730000)
      libpgkomp.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so
 (0x000020001e760000)
      libomp.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so
 (0x000020001e790000)
      libomptarget.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so
 (0x000020001e880000)
      libpgmath.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so
 (0x000020001e8b0000)
      libpgc.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so
 (0x000020001e9d0000)
      librt.so.1 => /usr/lib64/librt.so.1 (0x000020001eb40000)
      libm.so.6 => /usr/lib64/libm.so.6 (0x000020001eb70000)
      libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x000020001ec60000)
      libc.so.6 => /usr/lib64/libc.so.6 (0x000020001eca0000)
      libz.so.1 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1
 (0x000020001ee90000)
      libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x000020001eef0000)
      /lib64/ld64.so.2 (0x0000200000000000)
      libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 
(0x000020001ef40000)
      libutil.so.1 => /usr/lib64/libutil.so.1 (0x0000200020e50000)
      libhwloc_ompi.so.15 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15
 (0x0000200020e80000)
      libevent-2.1.so.6 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6
 (0x0000200020ef0000)
      libevent_pthreads-2.1.so.6 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6
 (0x0000200020f70000)
      libopen-rte.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3
 (0x0000200020fa0000)
      libopen-pal.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3
 (0x00002000210b0000)
      libXau.so.6 => /usr/lib64/libXau.so.6 (0x00002000211a0000)


bash-4.2$ ldd ex_simple_slow
      linux-vdso64.so.1 =>  (0x0000200000050000)
      libpetsc.so.3.012 => 
/autofs/nccs-svm1_home1/hongzh/Projects/petsc/arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012
 (0x0000200000070000)
      liblapack.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
 (0x0000200002be0000)
      libblas.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
 (0x0000200003520000)
      libhdf5hl_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
 (0x00002000039f0000)
      libhdf5_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
 (0x0000200003a40000)
      libhdf5_hl.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
 (0x0000200003ac0000)
      libhdf5.so.103 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
 (0x0000200003b20000)
      libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000200004150000)
      libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 
(0x00002000042e0000)
      libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 
(0x000020000c670000)
      libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 
(0x00002000104c0000)
      libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 
(0x0000200010560000)
      libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 
(0x0000200015ac0000)
      libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002000207b0000)
      libdl.so.2 => /usr/lib64/libdl.so.2 (0x0000200020940000)
      libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x0000200020970000)
      libmpiprofilesupport.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
 (0x00002000209b0000)
      libmpi_ibm_usempi.so => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
 (0x00002000209e0000)
      libmpi_ibm_mpifh.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
 (0x0000200020a10000)
      libmpi_ibm.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3
 (0x0000200020ab0000)
      libpgf90rtl.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so
 (0x0000200020c20000)
      libpgf90.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so
 (0x0000200020c60000)
      libpgf90_rpm1.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so
 (0x0000200021210000)
      libpgf902.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so
 (0x0000200021240000)
      libpgftnrtl.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so
 (0x0000200021270000)
      libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00002000212a0000)
      libpgkomp.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so
 (0x00002000212d0000)
      libomp.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so
 (0x0000200021300000)
      libomptarget.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so
 (0x00002000213f0000)
      libpgmath.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so
 (0x0000200021420000)
      libpgc.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so
 (0x0000200021540000)
      librt.so.1 => /usr/lib64/librt.so.1 (0x00002000216b0000)
      libm.so.6 => /usr/lib64/libm.so.6 (0x00002000216e0000)
      libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00002000217d0000)
      libc.so.6 => /usr/lib64/libc.so.6 (0x0000200021810000)
      libz.so.1 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1
 (0x0000200021a10000)
      libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000200021a60000)
      /lib64/ld64.so.2 (0x0000200000000000)
      libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 
(0x0000200021ab0000)
      libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002000239c0000)
      libhwloc_ompi.so.15 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15
 (0x00002000239f0000)
      libevent-2.1.so.6 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6
 (0x0000200023a60000)
      libevent_pthreads-2.1.so.6 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6
 (0x0000200023ae0000)
      libopen-rte.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3
 (0x0000200023b10000)
      libopen-pal.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3
 (0x0000200023c20000)
      libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000200023d10000)


On Feb 7, 2020, at 2:31 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:


ldd -o on the executable of both linkings of your code.

My guess is that without PETSc it is linking the static versions of the needed libraries, and with PETSc the shared ones. And, in typical fashion, the shared libraries are off on some super slow file system, so they take a long time to be loaded and linked in on demand.
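
One way to check that would be to time the loader's file accesses directly (a sketch; it assumes strace is available on the Summit compute nodes):

  strace -T -e trace=open,openat ./ex_simple_petsc 2>&1 | grep '\.so'

The -T flag prints the time spent in each system call, so slow opens of the shared libraries would show up immediately.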

Still a performance bug in Summit.

Barry


On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev <petsc-dev@mcs.anl.gov> 
wrote:

Hi all,

Previously I noticed that the first call to a CUDA function such as cudaMalloc or cudaFree in PETSc takes a long time (7.5 seconds) on Summit. I then prepared the attached simple example to help OLCF reproduce the problem. It turned out that the problem was caused by PETSc: the 7.5-second overhead can be observed only when the PETSc lib is linked. If I do not link PETSc, it runs normally. Does anyone have any idea why this happens and how to fix it?

Hong (Mr.)

bash-4.2$ cat ex_simple.c
#include <time.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double  *init,tmp[100] = {0};

  start = clock();
  cudaFree(0);
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);

  return 0;
}








--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/









