On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:

>
>   This should be reported on gitlab, not in email.
>
>    Anyways, my interpretation is that the machine runs low on swap space so
> the OS is killing things. Once Satish and I sat down and checked the system
> logs on one machine that had little swap and we saw system messages about low
> swap at exactly the time the tests were killed. Satish is resistant to
> increase swap I don't know why. Other times we see these kills and they may
> not be due to swap but then they are a mystery.

That was on bsd. This machine has 8gb swap and should be sufficient. And this
issue [on this machine] was triggered only by this MR - which was weird..

Satish

>
>    You can rerun the particular job by clicking on the little circle after
> the job name and see what happens the next time.
>
>    Barry
>
>    It may be the -j and -l options for some systems need to be adjusted down
> slightly and this will prevent these. Satish can that be done in the
> examples/arch-ci* files with configure options, or in the runner files or
> in the yaml file?

configure has options --with-make-np --with-make-test-np --with-make-load

Satish

>
>
> > On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <[email protected]> wrote:
> >
> > All failed tests just said "application called MPI_Abort" and had no stack
> > trace. They are not cuda tests. I updated SF to avoid CUDA related
> > initialization if not needed. Let's see the new test result.
> > not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
> > #   application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> >
> >
> > --Junchao Zhang
> >
> >
> > On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <[email protected]> wrote:
> >
> >    Failed?  Means nothing, send link or cut and paste error
> >
> > It could be that since we have multiple separate tests running at the same
> > time they overload the GPU or cause some inconsistent behavior that doesn't
> > appear every time the tests are run.
> >
> >    Barry
> >
> > Maybe we need to sequentialize all the tests that use the GPUs, we just
> > trust gnumake for the parallelism maybe you could somehow add dependencies
> > to get gnu make to achieve this?
> >
> >
> > > On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <[email protected]> wrote:
> > >
> > > On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <[email protected]>
> > > wrote:
> > >
> > >
> > > > On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <[email protected]> wrote:
> > > >
> > > > I saw your update.
> > > > In PetscCUDAInitialize we have
> > > >
> > > >       /* First get the device count */
> > > >       err   = cudaGetDeviceCount(&devCount);
> > > >
> > > >       /* next determine the rank and then set the device via a mod */
> > > >       ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> > > >       device = rank % devCount;
> > > >     }
> > > >     err = cudaSetDevice(device);
> > > >
> > > > If we rely on the first CUDA call to do initialization, how could CUDA
> > > > know these MPI stuff.
> > >
> > >   It doesn't, so it does whatever it does (which may be dumb).
> > >
> > >   Are you proposing something?
> > >
> > > No. My test failed in CI with -cuda_initialize 0 on frog but I could not
> > > reproduce it. I'm doing investigation.
> > >
> > >   Barry
> > >
> > > >
> > > > --Junchao Zhang
> > > >
> > > >
> > > > On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <[email protected]>
> > > > wrote:
> > > >
> > > >   Fixed the docs. Thanks for pointing out the lack of clarity
> > > >
> > > >
> > > > > On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev
> > > > > <[email protected]> wrote:
> > > > >
> > > > > Barry,
> > > > >
> > > > > I saw you added these in init.c
> > > > >
> > > > >
> > > > > +  -cuda_initialize - do the initialization in PetscInitialize()
> > > > >
> > > > >
> > > > > Notes:
> > > > >
> > > > >    Initializing cuBLAS takes about 1/2 second there it is done by
> > > > > default in PetscInitialize() before logging begins
> > > > >
> > > > >
> > > > > But I did not get otherwise with -cuda_initialize 0, when will cuda
> > > > > be initialized?
> > > > > --Junchao Zhang
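
For reference, the rank-to-device mapping in the PetscCUDAInitialize excerpt quoted in this thread can be sketched on its own, without MPI or CUDA. This is only an illustration under stated assumptions: pick_device is a hypothetical helper, not a PETSc function, and returning -1 when no device is visible is an assumption, since the quoted excerpt does not show how that case is handled.

```c
/* Hypothetical helper, not part of PETSc.
   Maps an MPI rank to a device index the same way the quoted excerpt
   does: device = rank % devCount.
   Assumption: return -1 when no device is visible or the rank is
   negative; the quoted excerpt does not show this case. */
int pick_device(int rank, int devCount)
{
  if (devCount <= 0 || rank < 0) return -1;
  return rank % devCount;
}
```

With 4 visible devices, ranks 0 through 7 map to devices 0,1,2,3,0,1,2,3, so once there are more ranks than devices several ranks end up on the same device.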
