This should be reported on GitLab, not in email.

   Anyway, my interpretation is that the machine runs low on swap space, so the 
OS is killing things. Satish and I once sat down and checked the system logs on 
one machine that had little swap, and we saw system messages about low swap at 
exactly the time the tests were killed. Satish is resistant to increasing swap; 
I don't know why. Other times we see these kills when they may not be due to 
swap, and then they are a mystery.

   You can rerun the particular job by clicking on the little circle after the 
job name and see what happens the next time.

   Barry

   It may be that the -j and -l options for some systems need to be adjusted 
down slightly, and this will prevent these kills. Satish, can that be done in 
the examples/arch-ci* files with configure options, or in the runner files, or 
in the yaml file?
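
   For reference, a minimal sketch of what lowering those two options looks 
like, assuming they are GNU make's -j (maximum parallel jobs) and -l (load 
average above which make stops starting new jobs). The helper name run_capped, 
the throwaway makefile, and the values 2 and 4.0 are hypothetical; the real 
test harness and the right limits for each CI machine are not shown in this 
thread.

```shell
# Hypothetical helper: run a throwaway makefile with capped parallelism.
# $1 = maximum number of parallel jobs (make -j)
# $2 = load-average ceiling (make -l); above it make starts no extra jobs
run_capped() {
  dir=$(mktemp -d) || return 1
  # Three independent dummy targets stand in for real test targets.
  printf 'all: t1 t2 t3\nt1 t2 t3:\n\t@echo ran $@\n' > "$dir/Makefile"
  make --no-print-directory -C "$dir" -j "$1" -l "$2" all
  status=$?
  rm -rf "$dir"
  return $status
}

# At most 2 jobs at once, and no new jobs while the load average exceeds 4.0.
run_capped 2 4.0
```

   With -l, make always keeps at least one job running, so a low ceiling slows 
the run down rather than stalling it.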



> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <[email protected]> wrote:
> 
> All failed tests just said "application called MPI_Abort" and had no stack 
> trace. They are not CUDA tests. I updated SF to avoid CUDA-related 
> initialization if not needed. Let's see the new test result.
> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
> #     application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> 
> 
> --Junchao Zhang
> 
> 
> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <[email protected]> wrote:
> 
>  "Failed" means nothing; send a link or cut and paste the error.
> 
>  It could be that since we have multiple separate tests running at the same 
> time they overload the GPU or cause some inconsistent behavior that doesn't 
> appear every time the tests are run.
> 
>    Barry
> 
> Maybe we need to serialize all the tests that use the GPUs. We just trust 
> GNU make for the parallelism; maybe you could somehow add dependencies to get 
> GNU make to achieve this?
> 
> 
> 
> 
> > On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <[email protected]> wrote:
> > 
> > On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <[email protected]> wrote:
> > 
> > 
> > > On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <[email protected]> wrote:
> > > 
> > > I saw your update. In PetscCUDAInitialize we have
> > > 
> > >     
> > > 
> > > 
> > > 
> > >       /* First get the device count */
> > > 
> > >       err   = cudaGetDeviceCount(&devCount);
> > > 
> > > 
> > > 
> > > 
> > >       /* next determine the rank and then set the device via a mod */
> > > 
> > >       ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> > > 
> > >       device = rank % devCount;
> > > 
> > >     }
> > > 
> > >     err = cudaSetDevice(device);
> > > 
> > > 
> > > 
> > > 
> > > 
> > > > If we rely on the first CUDA call to do initialization, how could CUDA 
> > > > know about this MPI stuff?
> > 
> >   It doesn't, so it does whatever it does (which may be dumb).
> > 
> >   Are you proposing something?
> > 
> > > No. My test failed in CI with -cuda_initialize 0 on frog but I could not 
> > > reproduce it. I'm investigating.
> > 
> >   Barry
> > 
> > > 
> > > --Junchao Zhang
> > > 
> > > 
> > > 
> > > On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <[email protected]> 
> > > wrote:
> > > 
> > > >   Fixed the docs. Thanks for pointing out the lack of clarity.
> > > 
> > > 
> > > > On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev 
> > > > <[email protected]> wrote:
> > > > 
> > > > Barry,
> > > > 
> > > > I saw you added these in init.c
> > > > 
> > > > 
> > > > +  -cuda_initialize - do the initialization in PetscInitialize()
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > Notes:
> > > > 
> > > > >    Initializing cuBLAS takes about 1/2 second, therefore it is done by 
> > > > > default in PetscInitialize() before logging begins
> > > > 
> > > > 
> > > > 
> > > > > But I did not get what happens otherwise: with -cuda_initialize 0, 
> > > > > when will CUDA be initialized?
> > > > --Junchao Zhang
> > > 
> 
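
   The rank-to-device mapping quoted above from PetscCUDAInitialize (device = 
rank % devCount) can be sketched in isolation as below. This is only an 
illustration, not the PETSc source: the function name device_for_rank and the 
guard for the no-device case are assumptions added for the sketch, and the MPI 
and CUDA calls are replaced by plain integer arguments so the logic can be 
checked without a GPU.

```c
/* Hypothetical stand-alone version of the round-robin mapping
   "device = rank % devCount" from the quoted snippet.

   rank     : the MPI rank of the calling process (0-based)
   devCount : the number of devices reported, e.g. by cudaGetDeviceCount()

   Returns the device index this rank would select, or -1 if no device is
   visible or the rank is negative (the guard is an addition for this
   sketch; it is not in the quoted code). */
static int device_for_rank(int rank, int devCount)
{
  if (devCount <= 0 || rank < 0) return -1;
  return rank % devCount;
}
```

   With two devices on a node, ranks 0, 1, 2, 3 map to devices 0, 1, 0, 1. If 
the device is instead left to whichever CUDA call happens first, no such 
rank-based spreading takes place, which is the point of the question in the 
quoted message.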
