On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:

> 
>    This should be reported on gitlab, not in email.
> 
>    Anyways, my interpretation is that the machine runs low on swap space so 
> the OS is killing things. Once Satish and I sat down and checked the system 
> logs on one machine that had little swap and we saw system messages about low 
> swap at exactly the time the tests were killed. Satish is resistant to 
> increasing swap; I don't know why. Other times we see these kills and they may 
> not be due to swap, but then they are a mystery.

That was on bsd.

This machine has 8 GB of swap, which should be sufficient. And this issue [on this
machine] was triggered only by this MR, which was weird.

Satish


> 
>    You can rerun the particular job by clicking on the little circle after 
> the job name and see what happens the next time.
> 
>    Barry
> 
>    It may be that the -j and -l options for some systems need to be adjusted down 
> slightly, and this would prevent these kills. Satish, can that be done in the 
> examples/arch-ci* files with configure options, or in the runner files, or 
> in the yaml file?

configure has the options --with-make-np, --with-make-test-np, and --with-make-load.

Satish

> 
> 
> 
> > On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <[email protected]> wrote:
> > 
> > All failed tests just said "application called MPI_Abort" and had no stack 
> > trace. They are not CUDA tests. I updated SF to avoid CUDA-related 
> > initialization if not needed. Let's see the new test result.
> > not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
> > #   application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> > 
> > 
> > --Junchao Zhang
> > 
> > 
> > On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <[email protected]> wrote:
> > 
> >  "Failed" means nothing; send a link or cut and paste the error.
> > 
> >  It could be that since we have multiple separate tests running at the same 
> > time they overload the GPU or cause some inconsistent behavior that doesn't 
> > appear every time the tests are run.
> > 
> >    Barry
> > 
> > Maybe we need to serialize all the tests that use the GPUs. We just 
> > trust GNU make for the parallelism; maybe you could somehow add dependencies 
> > to get GNU make to achieve this?
> > 
> > 
> > 
> > 
> > > On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <[email protected]> wrote:
> > > 
> > > On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <[email protected]> 
> > > wrote:
> > > 
> > > 
> > > > On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <[email protected]> wrote:
> > > > 
> > > > I saw your update. In PetscCUDAInitialize we have
> > > > 
> > > >     
> > > > 
> > > > 
> > > > 
> > > >       /* First get the device count */
> > > > 
> > > >       err   = cudaGetDeviceCount(&devCount);
> > > > 
> > > > 
> > > > 
> > > > 
> > > >       /* next determine the rank and then set the device via a mod */
> > > > 
> > > >       ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> > > > 
> > > >       device = rank % devCount;
> > > > 
> > > >     }
> > > > 
> > > >     err = cudaSetDevice(device);
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > If we rely on the first CUDA call to do the initialization, how could CUDA 
> > > > > know about this MPI information?
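
For reference, the rank-to-device mapping in the quoted snippet can be sketched in plain C, with no MPI or CUDA calls. This is only an illustration of the `rank % devCount` idea: `pick_device` is a hypothetical helper, not a PETSc function, and the guard against a zero device count is an added assumption rather than something shown in the quoted code.

```c
#include <assert.h>

/* Hypothetical helper (not part of PETSc): map an MPI rank to a device
   index round-robin, mirroring `device = rank % devCount` from the
   quoted snippet. Returns -1 when there is no usable device or the rank
   is negative, so a caller can skip cudaSetDevice() instead of dividing
   by zero. */
int pick_device(int rank, int dev_count)
{
  if (dev_count <= 0 || rank < 0) return -1;
  return rank % dev_count;
}
```

With four devices, ranks 0 through 3 map to devices 0 through 3 and rank 4 wraps back to device 0. A bare first CUDA call has no rank to work with, so it cannot reproduce this spreading by itself.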
> > > 
> > >   It doesn't, so it does whatever it does (which may be dumb).
> > > 
> > >   Are you proposing something?
> > > 
> > > No. My test failed in CI with -cuda_initialize 0 on frog, but I could not 
> > > reproduce it. I'm investigating.
> > > 
> > >   Barry
> > > 
> > > > 
> > > > --Junchao Zhang
> > > > 
> > > > 
> > > > 
> > > > On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <[email protected]> 
> > > > wrote:
> > > > 
> > > >   Fixed the docs. Thanks for pointing out the lack of clarity
> > > > 
> > > > 
> > > > > On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev 
> > > > > <[email protected]> wrote:
> > > > > 
> > > > > Barry,
> > > > > 
> > > > > I saw you added these in init.c
> > > > > 
> > > > > 
> > > > > +  -cuda_initialize - do the initialization in PetscInitialize()
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > Notes:
> > > > > 
> > > > >    Initializing cuBLAS takes about 1/2 second there it is done by 
> > > > > default in PetscInitialize() before logging begins
> > > > > 
> > > > > 
> > > > > 
> > > > > But I did not get otherwise with -cuda_initialize 0, when will cuda 
> > > > > be initialized?
> > > > > --Junchao Zhang
> > > > 
> > 
> 
