Re: [petsc-dev] PetscCUDAInitialize

Balay, Satish via petsc-dev Thu, 19 Sep 2019 19:30:22 -0700

On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:

> 
> 
> > On Sep 19, 2019, at 9:11 PM, Balay, Satish <[email protected]> wrote:
> > 
> > On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
> > 
> >> 
> >>   This should be reported on gitlab, not in email.
> >> 
> >>   Anyways, my interpretation is that the machine runs low on swap space so 
> >> the OS is killing things. Once Satish and I sat down and checked the 
> >> system logs on one machine that had little swap and we saw system messages 
> >> about low swap at exactly the time the tests were killed. Satish is 
> >> resistant to increasing swap; I don't know why. Other times we see these 
> >> kills and they may not be due to swap but then they are a mystery.
> > 
> > That was on bsd.
> > 
> > This machine has 8gb swap and should be sufficient. And this issue [on this 
> > machine] was triggered
> > only by this MR - which was weird.
> 
>    Does it happen every time to the same examples?


I might have tried restarting a job once or twice - so yes. And 3 jobs [from a 
single pipeline] failed on this box.
> 
>    If you login and run that one test does it happen?

I've only tried running the failing tests - and they ran fine. Didn't try 'make 
alltests' at that time.

Satish

> 
>    If the MR is changing scatter code, could it have broken something?
> 
>    We need to know why this is happening. Otherwise our test system will 
> drive us nuts with errors we don't have a clue where they come from.
> 
>   
> >>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> 
>   So MPI thinks MPI_Abort is called with a return code of 1. PETSc calls 
> MPI_Abort in a truck load of places and usually with a return code of 1. So 
> the first thing that needs to be done is fix PETSc so each different call to 
> MPI_Abort has a unique return code. Then in theory at least we know where it 
> got aborted.
> 
> include/petscerror.h:#define CHKERRABORT(comm,ierr) do {if 
> (PetscUnlikely(ierr)) 
> {PetscError(PETSC_COMM_SELF,__LINE__,PETSC_FUNCTION_NAME,__FILE__,ierr,PETSC_ERROR_REPEAT,"
>  ");MPI_Abort(comm,ierr);}} while (0)
> include/petscerror.h:    or CHKERRABORT(comm,n) to have MPI_Abort() returned 
> immediately.
> src/contrib/fun3d/incomp/flow.c:    /*ierr = MPI_Abort(MPI_COMM_WORLD,1);*/
> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
> src/docs/tao_tex/manual/part1.tex:application called 
> MPI_Abort(MPI_COMM_WORLD, 73) - process 0
> src/docs/tex/manual/developers.tex:  \item 
> \lstinline{PetscMPIAbortErrorHandler()}, which calls \lstinline{MPI_Abort()} 
> after printing the error message; and
> src/snes/examples/tests/ex12f.F:        call 
> MPI_Abort(PETSC_COMM_WORLD,0,ierr)
> src/snes/examples/tutorials/ex30.c:  MPI_Abort(PETSC_COMM_SELF,1);
> src/sys/error/adebug.c:  MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/error/err.c:      If this is called from the main() routine we call 
> MPI_Abort() instead of
> src/sys/error/err.c:  if (ismain) MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
> src/sys/error/errstop.c:  MPI_Abort(PETSC_COMM_WORLD,n);
> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/signal.c:  if (ierr) MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/signal.c:  MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
> src/sys/fsrc/somefort.F:!     when MPI_Abort() is called directly by 
> CHKERRQ(ierr);
> src/sys/fsrc/somefort.F:      call MPI_Abort(comm,ierr,nierr)
> src/sys/ftn-custom/zutils.c:    MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/ftn-custom/zutils.c:      MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/logging/utils/stagelog.c:    MPI_Abort(MPI_COMM_WORLD, PETSC_ERR_SUP);
> src/sys/mpiuni/mpi.c:int MPI_Abort(MPI_Comm comm,int errorcode)
> src/sys/mpiuni/mpitime.c:    if (!QueryPerformanceCounter(&StartTime)) 
> MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/mpiuni/mpitime.c:    if (!QueryPerformanceFrequency(&PerfFreq)) 
> MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/mpiuni/mpitime.c:  if (!QueryPerformanceCounter(&CurTime)) 
> MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/init.c:  in the debugger hence we call abort() instead of 
> MPI_Abort().
> src/sys/objects/init.c:void Petsc_MPI_AbortOnError(MPI_Comm *comm,PetscMPIInt 
> *flag,...)
> src/sys/objects/init.c:  if (ierr) MPI_Abort(*comm,*flag); /* hopeless so get 
> out */
> src/sys/objects/init.c:      ierr = 
> MPI_Comm_create_errhandler(Petsc_MPI_AbortOnError,&err_handler);CHKERRQ(ierr);
> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
> src/ts/examples/tutorials/ex48.c:  if (dim < 2) {MPI_Abort(MPI_COMM_WORLD,1); 
> return;} /* this is needed so that the clang static analyzer does not 
> generate a warning about variables used by not set */
> src/vec/vec/examples/tests/ex32f.F:        call 
> MPI_Abort(MPI_COMM_WORLD,0,ierr)
> src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/utils/comb.c:    MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/utils/comb.c:      MPI_Abort(MPI_COMM_SELF,1);
> 
>   Junchao,
> 
>      Maybe you could fix this and make a MR? I don't know how to organize the 
> numbering. Should we have a central list of all numbers with macros in 
> petscerror.h like 
> 
> #define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10 
> 
> etc?
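
A minimal sketch of what such a central list could look like, written as plain C so it compiles without MPI. Only PETSC_MPI_ABORT_MPIU_MaxIndex_Local and its value 10 come from the proposal above; the other entries, their values, and the helper names abort_code_name() and abort_exit_status() are hypothetical. The 8-bit masking reflects how POSIX reports exit statuses; whether a given MPI launcher passes the MPI_Abort code through unchanged is an assumption that would need checking.

```c
/* Hypothetical central list of abort codes. Only the first entry
   is taken from the thread; the rest are illustrative. */
enum {
  PETSC_MPI_ABORT_MPIU_MaxIndex_Local = 10, /* from the proposal   */
  PETSC_MPI_ABORT_FP_TRAP             = 11, /* hypothetical        */
  PETSC_MPI_ABORT_SIGNAL_HANDLER      = 12, /* hypothetical        */
  PETSC_MPI_ABORT_VEC_REGISTER        = 13  /* hypothetical        */
};

/* Map a code back to a short label so a test harness can report
   where the abort came from. Unknown codes get "unknown". */
static const char *abort_code_name(int code)
{
  switch (code) {
  case PETSC_MPI_ABORT_MPIU_MaxIndex_Local: return "MPIU_MaxIndex_Local";
  case PETSC_MPI_ABORT_FP_TRAP:             return "fp trap";
  case PETSC_MPI_ABORT_SIGNAL_HANDLER:      return "signal handler";
  case PETSC_MPI_ABORT_VEC_REGISTER:        return "vec register";
  default:                                  return "unknown";
  }
}

/* POSIX exposes only the low 8 bits of an exit status, so codes
   should stay below 256 to remain distinguishable. */
static int abort_exit_status(int code)
{
  return code & 0xFF;
}
```

Each call site would then pass its own constant instead of a literal 1 or 0, and the label lookup could be used when parsing the "application called MPI_Abort(..., N)" line.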
> 
> 
> 
> 
> 
> 
> 
>    Barry
> 
> > 
> > Satish
> > 
> > 
> >> 
> >>   You can rerun the particular job by clicking on the little circle after 
> >> the job name and see what happens the next time.
> >> 
> >>   Barry
> >> 
> >>   It may be that the -j and -l options for some systems need to be adjusted 
> >> down slightly and this will prevent these. Satish, can that be done in the 
> >> examples/arch-ci* files with configure options, or in the runner files 
> >> or in the yaml file?
> > 
> > configure has options --with-make-np --with-make-test-np --with-make-load
> > 
> > Satish
> > 
> >> 
> >> 
> >> 
> >>> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <[email protected]> wrote:
> >>> 
> >>> All failed tests just said "application called MPI_Abort" and had no 
> >>> stack trace. They are not CUDA tests. I updated SF to avoid CUDA-related 
> >>> initialization if not needed. Let's see the new test result.
> >>> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
> >>> # application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> >>> 
> >>> 
> >>> --Junchao Zhang
> >>> 
> >>> 
> >>> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <[email protected]> 
> >>> wrote:
> >>> 
> >>> Failed?  Means nothing, send link or cut and paste error
> >>> 
> >>> It could be that since we have multiple separate tests running at the 
> >>> same time they overload the GPU or cause some inconsistent behavior that 
> >>> doesn't appear every time the tests are run.
> >>> 
> >>>   Barry
> >>> 
> >>> Maybe we need to serialize all the tests that use the GPUs. We just 
> >>> trust gnumake for the parallelism; maybe you could somehow add 
> >>> dependencies to get gnumake to achieve this?
> >>> 
> >>> 
> >>> 
> >>> 
> >>>> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <[email protected]> wrote:
> >>>> 
> >>>> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <[email protected]> 
> >>>> wrote:
> >>>> 
> >>>> 
> >>>>> On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <[email protected]> wrote:
> >>>>> 
> >>>>> I saw your update. In PetscCUDAInitialize we have
> >>>>> 
> >>>>>      /* First get the device count */
> >>>>>      err   = cudaGetDeviceCount(&devCount);
> >>>>> 
> >>>>>      /* next determine the rank and then set the device via a mod */
> >>>>>      ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> >>>>>      device = rank % devCount;
> >>>>>    }
> >>>>>    err = cudaSetDevice(device);
> >>>>> 
> >>>>> If we rely on the first CUDA call to do initialization, how could CUDA 
> >>>>> know about this MPI stuff?
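
The rank-to-device mapping in the quoted fragment can be sketched in plain C, without MPI or CUDA, to show what the explicit initialization provides that a lazy first CUDA call cannot. The helper name device_for_rank() is hypothetical, and the guard for a zero device count is an assumption; the quoted code does not show how that case is handled.

```c
/* Round-robin assignment of MPI ranks to GPUs, mirroring
   "device = rank % devCount" from the quoted fragment.
   Returns -1 when there is no usable device or the rank is
   invalid (this guard is an assumption, not from the source). */
static int device_for_rank(int rank, int dev_count)
{
  if (dev_count <= 0 || rank < 0) return -1;
  return rank % dev_count;
}
```

With 4 devices, ranks 0 through 7 map to devices 0, 1, 2, 3, 0, 1, 2, 3. Without an explicit cudaSetDevice(), each process simply uses the CUDA default device, so nothing spreads the ranks across the available GPUs.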
> >>>> 
> >>>>  It doesn't, so it does whatever it does (which may be dumb).
> >>>> 
> >>>>  Are you proposing something?
> >>>> 
> >>>> No. My test failed in CI with -cuda_initialize 0 on frog but I could not 
> >>>> reproduce it. I'm investigating. 
> >>>> 
> >>>>  Barry
> >>>> 
> >>>>> 
> >>>>> --Junchao Zhang
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <[email protected]> 
> >>>>> wrote:
> >>>>> 
> >>>>>  Fixed the docs. Thanks for pointing out the lack of clarity
> >>>>> 
> >>>>> 
> >>>>>> On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev 
> >>>>>> <[email protected]> wrote:
> >>>>>> 
> >>>>>> Barry,
> >>>>>> 
> >>>>>> I saw you added these in init.c
> >>>>>> 
> >>>>>> +  -cuda_initialize - do the initialization in PetscInitialize()
> >>>>>> 
> >>>>>> Notes:
> >>>>>>   Initializing cuBLAS takes about 1/2 second there it is done by 
> >>>>>> default in PetscInitialize() before logging begins
> >>>>>> 
> >>>>>> But I did not get the other case: with -cuda_initialize 0, when will 
> >>>>>> CUDA be initialized?
> >>>>>> --Junchao Zhang
> >>>>> 
> >>> 
> >> 
> > 
>
