On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
>
>
> > On Sep 19, 2019, at 9:11 PM, Balay, Satish <[email protected]> wrote:
> >
> > On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
> >
> >>
> >> This should be reported on gitlab, not in email.
> >>
> >> Anyways, my interpretation is that the machine runs low on swap space so
> >> the OS is killing things. Once, Satish and I sat down and checked the
> >> system logs on one machine that had little swap, and we saw system messages
> >> about low swap at exactly the time the tests were killed. Satish is
> >> resistant to increasing swap; I don't know why. Other times we see these
> >> kills and they may not be due to swap, but then they are a mystery.
> >
> > That was on bsd.
> >
> > This machine has 8gb swap and should be sufficient. And this issue [on this
> > machine] was triggered only by this MR - which was weird.
>
> Does it happen every time to the same examples?
I might have tried restarting a job once or twice - so yes. And 3 jobs [from a
single pipeline] failed on this box.
>
> If you login and run that one test does it happen?
I've only tried running the failing tests - and they ran fine. Didn't try 'make
alltests' at that time.
Satish
>
> If the MR is changing scatter code, could it have broken something?
>
> We need to know why this is happening. Otherwise our test system will
> drive us nuts with errors we don't have a clue where they come from.
>
>
> >>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>
> So MPI thinks MPI_Abort is called with a return code of 1. PETSc calls
> MPI_Abort in a truck load of places and usually with a return code of 1. So
> the first thing that needs to be done is fix PETSc so each different call to
> MPI_Abort has a unique return code. Then in theory at least we know where it
> got aborted.
>
> include/petscerror.h:#define CHKERRABORT(comm,ierr) do {if
> (PetscUnlikely(ierr))
> {PetscError(PETSC_COMM_SELF,__LINE__,PETSC_FUNCTION_NAME,__FILE__,ierr,PETSC_ERROR_REPEAT,"
> ");MPI_Abort(comm,ierr);}} while (0)
> include/petscerror.h: or CHKERRABORT(comm,n) to have MPI_Abort() returned
> immediately.
> src/contrib/fun3d/incomp/flow.c: /*ierr = MPI_Abort(MPI_COMM_WORLD,1);*/
> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
> src/docs/tao_tex/manual/part1.tex:application called
> MPI_Abort(MPI_COMM_WORLD, 73) - process 0
> src/docs/tex/manual/developers.tex: \item
> \lstinline{PetscMPIAbortErrorHandler()}, which calls \lstinline{MPI_Abort()}
> after printing the error message; and
> src/snes/examples/tests/ex12f.F: call
> MPI_Abort(PETSC_COMM_WORLD,0,ierr)
> src/snes/examples/tutorials/ex30.c: MPI_Abort(PETSC_COMM_SELF,1);
> src/sys/error/adebug.c: MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/error/err.c: If this is called from the main() routine we call
> MPI_Abort() instead of
> src/sys/error/err.c: if (ismain) MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
> src/sys/error/errstop.c: MPI_Abort(PETSC_COMM_WORLD,n);
> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/signal.c: if (ierr) MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/signal.c: MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
> src/sys/fsrc/somefort.F:! when MPI_Abort() is called directly by
> CHKERRQ(ierr);
> src/sys/fsrc/somefort.F: call MPI_Abort(comm,ierr,nierr)
> src/sys/ftn-custom/zutils.c: MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/ftn-custom/zutils.c: MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/logging/utils/stagelog.c: MPI_Abort(MPI_COMM_WORLD, PETSC_ERR_SUP);
> src/sys/mpiuni/mpi.c:int MPI_Abort(MPI_Comm comm,int errorcode)
> src/sys/mpiuni/mpitime.c: if (!QueryPerformanceCounter(&StartTime))
> MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/mpiuni/mpitime.c: if (!QueryPerformanceFrequency(&PerfFreq))
> MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/mpiuni/mpitime.c: if (!QueryPerformanceCounter(&CurTime))
> MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/init.c: in the debugger hence we call abort() instead of
> MPI_Abort().
> src/sys/objects/init.c:void Petsc_MPI_AbortOnError(MPI_Comm *comm,PetscMPIInt
> *flag,...)
> src/sys/objects/init.c: if (ierr) MPI_Abort(*comm,*flag); /* hopeless so get
> out */
> src/sys/objects/init.c: ierr =
> MPI_Comm_create_errhandler(Petsc_MPI_AbortOnError,&err_handler);CHKERRQ(ierr);
> src/sys/objects/pinit.c: MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c: MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c: MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c: MPI_Abort(MPI_COMM_WORLD,1);
> src/ts/examples/tutorials/ex48.c: if (dim < 2) {MPI_Abort(MPI_COMM_WORLD,1);
> return;} /* this is needed so that the clang static analyzer does not
> generate a warning about variables used by not set */
> src/vec/vec/examples/tests/ex32f.F: call
> MPI_Abort(MPI_COMM_WORLD,0,ierr)
> src/vec/vec/interface/dlregisvec.c: MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/interface/dlregisvec.c: MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/utils/comb.c: MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/utils/comb.c: MPI_Abort(MPI_COMM_SELF,1);
>
> Junchao,
>
> Maybe you could fix this and make a MR? I don't know how to organize the
> numbering. Should we have a central list of all numbers with macros in
> petscerror.h like
>
> #define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10
>
> etc?
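
A minimal sketch of the central-list idea above, in plain C with no MPI
dependency. Every name here (the DEMO_ABORT_* macros, DemoAbortSite,
DemoAbortCodesUnique, DemoAbortSelfCheck) is hypothetical and not part of
PETSc; only the value 10 for MaxIndex_Local comes from the message above. Each
abort site gets its own code, and a small table maps a code seen in a log line
such as "application called MPI_Abort(MPI_COMM_WORLD, 11)" back to the site
that raised it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical central list: one distinct code per MPI_Abort() call site.
   Names and values are illustrative assumptions, not PETSc definitions. */
#define DEMO_ABORT_GENERIC              1
#define DEMO_ABORT_MPIU_MAXINDEX_LOCAL 10  /* value from the example in the thread */
#define DEMO_ABORT_FP_TRAP             11
#define DEMO_ABORT_SIGNAL              12

typedef struct {
  int         code;
  const char *site;
} DemoAbortEntry;

static const DemoAbortEntry demo_abort_table[] = {
  {DEMO_ABORT_GENERIC,             "generic"},
  {DEMO_ABORT_MPIU_MAXINDEX_LOCAL, "MPIU_MaxIndex_Local"},
  {DEMO_ABORT_FP_TRAP,             "floating point trap handler"},
  {DEMO_ABORT_SIGNAL,              "signal handler"}
};

/* Map a code printed by the MPI launcher back to the site that used it. */
static const char *DemoAbortSite(int code)
{
  size_t n = sizeof(demo_abort_table) / sizeof(demo_abort_table[0]);
  for (size_t i = 0; i < n; i++) {
    if (demo_abort_table[i].code == code) return demo_abort_table[i].site;
  }
  return "unknown";
}

/* Return 1 if no two entries share a code, 0 otherwise. */
static int DemoAbortCodesUnique(void)
{
  size_t n = sizeof(demo_abort_table) / sizeof(demo_abort_table[0]);
  for (size_t i = 0; i < n; i++) {
    for (size_t j = i + 1; j < n; j++) {
      if (demo_abort_table[i].code == demo_abort_table[j].code) return 0;
    }
  }
  return 1;
}

/* Could be called once at startup in a debug build. */
static void DemoAbortSelfCheck(void)
{
  assert(DemoAbortCodesUnique());
}
```

A call site would then read MPI_Abort(comm,DEMO_ABORT_FP_TRAP) rather than
MPI_Abort(comm,0). If the code is also meant to survive as a process exit
status, keeping it below 256 is probably wise, since POSIX exit statuses carry
only 8 bits.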
>
> Barry
>
> >
> > Satish
> >
> >
> >>
> >> You can rerun the particular job by clicking on the little circle after
> >> the job name and see what happens the next time.
> >>
> >> Barry
> >>
> >> It may be that the -j and -l options for some systems need to be adjusted
> >> down slightly, and this will prevent these kills. Satish, can that be done
> >> in the examples/arch-ci* files with configure options, or in the runner
> >> files, or in the yaml file?
> >
> > configure has options --with-make-np --with-make-test-np --with-make-load
> >
> > Satish
> >
> >>
> >>
> >>
> >>> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <[email protected]> wrote:
> >>>
> >>> All failed tests just said "application called MPI_Abort" and had no
> >>> stack trace. They are not cuda tests. I updated SF to avoid CUDA related
> >>> initialization if not needed. Let's see the new test result.
> >>> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
> >>> # application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> >>>
> >>>
> >>> --Junchao Zhang
> >>>
> >>>
> >>> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <[email protected]>
> >>> wrote:
> >>>
> >>> Failed? Means nothing, send link or cut and paste error
> >>>
> >>> It could be that since we have multiple separate tests running at the
> >>> same time they overload the GPU or cause some inconsistent behavior that
> >>> doesn't appear every time the tests are run.
> >>>
> >>> Barry
> >>>
> >>> Maybe we need to sequentialize all the tests that use the GPUs. We just
> >>> trust gnumake for the parallelism; maybe you could somehow add
> >>> dependencies to get gnumake to achieve this?
> >>>
> >>>
> >>>
> >>>
> >>>> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <[email protected]> wrote:
> >>>>
> >>>> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <[email protected]>
> >>>> wrote:
> >>>>
> >>>>
> >>>>> On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <[email protected]> wrote:
> >>>>>
> >>>>> I saw your update. In PetscCUDAInitialize we have
> >>>>>
> >>>>> /* First get the device count */
> >>>>> err = cudaGetDeviceCount(&devCount);
> >>>>>
> >>>>> /* next determine the rank and then set the device via a mod */
> >>>>> ierr = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> >>>>> device = rank % devCount;
> >>>>> }
> >>>>> err = cudaSetDevice(device);
> >>>>>
> >>>>> If we rely on the first CUDA call to do the initialization, how could
> >>>>> CUDA know about this MPI information?
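
To make the mapping in the quoted snippet concrete, here is a small sketch of
the round-robin rule `device = rank % devCount` as a standalone C function,
without the CUDA or MPI calls. DemoPickDevice is a hypothetical helper, not a
PETSc routine, and the -1 return for "no device visible" is an assumption of
this sketch. Without an explicit cudaSetDevice() call, the CUDA runtime
normally leaves every process on device 0, which is why the rank has to be
supplied from the MPI side.

```c
#include <assert.h>

/* Hypothetical helper mirroring `device = rank % devCount` from the snippet.
   Returns -1 when no device is visible (an assumption of this sketch). */
static int DemoPickDevice(int rank, int devCount)
{
  assert(rank >= 0);            /* MPI ranks are non-negative */
  if (devCount <= 0) return -1; /* nothing to select */
  return rank % devCount;       /* ranks wrap around the available devices */
}
```

With 2 devices, ranks 0, 1, 2, 3 map to devices 0, 1, 0, 1, so processes on
one node are spread evenly as long as the rank numbering is node-contiguous.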
> >>>>
> >>>> It doesn't, so it does whatever it does (which may be dumb).
> >>>>
> >>>> Are you proposing something?
> >>>>
> >>>> No. My test failed in CI with -cuda_initialize 0 on frog but I could not
> >>>> reproduce it. I'm investigating.
> >>>>
> >>>> Barry
> >>>>
> >>>>>
> >>>>> --Junchao Zhang
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>> Fixed the docs. Thanks for pointing out the lack of clarity
> >>>>>
> >>>>>
> >>>>>> On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> Barry,
> >>>>>>
> >>>>>> I saw you added these in init.c
> >>>>>>
> >>>>>>
> >>>>>> + -cuda_initialize - do the initialization in PetscInitialize()
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Notes:
> >>>>>>
> >>>>>> Initializing cuBLAS takes about 1/2 second there it is done by
> >>>>>> default in PetscInitialize() before logging begins
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> But I did not understand the alternative: with -cuda_initialize 0, when
> >>>>>> will CUDA be initialized?
> >>>>>> --Junchao Zhang
> >>>>>
> >>>
> >>
> >
>