> On Sep 19, 2019, at 9:30 PM, Balay, Satish <[email protected]> wrote:
>
> On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
>
>>
>>
>>> On Sep 19, 2019, at 9:11 PM, Balay, Satish <[email protected]> wrote:
>>>
>>> On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
>>>
>>>>
>>>> This should be reported on gitlab, not in email.
>>>>
>>>> Anyway, my interpretation is that the machine runs low on swap space, so
>>>> the OS is killing things. Satish and I once sat down and checked the
>>>> system logs on one machine that had little swap, and we saw system messages
>>>> about low swap at exactly the time the tests were killed. Satish is
>>>> resistant to increasing swap; I don't know why. Other times we see these
>>>> kills and they may not be due to swap, but then they are a mystery.
>>>
>>> That was on bsd.
>>>
>>> This machine has 8 GB of swap, which should be sufficient. And this issue
>>> [on this machine] was triggered only by this MR, which was weird.
>>
>> Does it happen every time to the same examples?
>
> I might have tried restarting a job once or twice - so yes. And 3 jobs [from a
> single pipeline] failed on this box.
>>
>> If you login and run that one test does it happen?
>
> I've only tried running the failing tests - and they ran fine. Didn't try
> 'make alltests' at that time.
OK, so it could be related to load issues somehow. Well, with the better code
we'll have a better handle on where it is starting.
>
> Satish
>
>>
>> If the MR is changing scatter code, could it have broken something?
>>
>> We need to know why this is happening. Otherwise our test system will
>> drive us nuts with errors whose origin we don't have a clue about.
>>
>>
>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>
>> So MPI thinks MPI_Abort is called with a return code of 1. PETSc calls
>> MPI_Abort in a truckload of places, usually with a return code of 1. So
>> the first thing that needs to be done is to fix PETSc so that each different
>> call to MPI_Abort has a unique return code. Then, in theory at least, we
>> know where it got aborted.
>>
>> include/petscerror.h:#define CHKERRABORT(comm,ierr) do {if
>> (PetscUnlikely(ierr))
>> {PetscError(PETSC_COMM_SELF,__LINE__,PETSC_FUNCTION_NAME,__FILE__,ierr,PETSC_ERROR_REPEAT,"
>> ");MPI_Abort(comm,ierr);}} while (0)
>> include/petscerror.h: or CHKERRABORT(comm,n) to have MPI_Abort() returned
>> immediately.
>> src/contrib/fun3d/incomp/flow.c: /*ierr = MPI_Abort(MPI_COMM_WORLD,1);*/
>> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
>> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
>> src/docs/tao_tex/manual/part1.tex:application called
>> MPI_Abort(MPI_COMM_WORLD, 73) - process 0
>> src/docs/tex/manual/developers.tex: \item
>> \lstinline{PetscMPIAbortErrorHandler()}, which calls \lstinline{MPI_Abort()}
>> after printing the error message; and
>> src/snes/examples/tests/ex12f.F: call
>> MPI_Abort(PETSC_COMM_WORLD,0,ierr)
>> src/snes/examples/tutorials/ex30.c: MPI_Abort(PETSC_COMM_SELF,1);
>> src/sys/error/adebug.c: MPI_Abort(PETSC_COMM_WORLD,1);
>> src/sys/error/err.c: If this is called from the main() routine we call
>> MPI_Abort() instead of
>> src/sys/error/err.c: if (ismain) MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
>> src/sys/error/errstop.c: MPI_Abort(PETSC_COMM_WORLD,n);
>> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c: MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/signal.c: if (ierr) MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/signal.c: MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
>> src/sys/fsrc/somefort.F:! when MPI_Abort() is called directly by
>> CHKERRQ(ierr);
>> src/sys/fsrc/somefort.F: call MPI_Abort(comm,ierr,nierr)
>> src/sys/ftn-custom/zutils.c: MPI_Abort(PETSC_COMM_WORLD,1);
>> src/sys/ftn-custom/zutils.c: MPI_Abort(PETSC_COMM_WORLD,1);
>> src/sys/logging/utils/stagelog.c: MPI_Abort(MPI_COMM_WORLD,
>> PETSC_ERR_SUP);
>> src/sys/mpiuni/mpi.c:int MPI_Abort(MPI_Comm comm,int errorcode)
>> src/sys/mpiuni/mpitime.c: if (!QueryPerformanceCounter(&StartTime))
>> MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/mpiuni/mpitime.c: if (!QueryPerformanceFrequency(&PerfFreq))
>> MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/mpiuni/mpitime.c: if (!QueryPerformanceCounter(&CurTime))
>> MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/objects/init.c: in the debugger hence we call abort() instead of
>> MPI_Abort().
>> src/sys/objects/init.c:void Petsc_MPI_AbortOnError(MPI_Comm
>> *comm,PetscMPIInt *flag,...)
>> src/sys/objects/init.c: if (ierr) MPI_Abort(*comm,*flag); /* hopeless so
>> get out */
>> src/sys/objects/init.c: ierr =
>> MPI_Comm_create_errhandler(Petsc_MPI_AbortOnError,&err_handler);CHKERRQ(ierr);
>> src/sys/objects/pinit.c: MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/objects/pinit.c: MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/objects/pinit.c: MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/objects/pinit.c: MPI_Abort(MPI_COMM_WORLD,1);
>> src/ts/examples/tutorials/ex48.c: if (dim < 2)
>> {MPI_Abort(MPI_COMM_WORLD,1); return;} /* this is needed so that the clang
>> static analyzer does not generate a warning about variables used by not set
>> */
>> src/vec/vec/examples/tests/ex32f.F: call
>> MPI_Abort(MPI_COMM_WORLD,0,ierr)
>> src/vec/vec/interface/dlregisvec.c: MPI_Abort(MPI_COMM_SELF,1);
>> src/vec/vec/interface/dlregisvec.c: MPI_Abort(MPI_COMM_SELF,1);
>> src/vec/vec/utils/comb.c: MPI_Abort(MPI_COMM_SELF,1);
>> src/vec/vec/utils/comb.c: MPI_Abort(MPI_COMM_SELF,1);
>>
>> Junchao,
>>
>> Maybe you could fix this and make a MR? I don't know how to organize the
>> numbering. Should we have a central list of all numbers with macros in
>> petscerror.h like
>>
>> #define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10
>>
>> etc?
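
A minimal sketch of what such a central list might look like, assuming plain macros plus a lookup helper. Only PETSC_MPI_ABORT_MPIU_MaxIndex_Local and its value 10 come from the message above; the other two macros and both function names are hypothetical and are not part of PETSc.

```c
/* Hypothetical central list of abort codes. Only the first name and value
   appear in the thread; the other two are invented for illustration. */
#define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10
#define PETSC_MPI_ABORT_EXAMPLE_SITE_A      11 /* hypothetical */
#define PETSC_MPI_ABORT_EXAMPLE_SITE_B      12 /* hypothetical */

/* Map the N printed in "application called MPI_Abort(comm, N)" back to the
   call site that used it; unlisted codes map to "unknown". */
const char *abort_site_name(int code)
{
  switch (code) {
  case PETSC_MPI_ABORT_MPIU_MaxIndex_Local: return "MPIU_MaxIndex_Local";
  case PETSC_MPI_ABORT_EXAMPLE_SITE_A:      return "example site A";
  case PETSC_MPI_ABORT_EXAMPLE_SITE_B:      return "example site B";
  default:                                  return "unknown";
  }
}

/* On POSIX systems only the low 8 bits of an exit status reach the parent,
   so codes in 1..255 stay distinguishable if the launcher passes them on. */
int abort_code_fits_exit_status(int code)
{
  return code >= 1 && code <= 255;
}
```

With a table like this, a CI log line ending in "MPI_Abort(MPI_COMM_WORLD, 10)" could be traced to one call site instead of to any of the many sites that currently pass 1. Whether the launcher also forwards the code as its exit status depends on the MPI implementation.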
>>
>>
>>
>>
>>
>>
>>
>> Barry
>>
>>>
>>> Satish
>>>
>>>
>>>>
>>>> You can rerun the particular job by clicking on the little circle after
>>>> the job name and see what happens the next time.
>>>>
>>>> Barry
>>>>
>>>> It may be that the -j and -l options for some systems need to be adjusted
>>>> down slightly, and this will prevent these kills. Satish, can that be done
>>>> in the examples/arch-ci* files with configure options, or in the runner
>>>> files, or in the yaml file?
>>>
>>> configure has options --with-make-np --with-make-test-np --with-make-load
>>>
>>> Satish
>>>
>>>>
>>>>
>>>>
>>>>> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <[email protected]> wrote:
>>>>>
>>>>> All failed tests just said "application called MPI_Abort" and had no
>>>>> stack trace. They are not cuda tests. I updated SF to avoid CUDA related
>>>>> initialization if not needed. Let's see the new test result.
>>>>> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
>>>>> # application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>>>>
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <[email protected]>
>>>>> wrote:
>>>>>
>>>>> "Failed" means nothing; send a link or cut and paste the error.
>>>>>
>>>>> It could be that since we have multiple separate tests running at the
>>>>> same time they overload the GPU or cause some inconsistent behavior that
>>>>> doesn't appear every time the tests are run.
>>>>>
>>>>> Barry
>>>>>
>>>>> Maybe we need to sequentialize all the tests that use the GPUs. We just
>>>>> trust GNU make for the parallelism; maybe you could somehow add
>>>>> dependencies to get GNU make to achieve this?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <[email protected]> wrote:
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>> On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <[email protected]> wrote:
>>>>>>>
>>>>>>> I saw your update. In PetscCUDAInitialize we have
>>>>>>>
>>>>>>> /* First get the device count */
>>>>>>> err = cudaGetDeviceCount(&devCount);
>>>>>>>
>>>>>>> /* next determine the rank and then set the device via a mod */
>>>>>>> ierr = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
>>>>>>> device = rank % devCount;
>>>>>>> }
>>>>>>> err = cudaSetDevice(device);
>>>>>>>
>>>>>>> If we rely on the first CUDA call to do initialization, how could CUDA
>>>>>>> know about this MPI stuff?
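
For reference, the rank-to-device mapping in the excerpt above can be sketched without CUDA or MPI. The helper name select_device is hypothetical, and the guard against a zero device count is an addition for illustration (in C, rank % 0 is undefined behavior); it is not taken from PetscCUDAInitialize.

```c
/* Hypothetical helper: pick a device index for an MPI rank by round-robin,
   mirroring "device = rank % devCount" from the quoted excerpt.
   Returns -1 when the rank is negative or no device is available, so the
   caller can skip cudaSetDevice() instead of dividing by zero. */
int select_device(int rank, int devCount)
{
  if (rank < 0 || devCount <= 0) return -1;
  return rank % devCount;
}
```

With two devices, ranks 0, 1, 2, 3 map to devices 0, 1, 0, 1. A plain first CUDA call has no rank information, so it cannot reproduce this spreading on its own.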
>>>>>>
>>>>>> It doesn't, so it does whatever it does (which may be dumb).
>>>>>>
>>>>>> Are you proposing something?
>>>>>>
>>>>>> No. My test failed in CI with -cuda_initialize 0 on frog, but I could not
>>>>>> reproduce it. I'm investigating.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>>
>>>>>>> --Junchao Zhang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Fixed the docs. Thanks for pointing out the lack of clarity
>>>>>>>
>>>>>>>
>>>>>>>> On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Barry,
>>>>>>>>
>>>>>>>> I saw you added these in init.c
>>>>>>>>
>>>>>>>>
>>>>>>>> + -cuda_initialize - do the initialization in PetscInitialize()
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>>
>>>>>>>> Initializing cuBLAS takes about 1/2 second there it is done by
>>>>>>>> default in PetscInitialize() before logging begins
>>>>>>>>
>>>>>>>> But I did not get what happens otherwise: with -cuda_initialize 0, when
>>>>>>>> will CUDA be initialized?
>>>>>>>> --Junchao Zhang