> On Sep 19, 2019, at 9:30 PM, Balay, Satish <[email protected]> wrote:
> 
> On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
> 
>> 
>> 
>>> On Sep 19, 2019, at 9:11 PM, Balay, Satish <[email protected]> wrote:
>>> 
>>> On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
>>> 
>>>> 
>>>>  This should be reported on gitlab, not in email.
>>>> 
>>>>  Anyways, my interpretation is that the machine runs low on swap space so 
>>>> the OS is killing things. Satish and I once sat down and checked the 
>>>> system logs on one machine that had little swap, and we saw system messages 
>>>> about low swap at exactly the time the tests were killed. Satish is 
>>>> resistant to increasing swap; I don't know why. Other times we see these 
>>>> kills and they may not be due to swap, but then they are a mystery.
>>> 
>>> That was on bsd.
>>> 
>>> This machine has 8gb swap and should be sufficient. And this issue [on this 
>>> machine] was triggered
>>> only by this MR - which was weird.
>> 
>>   Does it happen every time to the same examples?
> 
> I might have tried restarting a job once or twice - so yes. And 3 jobs [from a 
> single pipeline] failed on this box.
>> 
>>   If you login and run that one test does it happen?
> 
> I've only tried running the failing tests - and they ran fine. Didn't try 
> 'make alltests' at that time.

  Ok, so it could be related to load issues somehow. Well with the better code 
we'll have a better handle on where it is starting.
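
  To make the proposal quoted below concrete, here is a minimal sketch of a
central list of abort codes. Only the MaxIndex_Local name and the value 10 come
from the suggestion below; the DEMO_ prefix, the other two entries, and both
helper functions are hypothetical and not part of PETSc. A real call site would
pass the code to MPI_Abort(comm,code); whether the launcher reports that code
back is up to the MPI implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical central list of abort codes. Only the first name and value
   echo the proposal in this thread; everything else is illustrative. */
#define DEMO_MPI_ABORT_MPIU_MaxIndex_Local 10
#define DEMO_MPI_ABORT_FP_TRAP             11 /* assumed entry */
#define DEMO_MPI_ABORT_SIGNAL              12 /* assumed entry */

typedef struct {
  int         code; /* value a call site would hand to MPI_Abort() */
  const char *site; /* human-readable origin of the abort */
} DemoAbortEntry;

static const DemoAbortEntry demo_abort_table[] = {
  {DEMO_MPI_ABORT_MPIU_MaxIndex_Local, "MPIU_MaxIndex_Local"},
  {DEMO_MPI_ABORT_FP_TRAP,             "floating point trap handler"},
  {DEMO_MPI_ABORT_SIGNAL,              "signal handler"}
};
static const size_t demo_abort_count = sizeof(demo_abort_table)/sizeof(demo_abort_table[0]);

/* Map a code seen in the test log back to the place that raised it. */
const char *DemoAbortSite(int code)
{
  size_t i;
  for (i = 0; i < demo_abort_count; i++) {
    if (demo_abort_table[i].code == code) return demo_abort_table[i].site;
  }
  return "unknown";
}

/* Return 1 when every code is distinct and lies in 1..255, else 0.
   The 1..255 range is used because a POSIX exit status keeps only
   the low 8 bits. */
int DemoAbortCodesAreValid(void)
{
  size_t i, j;
  assert(demo_abort_count > 0); /* an empty table would make the check vacuous */
  for (i = 0; i < demo_abort_count; i++) {
    if (demo_abort_table[i].code < 1 || demo_abort_table[i].code > 255) return 0;
    for (j = i + 1; j < demo_abort_count; j++) {
      if (demo_abort_table[i].code == demo_abort_table[j].code) return 0;
    }
  }
  return 1;
}
```

  A test harness could call DemoAbortSite() on the number printed in
"application called MPI_Abort(MPI_COMM_WORLD, N)" to name the call site, and
DemoAbortCodesAreValid() could run once in a unit test to catch a reused number.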
> 
> Satish
> 
>> 
>>   If the MR changes the scatter code, could it have broken something?
>> 
>>   We need to know why this is happening. Otherwise our test system will 
>> drive us nuts with errors whose origin we don't have a clue about.
>> 
>> 
>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>> 
>>  So MPI thinks MPI_Abort is called with a return code of 1. PETSc calls 
>> MPI_Abort in a truckload of places, usually with a return code of 1. So 
>> the first thing that needs to be done is to fix PETSc so each different call to 
>> MPI_Abort has a unique return code. Then, in theory at least, we know where it 
>> got aborted.
>> 
>> include/petscerror.h:#define CHKERRABORT(comm,ierr) do {if (PetscUnlikely(ierr)) {PetscError(PETSC_COMM_SELF,__LINE__,PETSC_FUNCTION_NAME,__FILE__,ierr,PETSC_ERROR_REPEAT," ");MPI_Abort(comm,ierr);}} while (0)
>> include/petscerror.h:    or CHKERRABORT(comm,n) to have MPI_Abort() returned immediately.
>> src/contrib/fun3d/incomp/flow.c:    /*ierr = MPI_Abort(MPI_COMM_WORLD,1);*/
>> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
>> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
>> src/docs/tao_tex/manual/part1.tex:application called MPI_Abort(MPI_COMM_WORLD, 73) - process 0
>> src/docs/tex/manual/developers.tex:  \item \lstinline{PetscMPIAbortErrorHandler()}, which calls \lstinline{MPI_Abort()} after printing the error message; and
>> src/snes/examples/tests/ex12f.F:        call MPI_Abort(PETSC_COMM_WORLD,0,ierr)
>> src/snes/examples/tutorials/ex30.c:  MPI_Abort(PETSC_COMM_SELF,1);
>> src/sys/error/adebug.c:  MPI_Abort(PETSC_COMM_WORLD,1);
>> src/sys/error/err.c:      If this is called from the main() routine we call MPI_Abort() instead of
>> src/sys/error/err.c:  if (ismain) MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
>> src/sys/error/errstop.c:  MPI_Abort(PETSC_COMM_WORLD,n);
>> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/signal.c:  if (ierr) MPI_Abort(PETSC_COMM_WORLD,0);
>> src/sys/error/signal.c:  MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
>> src/sys/fsrc/somefort.F:!     when MPI_Abort() is called directly by CHKERRQ(ierr);
>> src/sys/fsrc/somefort.F:      call MPI_Abort(comm,ierr,nierr)
>> src/sys/ftn-custom/zutils.c:    MPI_Abort(PETSC_COMM_WORLD,1);
>> src/sys/ftn-custom/zutils.c:      MPI_Abort(PETSC_COMM_WORLD,1);
>> src/sys/logging/utils/stagelog.c:    MPI_Abort(MPI_COMM_WORLD, PETSC_ERR_SUP);
>> src/sys/mpiuni/mpi.c:int MPI_Abort(MPI_Comm comm,int errorcode)
>> src/sys/mpiuni/mpitime.c:    if (!QueryPerformanceCounter(&StartTime)) MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/mpiuni/mpitime.c:    if (!QueryPerformanceFrequency(&PerfFreq)) MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/mpiuni/mpitime.c:  if (!QueryPerformanceCounter(&CurTime)) MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/objects/init.c:  in the debugger hence we call abort() instead of MPI_Abort().
>> src/sys/objects/init.c:void Petsc_MPI_AbortOnError(MPI_Comm *comm,PetscMPIInt *flag,...)
>> src/sys/objects/init.c:  if (ierr) MPI_Abort(*comm,*flag); /* hopeless so get out */
>> src/sys/objects/init.c:      ierr = MPI_Comm_create_errhandler(Petsc_MPI_AbortOnError,&err_handler);CHKERRQ(ierr);
>> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
>> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
>> src/ts/examples/tutorials/ex48.c:  if (dim < 2) {MPI_Abort(MPI_COMM_WORLD,1); return;} /* this is needed so that the clang static analyzer does not generate a warning about variables used by not set */
>> src/vec/vec/examples/tests/ex32f.F:        call MPI_Abort(MPI_COMM_WORLD,0,ierr)
>> src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
>> src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
>> src/vec/vec/utils/comb.c:    MPI_Abort(MPI_COMM_SELF,1);
>> src/vec/vec/utils/comb.c:      MPI_Abort(MPI_COMM_SELF,1);
>> 
>>  Junchao,
>> 
>>     Maybe you could fix this and make an MR? I don't know how to organize the 
>> numbering. Should we have a central list of all numbers with macros in 
>> petscerror.h like 
>> 
>> #define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10 
>> 
>> etc?
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>   Barry
>> 
>>> 
>>> Satish
>>> 
>>> 
>>>> 
>>>>  You can rerun the particular job by clicking on the little circle after 
>>>> the job name and see what happens the next time.
>>>> 
>>>>  Barry
>>>> 
>>>>  It may be that the -j and -l options for some systems need to be adjusted 
>>>> down slightly, and this will prevent these. Satish, can that be done in the 
>>>> examples/arch-ci* files with configure options, or in the runner files 
>>>> or in the yaml file?
>>> 
>>> configure has options --with-make-np --with-make-test-np --with-make-load
>>> 
>>> Satish
>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <[email protected]> wrote:
>>>>> 
>>>>> All failed tests just said "application called MPI_Abort" and had no 
>>>>> stack trace. They are not CUDA tests. I updated SF to avoid CUDA-related 
>>>>> initialization if not needed. Let's see the new test result.
>>>>> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
>>>>> # application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>>>> 
>>>>> 
>>>>> --Junchao Zhang
>>>>> 
>>>>> 
>>>>> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>> Failed? That means nothing; send a link or cut and paste the error.
>>>>> 
>>>>> It could be that since we have multiple separate tests running at the 
>>>>> same time they overload the GPU or cause some inconsistent behavior that 
>>>>> doesn't appear every time the tests are run.
>>>>> 
>>>>>  Barry
>>>>> 
>>>>> Maybe we need to sequentialize all the tests that use the GPUs. We just 
>>>>> trust gnumake for the parallelism; maybe you could somehow add 
>>>>> dependencies to get gnumake to achieve this?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <[email protected]> wrote:
>>>>>> 
>>>>>> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <[email protected]> wrote:
>>>>>>> 
>>>>>>> I saw your update. In PetscCUDAInitialize we have
>>>>>>> 
>>>>>>>     /* First get the device count */
>>>>>>>     err   = cudaGetDeviceCount(&devCount);
>>>>>>> 
>>>>>>>     /* next determine the rank and then set the device via a mod */
>>>>>>>     ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
>>>>>>>     device = rank % devCount;
>>>>>>>   }
>>>>>>>   err = cudaSetDevice(device);
>>>>>>> 
>>>>>>> If we rely on the first CUDA call to do initialization, how could CUDA 
>>>>>>> know about this MPI information?
>>>>>> 
>>>>>> It doesn't, so it does whatever it does (which may be dumb).
>>>>>> 
>>>>>> Are you proposing something?
>>>>>> 
>>>>>> No. My test failed in CI with -cuda_initialize 0 on frog but I could not 
>>>>>> reproduce it. I'm doing investigation. 
>>>>>> 
>>>>>> Barry
>>>>>> 
>>>>>>> 
>>>>>>> --Junchao Zhang
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <[email protected]> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Fixed the docs. Thanks for pointing out the lack of clarity
>>>>>>> 
>>>>>>> 
>>>>>>>> On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Barry,
>>>>>>>> 
>>>>>>>> I saw you added these in init.c
>>>>>>>> 
>>>>>>>> +  -cuda_initialize - do the initialization in PetscInitialize()
>>>>>>>> 
>>>>>>>> Notes:
>>>>>>>> 
>>>>>>>>  Initializing cuBLAS takes about 1/2 second, therefore it is done by 
>>>>>>>> default in PetscInitialize() before logging begins
>>>>>>>> 
>>>>>>>> But I did not understand the other case: with -cuda_initialize 0, when 
>>>>>>>> will CUDA be initialized?
>>>>>>>> --Junchao Zhang
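
The rank-to-device mapping in the PetscCUDAInitialize snippet quoted above can
be isolated as a small pure function, which makes it easy to see which ranks
end up sharing a GPU. This is only a sketch: the function name, the -1 return
for "no device visible", and the guard against a zero device count are
assumptions of this example, since the quoted snippet does not show the
surrounding checks.

```c
#include <assert.h>

/* Sketch of the "set the device via a mod" step from the quoted snippet.
   Returns the device index for this rank, or -1 when no device is visible.
   The -1 convention is an assumption of this sketch, not PETSc behavior. */
int DemoDeviceForRank(int rank, int devCount)
{
  assert(rank >= 0);            /* MPI ranks are non-negative */
  if (devCount <= 0) return -1; /* avoid computing rank % 0 */
  return rank % devCount;
}
```

With 4 ranks and 2 devices, ranks 0 and 2 map to device 0 and ranks 1 and 3 map
to device 1. Every test job starts counting at rank 0, so under this mapping
all concurrently running tests place their rank 0 on device 0, which fits the
concern above about several tests loading the same GPU at once.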
