Hi, Lisandro,

It is very cool to see that you can make PETSc dance with SLURM. From your pseudo-example, my comments are:
* Do we need a type PetscSigSet instead of an explicit int?
* Why do PetscSignalBegin/End() have different argument types? Many PETSc XxxBegin/End() routines take the same arguments, which is easier for users to remember.
* Why do you need PetscSigMask in the public header? Can the user call PetscSignalClear(PETSC_SIGUSR1) instead of PetscSignalClear(PetscSigMask(PETSC_SIGUSR1))?

I like fewer and simpler public APIs. Just my two cents.

Thanks.
--Junchao Zhang

On Thu, Mar 5, 2020 at 4:00 PM Lisandro Dalcin wrote:
> I've implemented some lightweight signal handling facilities. See the
> attached header and implementation files for a taste of the current API,
> and the pseudo-example code showing how to use it, briefly described below:
>
> Right now I'm using it to interact with the job scheduler during
> (explicit) timestepping. I have begin/end signal handling calls around
> TSSolve(). A PostStep() routine catches signals and handles them this way:
>
> * If SIGINT or SIGTERM, I dump a restart file and set converged reason to
>   USER to stop.
> * If SIGUSR1, I dump a restart file and continue timestepping.
> * If SIGUSR2, I dump a VTK file and continue timestepping.
>
> I can send signals to the job with `scancel -s SIG<NAME>`. When the job
> time allocation is about to expire, SLURM first sends SIGTERM and waits some
> time before SIGKILL. That time is enough to get a restart file from the
> last step, stop timestepping and finalize gracefully.
>
> I'm not 100% happy with the API, maybe I should make it easier to use. For
> example, I could define each PETSC_SIGXXX so that I don't need the
> macro PetscSigMask(). That would complicate a bit the mapping signal enum
> -> name string, though. I could also implement PetscSignalRaise(), it may
> be useful, but I'm not sure.
>
> Do you think this may be of some value for core PETSc? I'm asking before
> submitting a MR because that would require writing some docs, I don't want
> to do the doc work before knowing your opinion first :-).
>
> Regards,
>
> --
> Lisandro Dalcin
> ============
> Research Scientist
> Extreme Computing Research Center (ECRC)
> King Abdullah University of Science and Technology (KAUST)
> http://ecrc.kaust.edu.sa/
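
To make the API questions above concrete, here is a minimal sketch in plain POSIX C of the kind of interface being discussed. It is not the attached implementation, which is not reproduced in this message. All names (MySignalBegin, MySignalEnd, MySignalCheck, MY_SIGxxx) are hypothetical. The sketch assumes each signal constant is defined as its own bit, so no mask macro is needed, and that Begin and End take the same plain int argument.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constants: each one is already a single bit, so callers
   can OR them together and never need a separate mask macro. */
enum {
  MY_SIGINT  = 1 << 0,
  MY_SIGTERM = 1 << 1,
  MY_SIGUSR1 = 1 << 2,
  MY_SIGUSR2 = 1 << 3,
  MY_SIGALL  = 0xF
};

/* Bit i of a set corresponds to my_signums[i]. */
static const int my_signums[] = {SIGINT, SIGTERM, SIGUSR1, SIGUSR2};
#define MY_NSIGNALS (sizeof(my_signums) / sizeof(my_signums[0]))

/* One flag per signal; the handler only stores to sig_atomic_t flags. */
static volatile sig_atomic_t my_pending[MY_NSIGNALS];
static struct sigaction my_saved[MY_NSIGNALS];

static void my_handler(int signum)
{
  for (size_t i = 0; i < MY_NSIGNALS; i++)
    if (my_signums[i] == signum) my_pending[i] = 1;
}

/* Start catching the signals in `set` (an OR of MY_SIGxxx bits). */
int MySignalBegin(int set)
{
  struct sigaction sa;
  assert((set & ~MY_SIGALL) == 0); /* reject unknown bits */
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = my_handler;
  sa.sa_flags = SA_RESTART;
  sigemptyset(&sa.sa_mask);
  for (size_t i = 0; i < MY_NSIGNALS; i++) {
    if (!(set & (1 << i))) continue;
    my_pending[i] = 0;
    if (sigaction(my_signums[i], &sa, &my_saved[i]) != 0) return -1;
  }
  return 0;
}

/* Restore the previous handlers; takes the same argument as Begin. */
int MySignalEnd(int set)
{
  assert((set & ~MY_SIGALL) == 0);
  for (size_t i = 0; i < MY_NSIGNALS; i++) {
    if (!(set & (1 << i))) continue;
    if (sigaction(my_signums[i], &my_saved[i], NULL) != 0) return -1;
    my_pending[i] = 0;
  }
  return 0;
}

/* Return the subset of `set` that was caught, clearing those flags. */
int MySignalCheck(int set)
{
  int caught = 0;
  for (size_t i = 0; i < MY_NSIGNALS; i++) {
    if ((set & (1 << i)) && my_pending[i]) {
      my_pending[i] = 0;
      caught |= 1 << i;
    }
  }
  return caught;
}
```

In a post-step callback one would then poll with something like MySignalCheck(MY_SIGINT | MY_SIGTERM) to decide whether to write a restart file and stop, and MySignalCheck(MY_SIGUSR1) to write a restart file and continue. The trade-off mentioned in the quoted message still applies: with bit-valued constants, mapping a constant back to a name string needs a lookup over bit positions rather than a direct array index.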
