[slurm-users] Re: [ext] scrontab question

2024-05-07 Thread Hagdorn, Magnus Karl Moritz via slurm-users
Hm, strange. I don't see a problem with the time specs, although I would use */5 * * * * to run something every 5 minutes. In my scrontab I also specify a partition, etc., but I don't think that is necessary. Regards, Magnus. On Tue, 2024-05-07 at 12:06 -0500, Sandor via slurm-users wrote: > I am
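
A minimal scrontab sketch of that suggestion, using the script path from the thread (the partition name and time limit are placeholders, not from the original message):

    # edited via `scrontab -e`; #SCRON lines pass sbatch-style options to the entry below
    #SCRON --partition=batch
    #SCRON --time=00:05:00
    */5 * * * * /directory/subdirectory/crontest.sh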

[slurm-users] scrontab question

2024-05-07 Thread Sandor via slurm-users
I am working out the details of scrontab. My initial testing is giving me an unsolvable question. Within the scrontab editor I have the following example from the Slurm documentation: 0,5,10,15,20,25,30,35,40,45,50,55 * * * * /directory/subdirectory/crontest.sh When I save it, scrontab marks the line

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Ole Holm Nielsen via slurm-users
On 5/7/24 15:32, Henderson, Brent via slurm-users wrote: Over the past few days I grabbed some time on the nodes and ran for a few hours. Looks like I *can* still hit the issue with cgroups disabled. The incident rate was 8 out of >11k jobs, so it dropped by an order of magnitude or so. Guessing

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Henderson, Brent via slurm-users
Over the past few days I grabbed some time on the nodes and ran for a few hours. Looks like I *can* still hit the issue with cgroups disabled. The incident rate was 8 out of >11k jobs, so it dropped by an order of magnitude or so. Guessing that exonerates cgroups as the cause, but possibly just a good

[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Davide DelVento via slurm-users
Are you seeking something simple rather than sophisticated? If so, you can use the controller's local disk for StateSaveLocation and set up a cron job (on the same node or somewhere else) to copy that data out via e.g. rsync and put it where you need it (NFS?) for the backup control node to use
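
A sketch of that cron-plus-rsync approach, with a hypothetical hostname and state directory (the actual StateSaveLocation and backup destination are not given in the thread):

    # root crontab on the primary controller: mirror the state directory every minute
    * * * * * rsync -a --delete /var/spool/slurmctld/ backup-ctl:/var/spool/slurmctld/

The simplicity comes with a trade-off: a periodic copy can lag behind the live state or catch the directory mid-write, so the backup controller may take over with slightly stale job state.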

[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Fabio Ranalli via slurm-users
You can try DRBD (https://linbit.com/drbd/) or a shared-disk (clustered) FS like GFS2, OCFS2, etc.: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/configuring_gfs2_file_systems/index
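
For illustration, a minimal DRBD resource sketch that would replicate the block device backing the state directory between two controllers (all hostnames, devices, and addresses here are hypothetical):

    # /etc/drbd.d/slurmstate.res
    resource slurmstate {
      device    /dev/drbd0;       # replicated device, mounted at StateSaveLocation
      disk      /dev/vdb1;        # backing disk on each node
      meta-disk internal;
      on ctl-primary { address 192.168.1.10:7789; }
      on ctl-backup  { address 192.168.1.11:7789; }
    }

With plain DRBD only one node mounts the device at a time, so failover also has to promote and mount it; a clustered FS such as GFS2 or OCFS2 avoids that step at the cost of more setup.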

[slurm-users] "token expired" errors with auth/slurm

2024-05-07 Thread Fabio Ranalli via slurm-users
Hi there, We've updated to 23.11.6 and replaced MUNGE with SACK. Performance and stability have both been pretty good, but we're occasionally seeing this in the slurmctld.log: [2024-05-07T03:50:16.638] error: decode_jwt: token expired at 1715053769 [2024-05-07T03:50:16.638] error:
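
For context, "replaced MUNGE with SACK" corresponds roughly to these slurm.conf lines (a sketch of the documented 23.11 internal-auth setup, not the poster's actual configuration):

    # slurm.conf -- use Slurm's internal JWT-based auth instead of MUNGE
    AuthType=auth/slurm
    CredType=cred/slurm
    # all daemons must share the same key file, e.g. /etc/slurm/slurm.key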

[slurm-users] StateSaveLocation and Slurm HA

2024-05-07 Thread Pierre Abele via slurm-users
Hi all, I am looking for a clean way to set up Slurm's native high-availability feature. I am managing a Slurm cluster with one control node (hosting both slurmctld and slurmdbd), one login node, and a few dozen compute nodes. I have a virtual machine that I want to set up as a backup control
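
The native HA feature in question is configured along these lines (a sketch; the hostnames and path are placeholders):

    # slurm.conf -- the first SlurmctldHost is the primary, the second the backup
    SlurmctldHost=ctl-primary
    SlurmctldHost=ctl-backup
    # must be readable and writable by both controllers
    StateSaveLocation=/shared/slurm/statesave

The crux of the thread is what storage to put behind StateSaveLocation so that both hosts can reach it.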

[slurm-users] Re: Convergence of Kube and Slurm?

2024-05-07 Thread Bjørn-Helge Mevik via slurm-users
Tim Wickberg via slurm-users writes: > [1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands for "Slurm in Kubernetes". And not at all inspired by Slinky Dog in Toy Story, I guess. :D -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University