Re: [slurm-users] epilog when job is killed for max time

2018-11-08 Thread Josep Manel Andrés Moscardó
Hi, Somebody else gave me this piece of code (I hope he doesn't mind me sharing it :) , at least it is how they do it: #!/bin/bash #SBATCH --signal=B:USR1@300 #<-- This will make Slurm send signal USR1 to the bash process 300 seconds before the time limit #SBATCH -t 00:06:00 resubmit()
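The mechanism behind `--signal=B:USR1@300` can be sketched as follows. This is not the script from the post, which is cut off in this listing; it is only a minimal illustration of trapping USR1 in a batch shell. The handler name `on_usr1` and the variable `got_signal` are hypothetical, and the signal is sent by the script to itself to stand in for Slurm's warning.

```shell
#!/bin/bash
# Sketch only: NOT the script from the post above, which is truncated.
# on_usr1 and got_signal are hypothetical names used for illustration.
got_signal=0

on_usr1() {
    # In a real job, checkpointing or resubmission logic would go here.
    got_signal=1
    echo "caught USR1"
}

# Register the handler for the signal named in --signal=B:USR1@300.
trap on_usr1 USR1

# Stand-in for Slurm's warning: send USR1 to this shell ourselves.
kill -USR1 $$

echo "got_signal=$got_signal"
```

In a real batch script the shell waits for a foreground command to finish before running a trap, so long-running work is usually started in the background and followed by `wait`, which lets the handler run as soon as the signal arrives.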

Re: [slurm-users] Seff error with Slurm-18.08.1

2018-11-08 Thread Miguel A . Sánchez
Hi, and thanks for all your answers, and sorry for the delay in replying. Yesterday I installed Slurm-18.08.3 on the controller machine to check whether the Seff command works with this latest release. The behavior has improved, but I still receive an error message: # /usr/local/slurm-18.

Re: [slurm-users] Seff error with Slurm-18.08.1

2018-11-08 Thread Marcus Wagner
Hi Miguel, this is because SchedMD changed the stats field. rss_max no longer exists (compare line 225 of seff). You need to evaluate the field stats{tres_usage_in_max} and take the value after '2='; this is the memory value in bytes instead of kbytes, so it should be divided by 1024
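The conversion described above can be sketched in shell. This is an illustration, not the change made to seff (which is a Perl script). It assumes the field is a comma-separated list of `id=value` pairs; the function name `tres_mem_kb` and the sample string are made up for the example.

```shell
#!/bin/bash
# Hypothetical helper: pull the memory entry (the value after '2=') out of
# a tres_usage_in_max-style string and convert it from bytes to kbytes.
# Assumes a comma-separated list of id=value pairs.
tres_mem_kb() {
    local tres="$1" entry
    local IFS=','
    for entry in $tres; do
        case "$entry" in
            2=*)
                # Strip the '2=' prefix, then divide bytes by 1024.
                echo $(( ${entry#2=} / 1024 ))
                return 0
                ;;
        esac
    done
    return 1   # no '2=' entry present
}

# Sample input is invented for illustration only.
tres_mem_kb "1=1250,2=2097152,6=4096"
```

The `2=*` pattern is anchored at the start of each entry, so an id such as `12=` is not mistaken for the memory entry.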

Re: [slurm-users] epilog when job is killed for max time

2018-11-08 Thread Noam Bernstein
Thanks - that's an awesome, yet horrible, hack :) Noam > On Nov 8, 2018, at 3:26 AM, Josep Manel Andrés Moscardó > wrote: > > Hi, > Somebody else gave me this piece of code (I hope he doesn't mind me sharing > it :) , at least it

[slurm-users] Slurm missing non primary group memberships

2018-11-08 Thread Aravindh Sampathkumar
Hello all. I'm seeing something strange related to group memberships and how they affect Slurm. I'd appreciate any ideas to understand what is going on. It appears that only the primary group of the user is propagated when Slurm runs a job. The additional group memberships vanish. This is not expected

Re: [slurm-users] Seff error with Slurm-18.08.1

2018-11-08 Thread Paddy Doyle
Hi all, It looks like we can use the API to avoid having to manually parse the '2=' value from the stats{tres_usage_in_max} value. I've submitted a bug report and patch: https://bugs.schedmd.com/show_bug.cgi?id=6004 The minimal changes needed would be in the attached seff.patch. Hope that helps

[slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Brian Andrus
All, I am seeing what looks like the same issue as https://bugs.schedmd.com/show_bug.cgi?id=2119 where slurmctld is not picking up new accounts unless it is restarted. I have 4 clusters (non-federated), all using the same slurmdbd. When I added an association for user name=me cluster=DevOps accou

Re: [slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Marcin Stolarek
I have had a very similar issue for quite some time and have been unable to find its root cause. Are you using sssd and AD as a data source with only a subtree of entries searched? This is my case. Did you disable user enumeration? That is also what I have. I didn’t find any evidence that it’s related but... may

Re: [slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Brian Andrus
We use sssd with realmd; enumeration is off. Brian Andrus On 11/8/2018 11:26 AM, Marcin Stolarek wrote: I have had a very similar issue for quite some time and have been unable to find its root cause. Are you using sssd and AD as a data source with only a subtree of entries searched? This is my case. Di

Re: [slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Chris Samuel
On Friday, 9 November 2018 5:38:22 AM AEDT Brian Andrus wrote: > Where, slurmctld is not picking up new accounts unless it is restarted. This is usually because slurmdbd cannot connect back to the slurmctld on the management node to do the RPC to tell it that a new account/user/etc has appeared

[slurm-users] virtual memory limit exceeded

2018-11-08 Thread Noam Bernstein
Can anyone shed some light on where the _virtual_ memory limit comes from? We're getting jobs killed with the message slurmstepd: error: Step 3664.0 exceeded virtual memory limit (79348101120 > 72638634393), being killed Is this a limit that's dictated by cgroup.conf or by some srun option (like

Re: [slurm-users] Seff error with Slurm-18.08.1

2018-11-08 Thread Marcus Wagner
Thanks Paddy, just something learned again ;) Best Marcus On 11/08/2018 05:07 PM, Paddy Doyle wrote: Hi all, It looks like we can use the API to avoid having to manually parse the '2=' value from the stats{tres_usage_in_max} value. I've submitted a bug report and patch: https://bugs.schedm