[slurm-users] Re: srun weirdness

2024-05-15 Thread Dj Merrill via slurm-users

I completely missed that, thank you!

-Dj


Laura Hild via slurm-users wrote:

PropagateResourceLimitsExcept won't do it?

Sarlo, Jeffrey S wrote:

You might look at the PropagateResourceLimits and PropagateResourceLimitsExcept 
settings in slurm.conf
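For reference, a minimal slurm.conf sketch of those two settings (the limit names
shown are illustrative assumptions, not a recommendation):

# Either propagate no submission-host limits at all:
PropagateResourceLimits=NONE
# ...or propagate everything except selected limits (use one of the two, not both):
#PropagateResourceLimitsExcept=MEMLOCK,AS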

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun weirdness

2024-05-15 Thread Laura Hild via slurm-users
PropagateResourceLimitsExcept won't do it?



From: Dj Merrill via slurm-users
Sent: Wednesday, 15 May 2024 09:43
To: slurm-users@lists.schedmd.com
Subject: [EXTERNAL] [slurm-users] Re: srun weirdness

Thank you Hermann and Tom!  That was it.

The new cluster has a virtual memory limit on the login host, and the
old cluster did not.

It doesn't look like there is any way to set a default to override the
srun behaviour of passing those resource limits to the shell, so I may
consider removing those limits on the login host so folks don't have to
manually specify this every time.

I really appreciate the help!

-Dj


On 5/15/24 07:20, greent10--- via slurm-users wrote:
> Hi,
>
> When we first migrated to Slurm from PBS, one of the strangest issues we hit 
> was that ulimit settings are inherited from the submission host, which could 
> explain the difference between ssh'ing into the machine (and the default 
> ulimit being applied) and running a job via srun.
>
> You could use:
>
> srun --propagate=NONE --mem=32G --pty bash
>
> I still find Slurm inheriting ulimit and environment variables from the 
> submission host an odd default behaviour.
>
> Tom
>
> --
> Thomas Green Senior Programmer
> ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB
> Tel: +44 (0)29 208 79269 Fax: +44 (0)29 208 70734
> Email: green...@cardiff.ac.uk    Web: http://www.cardiff.ac.uk/arcca
>
> Thomas Green Uwch Raglennydd
> ARCCA, Adeilad Redwood, King Edward VII Avenue, Caerdydd, CF10 3NB
> Ffôn: +44 (0)29 208 79269    Ffacs: +44 (0)29 208 70734
> E-bost: green...@caerdydd.ac.uk    Gwefan: http://www.caerdydd.ac.uk/arcca
>
> -Original Message-
> From: Hermann Schwärzler via slurm-users 
> Sent: Wednesday, May 15, 2024 9:45 AM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] Re: srun weirdness
>
> External email to Cardiff University - Take care when replying/opening 
> attachments or links.
> Nid ebost mewnol o Brifysgol Caerdydd yw hwn - Cymerwch ofal wrth ateb/agor 
> atodiadau neu ddolenni.
>
>
>
> Hi Dj,
>
> could be a memory-limits related problem. What is the output of
>
>ulimit -l -m -v -s
>
> in both interactive job-shells?
>
> You are using cgroups-v1 now, right?
> In that case what is the respective content of
>
>/sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes
>
> in both shells?
>
> Regards,
> Hermann
>
>
> On 5/14/24 20:38, Dj Merrill via slurm-users wrote:
>> I'm running into a strange issue and I'm hoping another set of brains
>> looking at this might help.  I would appreciate any feedback.
>>
>> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8
>> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
>> 23.11.6 on Rocky Linux 9.4 machines.
>>
>> This works perfectly fine on the first cluster:
>>
>> $ srun --mem=32G --pty /bin/bash
>>
>> srun: job 93911 queued and waiting for resources
>> srun: job 93911 has been allocated resources
>>
>> and on the resulting shell on the compute node:
>>
>> $ /mnt/local/ollama/ollama help
>>
>> and the ollama help message appears as expected.
>>
>> However, on the second cluster:
>>
>> $ srun --mem=32G --pty /bin/bash
>> srun: job 3 queued and waiting for resources
>> srun: job 3 has been allocated resources
>>
>> and on the resulting shell on the compute node:
>>
>> $ /mnt/local/ollama/ollama help
>> fatal error: failed to reserve page summary memory runtime stack:
>> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>>   runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618
>> pc=0x4605dc runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>>   runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8
>> sp=0x7ffe6be32648 pc=0x456b7c
>> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>>   runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8
>> sp=0x7ffe6be326b8
>> pc=0x454565
>> runtime.(*mheap).init(0x127b47e0)
>>   runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8
>> pc=0x451885
>> runtime.mallocinit()
>>   runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720
>> pc=0x434f97
>> runtime.schedinit()
>>   runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758
>> pc=0x464397
>> runtime.rt0_go()
>>   runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8
>> sp=0x7ffe6be327d0 pc=0x49421c
>>
>>
>> If I ssh directly to the same node on that second cluster (skipping
>> Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
>> command, it works perfectly fine.

[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Jeffrey Layton via slurm-users
Chris,

Good to hear from you too (I need to post more often so I can see
everyone).

Thanks for the tip. I forgot about looking on the web. This is perfect.

Thanks!

Jeff


On Wed, May 15, 2024 at 11:05 AM Christopher Samuel via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi Jeff!
>
> On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote:
>
> > I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu
> > packages. I now want to install pyxis but it says I need the Slurm
> > sources. In Ubuntu 22.04, is there a package that has the source code?
> > How to download the sources I need from github?
>
> You shouldn't need Github, this should give you what you are after
> (especially the "Download slurm-wlm" section at the end):
>
> https://packages.ubuntu.com/source/jammy/slurm-wlm
>
> Hope that helps!
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Renfro, Michael via slurm-users
Forgot to add that Debian/Ubuntu packages are pretty much whatever version was 
stable at the time of the Debian/Ubuntu .0 release. They’ll backport security 
fixes to those older versions as needed, but they never change versions unless 
absolutely required.

The backports repositories may have looser rules, but not the core 
main/contrib/non-free repositories.

From: Renfro, Michael 
Date: Wednesday, May 15, 2024 at 10:19 AM
To: Jeffrey Layton , Lloyd Brown 
Cc: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Re: Location of Slurm source packages?
Debian/Ubuntu sources can always be found in at least two ways:


  1.  Pages like https://packages.ubuntu.com/jammy/slurm-wlm (see the .dsc, 
.orig.tar.gz, and .debian.tar.xz links there).
  2.  Commands like ‘apt-get source slurm-wlm’ (may require ‘dpkg-dev’ or other 
packages – probably easiest to install the ‘build-essential’ meta-package).

From: Jeffrey Layton via slurm-users 
Date: Wednesday, May 15, 2024 at 10:01 AM
To: Lloyd Brown 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: Location of Slurm source packages?

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Lloyd,

Good to hear from you! I was hoping to avoid the use of git but that may be the 
only way. The version is 21.08.5. I checked the "old" packages from SchedMD and 
they begin part way through 2024 so that won't work.

I'm very surprised Ubuntu let a package through without a source package for 
it. I'm hoping I'm not seeing the tree through the forest in finding that 
package.

Thanks for the help!

Jeff


On Wed, May 15, 2024 at 10:54 AM Lloyd Brown via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

Jeff,

I'm not sure what version is in the Ubuntu packages, as I don't think they're 
provided by SchedMD, and I'm having trouble finding the right one on 
packages.ubuntu.com.  Having said that, SchedMD is 
pretty good about using tags in their github repo 
(https://github.com/schedmd/slurm), to represent the releases.  For example, 
the "slurm-23-11-6-1" tag corresponds to release 23.11.6.  It's pretty 
straightforward to clone the repo, and do something like "git checkout -b 
MY_LOCAL_BRANCH_NAME TAG_NAME" to get the version you're after.

Lloyd


--

Lloyd Brown

HPC Systems Administrator

Office of Research Computing

Brigham Young University

http://rc.byu.edu
On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:
Good morning,

I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu packages. 
I now want to install pyxis but it says I need the Slurm sources. In Ubuntu 
22.04, is there a package that has the source code? How to download the sources 
I need from github?

Thanks!

Jeff


--
slurm-users mailing list -- 
slurm-users@lists.schedmd.com
To unsubscribe send an email to 
slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Renfro, Michael via slurm-users
Debian/Ubuntu sources can always be found in at least two ways:


  1.  Pages like https://packages.ubuntu.com/jammy/slurm-wlm (see the .dsc, 
.orig.tar.gz, and .debian.tar.xz links there).
  2.  Commands like ‘apt-get source slurm-wlm’ (may require ‘dpkg-dev’ or other 
packages – probably easiest to install the ‘build-essential’ meta-package); a short 
sketch follows below.
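A sketch of option 2 on Ubuntu 22.04, assuming the deb-src lines in
/etc/apt/sources.list are enabled:

sudo apt-get update
sudo apt-get install -y dpkg-dev build-essential   # tools for unpacking/building
apt-get source slurm-wlm    # fetches the .dsc, .orig.tar.gz and .debian.tar.xz and unpacks them
ls slurm-wlm-*/             # the unpacked source tree, e.g. for building pyxis against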

From: Jeffrey Layton via slurm-users 
Date: Wednesday, May 15, 2024 at 10:01 AM
To: Lloyd Brown 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: Location of Slurm source packages?

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Lloyd,

Good to hear from you! I was hoping to avoid the use of git but that may be the 
only way. The version is 21.08.5. I checked the "old" packages from SchedMD and 
they begin part way through 2024 so that won't work.

I'm very surprised Ubuntu let a package through without a source package for 
it. I'm hoping I'm not seeing the tree through the forest in finding that 
package.

Thanks for the help!

Jeff


On Wed, May 15, 2024 at 10:54 AM Lloyd Brown via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

Jeff,

I'm not sure what version is in the Ubuntu packages, as I don't think they're 
provided by SchedMD, and I'm having trouble finding the right one on 
packages.ubuntu.com.  Having said that, SchedMD is 
pretty good about using tags in their github repo 
(https://github.com/schedmd/slurm), to represent the releases.  For example, 
the "slurm-23-11-6-1" tag corresponds to release 23.11.6.  It's pretty 
straightforward to clone the repo, and do something like "git checkout -b 
MY_LOCAL_BRANCH_NAME TAG_NAME" to get the version you're after.

Lloyd


--

Lloyd Brown

HPC Systems Administrator

Office of Research Computing

Brigham Young University

http://rc.byu.edu
On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:
Good morning,

I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu packages. 
I now want to install pyxis but it says I need the Slurm sources. In Ubuntu 
22.04, is there a package that has the source code? How to download the sources 
I need from github?

Thanks!

Jeff


--
slurm-users mailing list -- 
slurm-users@lists.schedmd.com
To unsubscribe send an email to 
slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Lloyd Brown via slurm-users

Jeff,

Dang.  That's really old.  I'm not sure I would run one that old, to be 
honest.  Too many missing security fixes and added features.  It's never 
been that hard to do a 'git clone' and the normal configure/make/make 
install process with slurm.


Someone else made me aware of this, in case it's easier: 
https://slurm.schedmd.com/quickstart_admin.html#debuild
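For completeness, a rough sketch of the debuild route against the Ubuntu source
package (the SchedMD page above describes the equivalent for their own tarball;
deb-src entries and build dependencies are assumed to be available):

sudo apt-get install -y devscripts   # provides debuild
sudo apt-get build-dep slurm-wlm     # install the package's build dependencies
cd slurm-wlm-*/                      # tree unpacked by 'apt-get source slurm-wlm'
debuild -b -uc -us                   # build binary .deb packages without signing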


Lloyd


On 5/15/24 08:57, Jeffrey Layton wrote:

Lloyd,

Good to hear from you! I was hoping to avoid the use of git but that 
may be the only way. The version is 21.08.5. I checked the "old" 
packages from SchedMD and they begin part way through 2024 so that 
won't work.


I'm very surprised Ubuntu let a package through without a source 
package for it. I'm hoping I'm not seeing the tree through the forest 
in finding that package.


Thanks for the help!

Jeff


On Wed, May 15, 2024 at 10:54 AM Lloyd Brown via slurm-users 
 wrote:


Jeff,

I'm not sure what version is in the Ubuntu packages, as I don't
think they're provided by SchedMD, and I'm having trouble finding
the right one on packages.ubuntu.com
.  Having said that, SchedMD is pretty
good about using tags in their github repo
(https://github.com/schedmd/slurm), to represent the releases. 
For example, the "slurm-23-11-6-1" tag corresponds to release
23.11.6. It's pretty straightforward to clone the repo, and do
something like "git checkout -b MY_LOCAL_BRANCH_NAME TAG_NAME" to
get the version you're after.


Lloyd

-- 
Lloyd Brown

HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://rc.byu.edu

On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:

Good morning,

I have an Ubuntu 22.04 server where I installed Slurm from the
Ubuntu packages. I now want to install pyxis but it says I need
the Slurm sources. In Ubuntu 22.04, is there a package that has
the source code? How to download the sources I need from github?

Thanks!

Jeff



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com

To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


--
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://rc.byu.edu

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Christopher Samuel via slurm-users

Hi Jeff!

On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote:

I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu 
packages. I now want to install pyxis but it says I need the Slurm 
sources. In Ubuntu 22.04, is there a package that has the source code? 
How to download the sources I need from github?


You shouldn't need Github, this should give you what you are after 
(especially the "Download slurm-wlm" section at the end):


https://packages.ubuntu.com/source/jammy/slurm-wlm

Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Jeffrey Layton via slurm-users
Lloyd,

Good to hear from you! I was hoping to avoid the use of git but that may be
the only way. The version is 21.08.5. I checked the "old" packages from
SchedMD and they begin part way through 2024 so that won't work.

I'm very surprised Ubuntu let a package through without a source package
for it. I'm hoping I'm not seeing the tree through the forest in finding
that package.

Thanks for the help!

Jeff


On Wed, May 15, 2024 at 10:54 AM Lloyd Brown via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Jeff,
>
> I'm not sure what version is in the Ubuntu packages, as I don't think
> they're provided by SchedMD, and I'm having trouble finding the right one
> on packages.ubuntu.com.  Having said that, SchedMD is pretty good about
> using tags in their github repo (https://github.com/schedmd/slurm), to
> represent the releases.  For example, the "slurm-23-11-6-1" tag corresponds
> to release 23.11.6.  It's pretty straightforward to clone the repo, and do
> something like "git checkout -b MY_LOCAL_BRANCH_NAME TAG_NAME" to get the
> version you're after.
>
> Lloyd
>
> --
> Lloyd Brown
> HPC Systems Administrator
> Office of Research Computing
> Brigham Young University   http://rc.byu.edu
>
> On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:
>
> Good morning,
>
> I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu
> packages. I now want to install pyxis but it says I need the Slurm sources.
> In Ubuntu 22.04, is there a package that has the source code? How to
> download the sources I need from github?
>
> Thanks!
>
> Jeff
>
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Lloyd Brown via slurm-users

Jeff,

I'm not sure what version is in the Ubuntu packages, as I don't think 
they're provided by SchedMD, and I'm having trouble finding the right 
one on packages.ubuntu.com.  Having said that, SchedMD is pretty good 
about using tags in their github repo 
(https://github.com/schedmd/slurm), to represent the releases. For 
example, the "slurm-23-11-6-1" tag corresponds to release 23.11.6.  It's 
pretty straightforward to clone the repo, and do something like "git 
checkout -b MY_LOCAL_BRANCH_NAME TAG_NAME" to get the version you're after.
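As a concrete sketch for the 21.08.5 case mentioned elsewhere in the thread (the
exact tag name is an assumption based on the pattern described above):

git clone https://github.com/schedmd/slurm.git
cd slurm
git tag -l 'slurm-21-08-*'                      # list the tags for the 21.08 series
git checkout -b local-21.08.5 slurm-21-08-5-1   # assumed tag name for release 21.08.5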



Lloyd

--
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://rc.byu.edu

On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:

Good morning,

I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu 
packages. I now want to install pyxis but it says I need the Slurm 
sources. In Ubuntu 22.04, is there a package that has the source code? 
How to download the sources I need from github?


Thanks!

Jeff

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Location of Slurm source packages?

2024-05-15 Thread Jeffrey Layton via slurm-users
Good morning,

I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu
packages. I now want to install pyxis but it says I need the Slurm sources.
In Ubuntu 22.04, is there a package that has the source code? Or how do I
download the sources I need from GitHub?

Thanks!

Jeff

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun weirdness

2024-05-15 Thread Dj Merrill via slurm-users

Thank you Hermann and Tom!  That was it.

The new cluster has a virtual memory limit on the login host, and the 
old cluster did not.


It doesn't look like there is any way to set a default to override the 
srun behaviour of passing those resource limits to the shell, so I may 
consider removing those limits on the login host so folks don't have to 
manually specify this every time.
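(For what it's worth, if the login-host cap comes from pam_limits, lifting it might
look something like this sketch in /etc/security/limits.conf; where the limit is
actually set here is an assumption.)

# /etc/security/limits.conf on the login host
# "as" is the address-space (virtual memory) limit that was being propagated
*    soft    as    unlimited
*    hard    as    unlimited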


I really appreciate the help!

-Dj


On 5/15/24 07:20, greent10--- via slurm-users wrote:

Hi,

When we first migrated to Slurm from PBS, one of the strangest issues we hit was 
that ulimit settings are inherited from the submission host, which could explain 
the difference between ssh'ing into the machine (and the default ulimit being 
applied) and running a job via srun.

You could use:

srun --propagate=NONE --mem=32G --pty bash

I still find Slurm inheriting ulimit and environment variables from the 
submission host an odd default behaviour.

Tom

--
Thomas Green Senior Programmer
ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB
Tel: +44 (0)29 208 79269 Fax: +44 (0)29 208 70734
Email: green...@cardiff.ac.uk    Web: http://www.cardiff.ac.uk/arcca

Thomas Green Uwch Raglennydd
ARCCA, Adeilad Redwood, King Edward VII Avenue, Caerdydd, CF10 3NB
Ffôn: +44 (0)29 208 79269    Ffacs: +44 (0)29 208 70734
E-bost: green...@caerdydd.ac.uk  Gwefan: http://www.caerdydd.ac.uk/arcca

-Original Message-
From: Hermann Schwärzler via slurm-users 
Sent: Wednesday, May 15, 2024 9:45 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: srun weirdness

External email to Cardiff University - Take care when replying/opening 
attachments or links.
Nid ebost mewnol o Brifysgol Caerdydd yw hwn - Cymerwch ofal wrth ateb/agor 
atodiadau neu ddolenni.



Hi Dj,

could be a memory-limits related problem. What is the output of

   ulimit -l -m -v -s

in both interactive job-shells?

You are using cgroups-v1 now, right?
In that case what is the respective content of

   /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes

in both shells?

Regards,
Hermann


On 5/14/24 20:38, Dj Merrill via slurm-users wrote:

I'm running into a strange issue and I'm hoping another set of brains
looking at this might help.  I would appreciate any feedback.

I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8
on Rocky Linux 8.9 machines.  The second cluster is running Slurm
23.11.6 on Rocky Linux 9.4 machines.

This works perfectly fine on the first cluster:

$ srun --mem=32G --pty /bin/bash

srun: job 93911 queued and waiting for resources
srun: job 93911 has been allocated resources

and on the resulting shell on the compute node:

$ /mnt/local/ollama/ollama help

and the ollama help message appears as expected.

However, on the second cluster:

$ srun --mem=32G --pty /bin/bash
srun: job 3 queued and waiting for resources
srun: job 3 has been allocated resources

and on the resulting shell on the compute node:

$ /mnt/local/ollama/ollama help
fatal error: failed to reserve page summary memory runtime stack:
runtime.throw({0x1240c66?, 0x154fa39a1008?})
  runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618
pc=0x4605dc runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
  runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8
sp=0x7ffe6be32648 pc=0x456b7c
runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
  runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8
sp=0x7ffe6be326b8
pc=0x454565
runtime.(*mheap).init(0x127b47e0)
  runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8
pc=0x451885
runtime.mallocinit()
  runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720
pc=0x434f97
runtime.schedinit()
  runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758
pc=0x464397
runtime.rt0_go()
  runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8
sp=0x7ffe6be327d0 pc=0x49421c


If I ssh directly to the same node on that second cluster (skipping
Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
command, it works perfectly fine.


My first thought was that it might be related to cgroups.  I switched
the second cluster from cgroups v2 to v1 and tried again, no
difference.  I tried disabling cgroups on the second cluster by
removing all cgroups references in the slurm.conf file but that also
made no difference.


My guess is something changed with regards to srun between these two
Slurm versions, but I'm not sure what.

Any thoughts on what might be happening and/or a way to get this to
work on the second cluster?  Essentially I need a way to request an
interactive shell through Slurm that is associated with the requested
resources.  Should we be using something other than srun for this?


Thank you,

-Dj






--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Job Invalid Account

2024-05-15 Thread joao.damas--- via slurm-users
Hey,

I've just created a thread on something similar 
(https://lists.schedmd.com/mailman3/hyperkitty/list/slurm-users@lists.schedmd.com/message/MGV6YUIIIPFVUSZPBBXS3YG6BW5K553M/),
 but we have an extra "error" line. Maybe it's related?

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] _refresh_assoc_mgr_qos_list: no new list given back keeping cached one

2024-05-15 Thread joao.damas--- via slurm-users
Hi all,

We are doing a simple setup for a Slurm cluster (version 23.11.6). We follow 
the documentation and we are trying a setup still without accounting or 
slurmdbd. The slurm.conf is really simple:
```
ClusterName=Develop
SlurmctldHost=head

# Slurm configuration
AuthType=auth/munge
CryptoType=crypto/munge
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld

# Nodes
NodeName=worker1 CoresPerSocket=2 Sockets=1 ThreadsPerCore=1
NodeName=worker2 CoresPerSocket=2 Sockets=1 ThreadsPerCore=1

# Partitions
PartitionName=develop Default=YES MaxTime=UNLIMITED Nodes="worker1,worker2"
```

When running a simple `srun sleep 10`, all works well and the log file shows:

[2024-05-15T12:34:12.741] sched: _slurm_rpc_allocate_resources JobId=1 
NodeList=worker1 usec=549
[2024-05-15T12:34:22.775] _job_complete: JobId=1 WEXITSTATUS 0
[2024-05-15T12:34:22.775] _job_complete: JobId=1 done

But when creating a script with the same sleep command and submitting it using 
`sbatch test.sh`, the log shows:

[2024-05-15T12:35:39.916] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=1 
usec=368
[2024-05-15T12:35:40.000] error: _refresh_assoc_mgr_qos_list: no new list given 
back keeping cached one.
[2024-05-15T12:35:40.000] sched: JobId=2 has invalid account
[2024-05-15T12:35:40.145] sched/backfill: _start_job: Started JobId=2 in 
develop on worker1
[2024-05-15T12:35:50.172] _job_complete: JobId=2 WEXITSTATUS 0
[2024-05-15T12:35:50.172] _job_complete: JobId=2 done

We have the same account with matching UID and GID, as stated in the documentation. 
Looking at the function that appears to emit that error 
(https://github.com/SchedMD/slurm/blob/e9f28ede27795f525e62f998cb2d40931d884e8b/src/common/assoc_mgr.c#L1952),
it seems some accounting setup is expected? We do not have slurmdbd 
set up, and the documentation states we should test basic functionality before 
implementing that daemon.
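(A quick way to see which account Slurm recorded for the job is something like the
sketch below; whether that explains the message above is left open.)

scontrol show job 2 | grep -i account   # shows the "Account=..." field for the job
squeue -o "%i %u %a %T"                 # job id, user, account and state for pending/running jobs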

Any tips? Thanks in advance.
João

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Arnuld via slurm-users
Hi Ward,

Thanks for replying. I tried these but the error is exactly the same
(everything
under "/shared" has permissions 777 and owned by "nobody:nogroup"):

/etc/slurm/slurm.conf
JobContainerType=job_container/tmpfs
Prolog=/shared/SlurmScripts/prejob
PrologFlags=contain

/etc/slurm/job_container.conf
#
AutoBasePath=true
BasePath=/shared/BasePath

/shared/SlurmScripts/prejob
#!/usr/bin/env bash
MY_XDG_RUNTIME_DIR=/shared/SlurmXDG
mkdir -p $MY_XDG_RUNTIME_DIR
echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"



On Wed, May 15, 2024 at 2:28 PM Ward Poelmans via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi,
>
> This is systemd, not slurm. We've also seen it being created and removed.
> As far as I understood, it is something about the session cleanup that systemd does.
> We've worked around by adding this to the prolog:
>
> MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
> mkdir -p $MY_XDG_RUNTIME_DIR
> echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"
>
> (in combination with private tmpfs per job).
>
> Ward
>
> On 15/05/2024 10:14, Arnuld via slurm-users wrote:
> > I am using the latest slurm. It  runs fine for scripts. But if I give it
> a container then it kills it as soon as I submit the job. Is slurm cleaning
> up the $XDG_RUNTIME_DIR before it should?  This is the log:
> >
> > [2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0
> TaskId=-1
> > [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
> argv[0]=/bin/sh
> > [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
> argv[1]=-c
> > [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
> argv[2]=crun --rootless=true --root=/run/user/1000/ state
> slurm2.acog.90.0.-1
> > [2024-05-15T08:00:35.167] [90.0] debug:  _get_container_state:
> RunTimeQuery rc:256 output:error opening file
> `/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
> >
> > [2024-05-15T08:00:35.167] [90.0] error: _get_container_state:
> RunTimeQuery failed rc:256 output:error opening file
> `/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
> >
> > [2024-05-15T08:00:35.167] [90.0] debug:  container already dead
> > [2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0
> pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
> > [2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0
> TaskId=0
> > [2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1
> pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
> > [2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step
> (rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
> > [2024-05-15T08:00:35.275] debug3: in the service_connection
> > [2024-05-15T08:00:35.278] debug2: Start processing RPC:
> REQUEST_TERMINATE_JOB
> > [2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> > [2024-05-15T08:00:35.278] debug:  _rpc_terminate_job: uid = 64030
> JobId=90
> > [2024-05-15T08:00:35.278] debug:  credential for job 90 revoked
> >
> >
> >
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun weirdness

2024-05-15 Thread greent10--- via slurm-users
Hi,

When we first migrated to Slurm from PBS, one of the strangest issues we hit was 
that ulimit settings are inherited from the submission host, which could explain 
the difference between ssh'ing into the machine (and the default ulimit being 
applied) and running a job via srun.

You could use:

srun --propagate=NONE --mem=32G --pty bash

I still find Slurm inheriting ulimit and environment variables from the 
submission host an odd default behaviour.
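A quick way to see the effect, as a sketch run from the submission host:

ulimit -v                                              # limit on the login/submission host
srun --mem=32G bash -c 'ulimit -v'                     # default: submission-host limits are propagated
srun --propagate=NONE --mem=32G bash -c 'ulimit -v'    # node defaults apply instead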

Tom

--
Thomas Green Senior Programmer
ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB
Tel: +44 (0)29 208 79269 Fax: +44 (0)29 208 70734
Email: green...@cardiff.ac.uk    Web: http://www.cardiff.ac.uk/arcca

Thomas Green Uwch Raglennydd
ARCCA, Adeilad Redwood, King Edward VII Avenue, Caerdydd, CF10 3NB
Ffôn: +44 (0)29 208 79269    Ffacs: +44 (0)29 208 70734
E-bost: green...@caerdydd.ac.uk  Gwefan: http://www.caerdydd.ac.uk/arcca

-Original Message-
From: Hermann Schwärzler via slurm-users  
Sent: Wednesday, May 15, 2024 9:45 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: srun weirdness

External email to Cardiff University - Take care when replying/opening 
attachments or links.
Nid ebost mewnol o Brifysgol Caerdydd yw hwn - Cymerwch ofal wrth ateb/agor 
atodiadau neu ddolenni.



Hi Dj,

could be a memory-limits related problem. What is the output of

  ulimit -l -m -v -s

in both interactive job-shells?

You are using cgroups-v1 now, right?
In that case what is the respective content of

  /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes

in both shells?

Regards,
> Hermann


On 5/14/24 20:38, Dj Merrill via slurm-users wrote:
> I'm running into a strange issue and I'm hoping another set of brains 
> looking at this might help.  I would appreciate any feedback.
>
> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8 
> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
> 23.11.6 on Rocky Linux 9.4 machines.
>
> This works perfectly fine on the first cluster:
>
> $ srun --mem=32G --pty /bin/bash
>
> srun: job 93911 queued and waiting for resources
> srun: job 93911 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
>
> and the ollama help message appears as expected.
>
> However, on the second cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 3 queued and waiting for resources
> srun: job 3 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
> fatal error: failed to reserve page summary memory runtime stack:
> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>  runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 
> pc=0x4605dc runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>  runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8
> sp=0x7ffe6be32648 pc=0x456b7c
> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>  runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 
> sp=0x7ffe6be326b8
> pc=0x454565
> runtime.(*mheap).init(0x127b47e0)
>  runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8
> pc=0x451885
> runtime.mallocinit()
>  runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720
> pc=0x434f97
> runtime.schedinit()
>  runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758
> pc=0x464397
> runtime.rt0_go()
>  runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 
> sp=0x7ffe6be327d0 pc=0x49421c
>
>
> If I ssh directly to the same node on that second cluster (skipping 
> Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
> command, it works perfectly fine.
>
>
> My first thought was that it might be related to cgroups.  I switched 
> the second cluster from cgroups v2 to v1 and tried again, no 
> difference.  I tried disabling cgroups on the second cluster by 
> removing all cgroups references in the slurm.conf file but that also 
> made no difference.
>
>
> My guess is something changed with regards to srun between these two 
> Slurm versions, but I'm not sure what.
>
> Any thoughts on what might be happening and/or a way to get this to 
> work on the second cluster?  Essentially I need a way to request an 
> interactive shell through Slurm that is associated with the requested 
> resources.  Should we be using something other than srun for this?
>
>
> Thank you,
>
> -Dj
>
>
>

--
slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send 
an email to slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Ward Poelmans via slurm-users

Hi,

This is systemd, not Slurm. We've also seen it being created and removed. As 
far as I understood, it is something about the session cleanup that systemd does. We've 
worked around it by adding this to the prolog:

MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
mkdir -p $MY_XDG_RUNTIME_DIR
echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"

(in combination with private tmpfs per job).

Ward

On 15/05/2024 10:14, Arnuld via slurm-users wrote:

I am using the latest slurm. It  runs fine for scripts. But if I give it a 
container then it kills it as soon as I submit the job. Is slurm cleaning up 
the $XDG_RUNTIME_DIR before it should?  This is the log:

[2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0 
TaskId=-1
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command 
argv[0]=/bin/sh
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command 
argv[1]=-c
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command 
argv[2]=crun --rootless=true --root=/run/user/1000/ state slurm2.acog.90.0.-1
[2024-05-15T08:00:35.167] [90.0] debug:  _get_container_state: RunTimeQuery 
rc:256 output:error opening file `/run/user/1000/slurm2.acog.90.0.-1/status`: 
No such file or directory

[2024-05-15T08:00:35.167] [90.0] error: _get_container_state: RunTimeQuery 
failed rc:256 output:error opening file 
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory

[2024-05-15T08:00:35.167] [90.0] debug:  container already dead
[2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0 
pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
[2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0 
TaskId=0
[2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1 
pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
[2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step 
(rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
[2024-05-15T08:00:35.275] debug3: in the service_connection
[2024-05-15T08:00:35.278] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug:  _rpc_terminate_job: uid = 64030 JobId=90
[2024-05-15T08:00:35.278] debug:  credential for job 90 revoked








-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun weirdness

2024-05-15 Thread Hermann Schwärzler via slurm-users

Hi Dj,

could be a memory-limits related problem. What is the output of

 ulimit -l -m -v -s

in both interactive job-shells?

You are using cgroups-v1 now, right?
In that case what is the respective content of

 /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes

in both shells?

Regards,
Hermann


On 5/14/24 20:38, Dj Merrill via slurm-users wrote:
I'm running into a strange issue and I'm hoping another set of brains 
looking at this might help.  I would appreciate any feedback.


I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8 
on Rocky Linux 8.9 machines.  The second cluster is running Slurm 
23.11.6 on Rocky Linux 9.4 machines.


This works perfectly fine on the first cluster:

$ srun --mem=32G --pty /bin/bash

srun: job 93911 queued and waiting for resources
srun: job 93911 has been allocated resources

and on the resulting shell on the compute node:

$ /mnt/local/ollama/ollama help

and the ollama help message appears as expected.

However, on the second cluster:

$ srun --mem=32G --pty /bin/bash
srun: job 3 queued and waiting for resources
srun: job 3 has been allocated resources

and on the resulting shell on the compute node:

$ /mnt/local/ollama/ollama help
fatal error: failed to reserve page summary memory
runtime stack:
runtime.throw({0x1240c66?, 0x154fa39a1008?})
     runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 
pc=0x4605dc

runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
     runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 
sp=0x7ffe6be32648 pc=0x456b7c

runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
     runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 
pc=0x454565

runtime.(*mheap).init(0x127b47e0)
     runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 
pc=0x451885

runtime.mallocinit()
     runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 
pc=0x434f97

runtime.schedinit()
     runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 
pc=0x464397

runtime.rt0_go()
     runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 
pc=0x49421c



If I ssh directly to the same node on that second cluster (skipping 
Slurm entirely), and run the same "/mnt/local/ollama/ollama help" 
command, it works perfectly fine.



My first thought was that it might be related to cgroups.  I switched 
the second cluster from cgroups v2 to v1 and tried again, no 
difference.  I tried disabling cgroups on the second cluster by removing 
all cgroups references in the slurm.conf file but that also made no 
difference.



My guess is something changed with regards to srun between these two 
Slurm versions, but I'm not sure what.


Any thoughts on what might be happening and/or a way to get this to work 
on the second cluster?  Essentially I need a way to request an 
interactive shell through Slurm that is associated with the requested 
resources.  Should we be using something other than srun for this?



Thank you,

-Dj





--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Arnuld via slurm-users
I am using the latest Slurm. It runs fine for scripts, but if I give it a
container it kills it as soon as I submit the job. Is Slurm cleaning
up $XDG_RUNTIME_DIR before it should?  This is the log:

[2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=-1
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[0]=/bin/sh
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[1]=-c
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[2]=crun --rootless=true --root=/run/user/1000/ state
slurm2.acog.90.0.-1
[2024-05-15T08:00:35.167] [90.0] debug:  _get_container_state: RunTimeQuery
rc:256 output:error opening file
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory

[2024-05-15T08:00:35.167] [90.0] error: _get_container_state: RunTimeQuery
failed rc:256 output:error opening file
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory

[2024-05-15T08:00:35.167] [90.0] debug:  container already dead
[2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0
pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
[2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=0
[2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1
pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
[2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step
(rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
[2024-05-15T08:00:35.275] debug3: in the service_connection
[2024-05-15T08:00:35.278] debug2: Start processing RPC:
REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug:  _rpc_terminate_job: uid = 64030 JobId=90
[2024-05-15T08:00:35.278] debug:  credential for job 90 revoked

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Best practice for jobs resuming from suspended state

2024-05-15 Thread Paul Jones via slurm-users
Hi,

We use PreemptMode and PriorityTier within Slurm to suspend low priority jobs 
when more urgent work needs to be done. This generally works well, but on 
occasion resumed jobs fail to restart - which is to say Slurm sets the job 
status to running but the actual code doesn't recover from being suspended.

Technically everything is working as expected, but I wondered if there was any 
best practice to pass onto users about how to cope with this state? Obviously 
not a direct Slurm question, but wondered if others had experience with this 
and any advice on how best to limit the impact?
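(For context, the kind of configuration in play here is along these lines; the
partition names and values are illustrative assumptions.)

# slurm.conf sketch of suspend/resume preemption via partition priority
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
PartitionName=low  Nodes=ALL PriorityTier=1  Default=YES MaxTime=UNLIMITED
PartitionName=high Nodes=ALL PriorityTier=10 MaxTime=UNLIMITED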

Thanks,
Paul


--

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com