Source: slurm-wlm
Version: 24.11.5-4
Severity: wishlist
Tags: patch

Dear Maintainer,

Please consider applying the attached patch to the Debian slurm-wlm
package. It adds a new SlurmctldParameters option,
periodic_check_interval=#, that makes the slurmctld periodic background
loop interval configurable. The default is the existing PERIODIC_TIMEOUT
value (30s), so behavior is unchanged unless the option is set.
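
For illustration, a site opting in would add a line like the one below to slurm.conf. This is a minimal sketch: the value 2 is simply the one used in the A/B test later in this report, and the option exists only with the attached patch applied.

```
# slurm.conf fragment (requires the attached patch)
# Run the slurmctld periodic background checks every 2 seconds instead of 30.
SlurmctldParameters=periodic_check_interval=2
```

Sites that already set other SlurmctldParameters values would append this as one more entry in the same comma-separated list.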

Motivation
----------
The slurmctld background thread uses a hard-coded PERIODIC_TIMEOUT (30s)
for the periodic time-limit / reservation / node-timer checks. After a
suspended or powered-down node resumes and registers, the job allocated
to it stays in CONFIGURING until the next periodic pass moves it to
RUNNING. The added delay is therefore anywhere from 0 to 30s, depending
on where in the interval the registration lands.

For on-prem clusters that suspend idle nodes and resume them via
wake-on-LAN, the wake/boot/register path is already fast; the remaining
delay is purely this controller-side poll. Making it tunable lets such
sites bring post-registration scheduling latency down to match the fast
wake.

Live A/B validation
-------------------
Tested on a 2-node on-prem Debian 13 cluster with a locally rebuilt
package (24.11.5-4+periodiccheck1), same node and workflow:

  periodic_check_interval=2   registration-to-RUNNING = 1s
  periodic_check_interval=30  registration-to-RUNNING = 14s

(14s rather than 30s because the loop is not anchored to registration;
the penalty is time-until-next-pass, bounded by the interval.)
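
The "time-until-next-pass" reasoning can be sketched with a small model. This is a simplification, not the slurmctld code: it assumes passes fire on a fixed grid (phase, phase + interval, ...) and that registration arrives at a uniformly random time. The names poll_delay and mean_delay are hypothetical helpers for this illustration only.

```python
import random


def poll_delay(event_time: float, interval: float, phase: float = 0.0) -> float:
    """Seconds from an event until the next periodic pass.

    Assumes passes fire at phase, phase + interval, phase + 2 * interval, ...
    independent of when the event (node registration) happens.
    """
    elapsed = (event_time - phase) % interval
    # An event landing exactly on a pass is handled by that pass.
    return interval - elapsed if elapsed else 0.0


def mean_delay(interval: float, trials: int = 100_000, seed: int = 0) -> float:
    """Average delay over registrations arriving at uniformly random times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += poll_delay(rng.uniform(0, 1000 * interval), interval)
    return total / trials


if __name__ == "__main__":
    for interval in (2, 30):
        print(f"interval={interval}s: mean delay about {mean_delay(interval):.1f}s, "
              f"worst case {interval}s")
```

Under these assumptions the mean delay is about half the interval, which is in line with the 1s and 14s figures measured above.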

Controller log excerpt for the 30s case:
  23:20:48 Node nyc1 now responding
  23:21:03 job_time_limit: Configuration for JobId=125 complete

These figures isolate the Slurm controller overhead only; HW+OS
wake/boot time is separate and unaffected by this option.

Patch details
-------------
- New helper get_periodic_check_interval() in src/slurmctld/controller.c
  (caches parsed value until slurm.conf last_update changes; rejects 0
  and INFINITE, falls back to default).
- Three call sites switched from the PERIODIC_TIMEOUT macro to the
  helper: _slurmctld_background() in controller.c, job_time_limit() and
  send_job_warn_signal() in job_mgr.c.
- Docs: doc/man/man5/slurm.conf.5, doc/html/power_save.shtml.
- Test: testsuite/python/tests/test_141_1.py (new test exercising the
  reduced-interval cloud-node resume path).
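
The validation rule described for the helper (reject 0 and INFINITE, fall back to the default) can be modelled as follows. This is a sketch in Python rather than the C in the patch, and parse_periodic_check_interval is a hypothetical name. It omits the last_update caching and the error() log line, and its option matching is a plain case-sensitive prefix match, which may differ from what conf_get_opt_str() does.

```python
from typing import Optional

PERIODIC_TIMEOUT = 30   # default interval in seconds, per this report
INFINITE16 = 0xFFFF     # Slurm's 16-bit "infinite" sentinel


def parse_periodic_check_interval(slurmctld_params: Optional[str]) -> int:
    """Return the configured interval, or the default if unset or invalid."""
    key = "periodic_check_interval="
    for opt in (slurmctld_params or "").split(","):
        opt = opt.strip()
        if not opt.startswith(key):
            continue
        try:
            value = int(opt[len(key):])
        except ValueError:
            return PERIODIC_TIMEOUT  # not a number
        # Reject 0, negatives, the INFINITE sentinel, and anything that
        # does not fit in an unsigned 16-bit integer.
        if value <= 0 or value >= INFINITE16:
            return PERIODIC_TIMEOUT
        return value
    return PERIODIC_TIMEOUT  # option not present


if __name__ == "__main__":
    print(parse_periodic_check_interval("idle_on_node_suspend,periodic_check_interval=2"))
    print(parse_periodic_check_interval("periodic_check_interval=0"))
```

Falling back to the default rather than refusing to start means a mistyped value cannot leave the controller without its periodic checks.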

The attached patch is a quilt-format patch with a DEP-3 header; it
applies cleanly to the slurm-wlm 24.11.5-4 source and I have built and
installed the resulting packages on Debian 13.

Upstream
--------
Also submitted upstream to SchedMD:
  https://support.schedmd.com/show_bug.cgi?id=25294

-- System Information:
Debian Release: 13.4
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)
Kernel: Linux 6.12.85+deb13-amd64 (SMP w/24 CPU threads; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Description: add periodic_check_interval SlurmctldParameters option
Author: Dmitri Khokhlov <[email protected]>
Forwarded: https://support.schedmd.com/show_bug.cgi?id=25294
Last-Update: 2026-05-27

--- a/doc/html/power_save.shtml
+++ b/doc/html/power_save.shtml
@@ -296,6 +296,17 @@
         State=CLOUD nodes, the default is 90.</p>
       </dd>

+      <dt id="periodic_check_interval"><b>periodic_check_interval=#</b><a class=
+      "slurm_link" href="#periodic_check_interval"></a>
+      </dt>
+
+      <dd>
+        <p>How often slurmctld runs periodic background checks, including job
+        time-limit handling, reservation checks, and node timer checks. Lower
+        values can reduce the delay before jobs progress after nodes resume
+        and register. Default is 30 seconds.</p>
+      </dd>
+
       <dt id="power_save_interval"><b>power_save_interval=#</b><a class=
       "slurm_link" href="#power_save_interval"></a>
       </dt>
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -2585,7 +2585,7 @@
 be called before and/or after execution of each task spawned as
 part of a user's job step. Default location is "plugstack.conf"
 in the same directory as the system slurm.conf. For more information
-on SPANK plugins, see the \fBspank\fR(7) manual.
+on SPANK plugins, see the \fBspank\fR(8) manual.
 .IP

 .TP
@@ -4981,6 +4981,14 @@
 .IP

 .TP
+\fBperiodic_check_interval\fR=\#
+How often slurmctld runs periodic background checks, including job time-limit
+handling, reservation checks, and node timer checks. Lower values can reduce
+the delay before jobs progress after nodes resume and register. Default is 30
+seconds.
+.IP
+
+.TP
 \fBpower_save_interval\fR
 How often the power_save thread looks to resume and suspend nodes. The
 power_save thread will do work sooner if there are node state changes. Default
@@ -5770,7 +5778,7 @@

 .TP
 \fBfm_url\fR
-If set, slurm will use the configured URL to interface with the fabric
+If set, slurm will use the configured URL to interface with the fabric
 manager to enable Slingshot hardware collectives.
 Note \fBenable_stepmgr\fR needs to be set for hardware collectives to run.
 .IP
@@ -8318,4 +8326,4 @@
 \fBgetrlimit\fR(2), \fBgres.conf\fR(5), \fBgroup\fR(5), \fBhostname\fR(1),
 \fBscontrol\fR(1), \fBslurmctld\fR(8), \fBslurmd\fR(8),
 \fBslurmdbd\fR(8), \fBslurmdbd.conf\fR(5), \fBsrun\fR(1),
-\fBspank\fR(7), \fBsyslog\fR(3), \fBtopology.conf\fR(5)
+\fBspank\fR(8), \fBsyslog\fR(3), \fBtopology.conf\fR(5)
--- a/src/slurmctld/controller.c
+++ b/src/slurmctld/controller.c
@@ -70,6 +70,7 @@
 #include "src/common/log.h"
 #include "src/common/macros.h"
 #include "src/common/pack.h"
+#include "src/common/parse_value.h"
 #include "src/common/port_mgr.h"
 #include "src/common/proc_args.h"
 #include "src/common/read_config.h"
@@ -559,6 +560,44 @@
        }
 }

+static void _close_acct_storage_conn(void)
+{
+       if (acct_db_conn)
+               acct_storage_g_close_connection(&acct_db_conn);
+
+       acct_storage_g_fini();
+       slurm_persist_conn_recv_server_fini();
+}
+
+extern uint16_t get_periodic_check_interval(void)
+{
+       static time_t config_update = (time_t) -1;
+       static uint16_t periodic_check_interval = PERIODIC_TIMEOUT;
+       char *tmp_ptr;
+       uint16_t tmp_interval = PERIODIC_TIMEOUT;
+
+       if (config_update == slurm_conf.last_update)
+               return periodic_check_interval;
+
+       if ((tmp_ptr = conf_get_opt_str(slurm_conf.slurmctld_params,
+                                       "periodic_check_interval="))) {
+               if (s_p_handle_uint16(&tmp_interval,
+                                     "periodic_check_interval",
+                                     tmp_ptr) ||
+                   !tmp_interval || (tmp_interval == INFINITE16)) {
+                       error("SlurmctldParameters option periodic_check_interval=%s "
+                             "is invalid, using default %u",
+                             tmp_ptr, PERIODIC_TIMEOUT);
+                       tmp_interval = PERIODIC_TIMEOUT;
+               }
+               xfree(tmp_ptr);
+       }
+
+       periodic_check_interval = tmp_interval;
+       config_update = slurm_conf.last_update;
+
+       return periodic_check_interval;
+}
 /* main - slurmctld main function, start various threads and process RPCs */
 int main(int argc, char **argv)
 {
@@ -2492,7 +2531,8 @@

                validate_all_reservations(true);

-               if (difftime(now, last_timelimit_time) >= PERIODIC_TIMEOUT) {
+               if (difftime(now, last_timelimit_time) >=
+                   get_periodic_check_interval()) {
                        lock_slurmctld(job_write_lock);
                        now = time(NULL);
                        last_timelimit_time = now;
--- a/src/slurmctld/job_mgr.c
+++ b/src/slurmctld/job_mgr.c
@@ -9306,7 +9306,8 @@
                }

                /* Give srun command warning message about pending timeout */
-               if (job_ptr->end_time <= (now + PERIODIC_TIMEOUT * 2))
+               if (job_ptr->end_time <=
+                   (now + get_periodic_check_interval() * 2))
                        srun_timeout (job_ptr);

                /*
@@ -18997,7 +18998,8 @@
            !(job_ptr->warn_flags & WARN_SENT) &&
            (ignore_time ||
             (job_ptr->warn_time &&
-             ((job_ptr->warn_time + PERIODIC_TIMEOUT + time(NULL)) >=
+             ((job_ptr->warn_time + get_periodic_check_interval() +
+               time(NULL)) >=
               job_ptr->end_time)))) {
                /*
                 * If --signal B option was not specified,
--- a/src/slurmctld/slurmctld.h
+++ b/src/slurmctld/slurmctld.h
@@ -2149,6 +2149,7 @@
  * resume_after - Resume a down|drain node after resume_after time.
  */
 extern void check_node_timers(void);
+extern uint16_t get_periodic_check_interval(void);

 /*
  * Send warning signal to job before end time.
--- a/testsuite/python/tests/test_141_1.py
+++ b/testsuite/python/tests/test_141_1.py
@@ -12,6 +12,7 @@
 suspend_time = 10
 suspend_timeout = 10
 resume_timeout = 10
+periodic_check_interval = 2


 @pytest.fixture(scope="module", autouse=True)
@@ -37,6 +38,9 @@
     # Mark nodes as IDLE, regardless of current state, when suspending nodes with
     # SuspendProgram so that nodes will be eligible to be resumed at a later time
     atf.require_config_parameter_includes("SlurmctldParameters", "idle_on_node_suspend")
+    atf.require_config_parameter_includes(
+        "SlurmctldParameters", f"periodic_check_interval={periodic_check_interval}"
+    )

     # Register the cloud node in slurm.conf
     atf.require_config_parameter(
@@ -106,6 +110,40 @@


 # Tests
+def test_periodic_check_interval():
+    """Test periodic_check_interval advances a CONFIGURING job after node registration."""
+    job_id = atf.submit_job_sbatch("-p cloud1 --wrap 'srun sleep 10'", fatal=True)
+    atf.wait_for_node_state(f"{node_prefix}1", "ALLOCATED", timeout=5, fatal=True)
+    atf.wait_for_node_state(f"{node_prefix}1", "POWERING_UP", fatal=True)
+    assert "CONFIGURING" == atf.get_job_parameter(
+        job_id, "JobState", default="NOT_FOUND", quiet=True
+    ), "Submitted job should be in CONFIGURING state while its ALLOCATED cloud node is POWERING_UP"
+
+    # TODO: Wait 2 seconds to avoid race condition between slurmd and slurmctld
+    #       Remove once bug 16459 is fixed.
+    time.sleep(2)
+
+    atf.run_command(
+        f"{atf.properties['slurm-sbin-dir']}/slurmd -b -N {node_prefix}1 --conf 'feature=f1'",
+        fatal=True,
+        user="root",
+    )
+
+    atf.wait_for_node_state(
+        f"{node_prefix}1",
+        "POWERING_UP",
+        reverse=True,
+        timeout=resume_timeout + 5,
+        fatal=True,
+    )
+
+    assert atf.wait_for_job_state(
+        job_id,
+        "RUNNING",
+        timeout=periodic_check_interval + 5,
+    )
+
+
 # Test state cycle of cloud nodes: POWERED_DOWN, POWERING_UP, IDLE,
 # POWERING_DOWN, POWERED_DOWN
 def test_cloud_state_cycle():
