Source: slurm-wlm
Version: 24.11.5-4
Severity: wishlist
Tags: patch

Dear Maintainer,

Please consider applying the attached patch to the Debian slurm-wlm package.
It adds a new SlurmctldParameters option, periodic_check_interval=#, that
makes the slurmctld periodic background loop interval configurable. The
default is the existing PERIODIC_TIMEOUT value (30s), so behavior is
unchanged unless the option is set.

Motivation
----------

The slurmctld background thread uses a hard-coded PERIODIC_TIMEOUT (30s) for
the periodic timelimit / reservation / node-timer checks. After a suspended
or powered-down node resumes and registers, the queued job waits for the next
periodic pass before transitioning to RUNNING. The delay is therefore up to
one interval (0..30s depending on timing).

For on-prem clusters that suspend idle nodes and resume them via wake-on-LAN,
the wake/boot/register path is already fast; the remaining delay is purely
this controller-side poll. Making it tunable lets such sites bring
post-registration scheduling latency down to match the fast wake.

Live A/B validation
-------------------

Tested on a 2-node on-prem Debian 13 cluster with a locally rebuilt package
(24.11.5-4+periodiccheck1), same node and workflow:

  periodic_check_interval=2    registration-to-RUNNING = 1s
  periodic_check_interval=30   registration-to-RUNNING = 14s

(14s rather than 30s because the loop is not anchored to registration; the
penalty is time-until-next-pass, bounded by the interval.)

Controller log excerpt for the 30s case:

  23:20:48 Node nyc1 now responding
  23:21:03 job_time_limit: Configuration for JobId=125 complete

These figures isolate the Slurm controller overhead only; HW+OS wake/boot
time is separate and unaffected by this option.

Patch details
-------------

- New helper get_periodic_check_interval() in src/slurmctld/controller.c
  (caches parsed value until slurm.conf last_update changes; rejects 0 and
  INFINITE, falls back to default).
- Three call sites switched from the PERIODIC_TIMEOUT literal:
  _slurmctld_background() in controller.c, job_time_limit() and
  send_job_warn_signal() in job_mgr.c.
- Docs: doc/man/man5/slurm.conf.5, doc/html/power_save.shtml.
- Test: testsuite/python/tests/test_141_1.py (new test exercising the
  reduced-interval cloud-node resume path).

The attached patch is a quilt-format patch with a DEP-3 header; it applies
cleanly to the slurm-wlm 24.11.5-4 source and I have built and installed the
resulting packages on Debian 13.

Upstream
--------

Also submitted upstream to SchedMD:
https://support.schedmd.com/show_bug.cgi?id=25294

-- System Information:
Debian Release: 13.4
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.12.85+deb13-amd64 (SMP w/24 CPU threads; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

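
For reference, the "time-until-next-pass" reasoning in the validation notes
can be sketched as a small model. This is only an illustration, not code from
the patch: it assumes the periodic pass fires on a fixed grid (phase,
phase + interval, ...) and that registration lands at a uniformly random
point within an interval, which simplifies how the real background loop
behaves. The function names and the phase parameter are hypothetical.

```python
import random


def delay_until_next_pass(registration_time, interval, phase=0.0):
    """Seconds from node registration until the next periodic pass.

    Assumption: passes fire at phase, phase + interval, phase + 2*interval, ...
    A registration that lands exactly on a pass waits 0 seconds.
    """
    elapsed = (registration_time - phase) % interval
    return 0.0 if elapsed == 0 else interval - elapsed


def mean_delay(interval, samples=100_000, seed=0):
    """Average delay over uniformly random registration times (simulation)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        registration_time = rng.uniform(0.0, 1000.0 * interval)
        total += delay_until_next_pass(registration_time, interval)
    return total / samples


if __name__ == "__main__":
    for interval in (2, 30):
        print(f"interval={interval}s mean delay ~ {mean_delay(interval):.1f}s")
```

Under these assumptions the delay is never more than one interval and averages
about half of it, which is why lowering the interval shrinks the
post-registration wait roughly in proportion.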
Description: add periodic_check_interval SlurmctldParameters option
Author: Dmitri Khokhlov <[email protected]>
Forwarded: https://support.schedmd.com/show_bug.cgi?id=25294
Last-Update: 2026-05-27

--- a/doc/html/power_save.shtml
+++ b/doc/html/power_save.shtml
@@ -296,6 +296,17 @@
 		State=CLOUD nodes, the default is 90.</p>
 	</dd>
 
+	<dt id="periodic_check_interval"><b>periodic_check_interval=#</b><a class=
+	"slurm_link" href="#periodic_check_interval"></a>
+	</dt>
+
+	<dd>
+		<p>How often slurmctld runs periodic background checks, including job
+		time-limit handling, reservation checks, and node timer checks. Lower
+		values can reduce the delay before jobs progress after nodes resume
+		and register. Default is 30 seconds.</p>
+	</dd>
+
 	<dt id="power_save_interval"><b>power_save_interval=#</b><a class=
 	"slurm_link" href="#power_save_interval"></a>
 	</dt>
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -2585,7 +2585,7 @@
 be called before and/or after execution of each task spawned as
 part of a user's job step. Default location is "plugstack.conf"
 in the same directory as the system slurm.conf. For more information
-on SPANK plugins, see the \fBspank\fR(7) manual.
+on SPANK plugins, see the \fBspank\fR(8) manual.
 .IP
 
 .TP
@@ -4981,6 +4981,14 @@
 .IP
 
 .TP
+\fBperiodic_check_interval\fR=\#
+How often slurmctld runs periodic background checks, including job time-limit
+handling, reservation checks, and node timer checks. Lower values can reduce
+the delay before jobs progress after nodes resume and register. Default is 30
+seconds.
+.IP
+
+.TP
 \fBpower_save_interval\fR
 How often the power_save thread looks to resume and suspend nodes. The
 power_save thread will do work sooner if there are node state changes. Default
@@ -5770,7 +5778,7 @@
 
 .TP
 \fBfm_url\fR
-If set, slurm will use the configured URL to interface with the fabric
+If set, slurm will use the configured URL to interace with the fabric
 manager to enable Slingshot hardware collectives.
 Note \fBenable_stepmgr\fR needs to be set for hardware collectives to run.
 .IP
@@ -8318,4 +8326,4 @@
 \fBgetrlimit\fR(2), \fBgres.conf\fR(5), \fBgroup\fR(5), \fBhostname\fR(1),
 \fBscontrol\fR(1), \fBslurmctld\fR(8), \fBslurmd\fR(8),
 \fBslurmdbd\fR(8), \fBslurmdbd.conf\fR(5), \fBsrun\fR(1),
-\fBspank\fR(7), \fBsyslog\fR(3), \fBtopology.conf\fR(5)
+\fBspank\fR(8), \fBsyslog\fR(3), \fBtopology.conf\fR(5)
--- a/src/slurmctld/controller.c
+++ b/src/slurmctld/controller.c
@@ -70,6 +70,7 @@
 #include "src/common/log.h"
 #include "src/common/macros.h"
 #include "src/common/pack.h"
+#include "src/common/parse_value.h"
 #include "src/common/port_mgr.h"
 #include "src/common/proc_args.h"
 #include "src/common/read_config.h"
@@ -559,6 +560,44 @@
 	}
 }
 
+static void _close_acct_storage_conn(void)
+{
+	if (acct_db_conn)
+		acct_storage_g_close_connection(&acct_db_conn);
+
+	acct_storage_g_fini();
+	slurm_persist_conn_recv_server_fini();
+}
+
+extern uint16_t get_periodic_check_interval(void)
+{
+	static time_t config_update = (time_t) -1;
+	static uint16_t periodic_check_interval = PERIODIC_TIMEOUT;
+	char *tmp_ptr;
+	uint16_t tmp_interval = PERIODIC_TIMEOUT;
+
+	if (config_update == slurm_conf.last_update)
+		return periodic_check_interval;
+
+	if ((tmp_ptr = conf_get_opt_str(slurm_conf.slurmctld_params,
+					"periodic_check_interval="))) {
+		if (s_p_handle_uint16(&tmp_interval,
+				      "periodic_check_interval",
+				      tmp_ptr) ||
+		    !tmp_interval || (tmp_interval == INFINITE16)) {
+			error("SlurmctldParameters option periodic_check_interval=%s "
+			      "is invalid, using default %u",
+			      tmp_ptr, PERIODIC_TIMEOUT);
+			tmp_interval = PERIODIC_TIMEOUT;
+		}
+		xfree(tmp_ptr);
+	}
+
+	periodic_check_interval = tmp_interval;
+	config_update = slurm_conf.last_update;
+
+	return periodic_check_interval;
+}
 /* main - slurmctld main function, start various threads and process RPCs */
 int main(int argc, char **argv)
 {
@@ -2492,7 +2531,8 @@
 
 		validate_all_reservations(true);
 
-		if (difftime(now, last_timelimit_time) >= PERIODIC_TIMEOUT) {
+		if (difftime(now, last_timelimit_time) >=
+		    get_periodic_check_interval()) {
 			lock_slurmctld(job_write_lock);
 			now = time(NULL);
 			last_timelimit_time = now;
--- a/src/slurmctld/job_mgr.c
+++ b/src/slurmctld/job_mgr.c
@@ -9306,7 +9306,8 @@
 	}
 
 	/* Give srun command warning message about pending timeout */
-	if (job_ptr->end_time <= (now + PERIODIC_TIMEOUT * 2))
+	if (job_ptr->end_time <=
+	    (now + get_periodic_check_interval() * 2))
 		srun_timeout (job_ptr);
 
 	/*
@@ -18997,7 +18998,8 @@
 	    !(job_ptr->warn_flags & WARN_SENT) &&
 	    (ignore_time ||
 	     (job_ptr->warn_time &&
-	      ((job_ptr->warn_time + PERIODIC_TIMEOUT + time(NULL)) >=
+	      ((job_ptr->warn_time + get_periodic_check_interval() +
+		time(NULL)) >=
 	       job_ptr->end_time)))) {
 		/*
 		 * If --signal B option was not specified,
--- a/src/slurmctld/slurmctld.h
+++ b/src/slurmctld/slurmctld.h
@@ -2149,6 +2149,7 @@
  * resume_after - Resume a down|drain node after resume_after time.
  */
 extern void check_node_timers(void);
+extern uint16_t get_periodic_check_interval(void);
 
 /*
  * Send warning signal to job before end time.
--- a/testsuite/python/tests/test_141_1.py
+++ b/testsuite/python/tests/test_141_1.py
@@ -12,6 +12,7 @@
 suspend_time = 10
 suspend_timeout = 10
 resume_timeout = 10
+periodic_check_interval = 2
 
 
 @pytest.fixture(scope="module", autouse=True)
@@ -37,6 +38,9 @@
     # Mark nodes as IDLE, regardless of current state, when suspending nodes with
     # SuspendProgram so that nodes will be eligible to be resumed at a later time
     atf.require_config_parameter_includes("SlurmctldParameters", "idle_on_node_suspend")
+    atf.require_config_parameter_includes(
+        "SlurmctldParameters", f"periodic_check_interval={periodic_check_interval}"
+    )
 
     # Register the cloud node in slurm.conf
     atf.require_config_parameter(
@@ -106,6 +110,40 @@
 
 
 # Tests
+def test_periodic_check_interval():
+    """Test periodic_check_interval advances a CONFIGURING job after node registration."""
+    job_id = atf.submit_job_sbatch("-p cloud1 --wrap 'srun sleep 10'", fatal=True)
+    atf.wait_for_node_state(f"{node_prefix}1", "ALLOCATED", timeout=5, fatal=True)
+    atf.wait_for_node_state(f"{node_prefix}1", "POWERING_UP", fatal=True)
+    assert "CONFIGURING" == atf.get_job_parameter(
+        job_id, "JobState", default="NOT_FOUND", quiet=True
+    ), "Submitted job should be in CONFIGURING state while its ALLOCATED cloud node is POWERING_UP"
+
+    # TODO: Wait 2 seconds to avoid race condition between slurmd and slurmctld
+    # Remove once bug 16459 is fixed.
+    time.sleep(2)
+
+    atf.run_command(
+        f"{atf.properties['slurm-sbin-dir']}/slurmd -b -N {node_prefix}1 --conf 'feature=f1'",
+        fatal=True,
+        user="root",
+    )
+
+    atf.wait_for_node_state(
+        f"{node_prefix}1",
+        "POWERING_UP",
+        reverse=True,
+        timeout=resume_timeout + 5,
+        fatal=True,
+    )
+
+    assert atf.wait_for_job_state(
+        job_id,
+        "RUNNING",
+        timeout=periodic_check_interval + 5,
+    )
+
+
 # Test state cycle of cloud nodes: POWERED_DOWN, POWERING_UP, IDLE,
 # POWERING_DOWN, POWERED_DOWN
 def test_cloud_state_cycle():

