Cheers SLURM people,
We're seeing some intermittent job failures in our SLURM cluster, all with the
same 137 exit code. I'm having difficulty determining whether this code is
coming from SLURM itself (a timeout?) or from the Linux OS (process killed,
maybe out of memory).
In this example, there's WEXITSTATUS 137 in slurmctld.log, error:0 status
35072 in slurmd.log, and ExitCode 9:0 in the accounting log.
Does anyone have insight into how all of these correlate? I've spent a
significant amount of time digging through the documentation, and I don't see
a clear explanation of how to interpret them.
Example: Job: 62791
[root@XXXXXXXXXXXXX] /var/log/slurm# grep -ai jobid=62791 slurmctld.log
[2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791
InitPrio=4294845347 usec=679
[2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList= XXXXXXXXXXXXX
#CPUs=1 Partition=normal
[2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
[2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
[root@ XXXXXXXXXXXXX] /var/log/slurm# grep 62791 slurmd.log
[2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 ran for 0
seconds
[2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
[2020-08-13T11:17:45.280] [62791.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT,
error:0 status 35072
[2020-08-13T11:17:45.405] [62791.batch] done with job
[root@XXXXXXXXXXXXX] /var/log/slurm# sacct -j 62791
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
62791 nf-normal+ normal (null) 0 FAILED 9:0
[root@XXXXXXXXXXXXX] /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
JobID UID JobName Partition NNodes NodeList State
Start End Timelimit
62791 847694 nf-normal+ normal 1 XXXXXXXXXXX.+ FAILED
2020-08-13T10:58:29 2020-08-13T11:17:45 UNLIMITED
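For what it's worth, decoding the raw status in Python seems to tie at least two of these numbers together; this is just my own sanity check, and I'd be happy to be corrected:

```python
import os
import signal

# Raw status reported by slurmd.log: "error:0 status 35072"
raw_status = 35072

# 35072 looks like a standard wait(2) status word, with the exit
# code in the high byte: 35072 == 137 << 8.
assert os.WIFEXITED(raw_status)       # low byte 0 -> exited, not signalled
exit_code = os.WEXITSTATUS(raw_status)
print(exit_code)                      # 137, matching slurmctld's WEXITSTATUS

# 137 follows the shell convention of 128 + signal number, which
# would mean something inside the batch step got SIGKILL (9) --
# possibly lining up with the 9 in sacct's ExitCode 9:0.
print(exit_code - 128, int(signal.SIGKILL))   # 9 9
```

If that reading is right, the remaining question is who sent the SIGKILL (the kernel OOM killer, a cgroup limit, or SLURM enforcing a limit), which is exactly what I can't tell from the logs.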
Thank you!
Anthony
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf