Einar,

The strings in your $SLURM_JOB_ID values or host names are likely too long to 
serve as jobid for the Lustre Jobstats feature: the error reports a jobid of 
size 37 where 32 is expected.

You might try %H instead of %h in jobid_name. For reference, from the Lustre 
manual, https://doc.lustre.org/lustre_manual.xhtml#jobstats :

> %e print executable name
> %g print group ID number
> %h print fully-qualified hostname
> %H print short hostname
> %j print JobID from process environment variable named by the jobid_var 
> parameter
> %p print numeric process ID
> %u print user ID number
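
To see how much the choice of format codes matters for the length, here is a 
rough Python sketch that expands a jobid_name template and compares the result 
against the 32 from the error message. The job ID, UID, and host names are 
made up for illustration, and expand_jobid_name is a hypothetical helper, not 
anything Lustre provides. Whether the real limit also counts a terminating 
byte is not something I have checked; the sketch simply compares against 32.

```python
import re

# Limit taken from the "expect(32)" part of the error message (assumption:
# treated here as a plain character count).
JOBID_LIMIT = 32


def expand_jobid_name(template, values):
    """Replace each %X code in template with values[X]; unknown codes stay as-is."""
    def substitute(match):
        code = match.group(1)
        return str(values.get(code, match.group(0)))
    return re.sub(r"%([a-zA-Z])", substitute, template)


# Hypothetical values, not taken from any real system.
values = {
    "j": "12345678",                          # job ID from the scheduler
    "u": 50123,                               # numeric user ID
    "h": "compute-0123.cluster.example.org",  # fully-qualified hostname
    "H": "compute-0123",                      # short hostname
}

for template in ("%j:%u:%h", "%j:%u:%H"):
    jobid = expand_jobid_name(template, values)
    status = "too long" if len(jobid) > JOBID_LIMIT else "ok"
    print(f"{template} -> {jobid} ({len(jobid)} chars, {status})")
```

With values like these, the fully-qualified hostname alone uses up the whole 
budget, while the short hostname leaves room for the job and user IDs.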


On my system (2.12), I use:

        jobid_var=PBS_JOBID
        jobid_name=%e.%u

I get job_stats by $PBS_JOBID, as expected, from processes that actually have 
the variable set, and synthetic %e.%u values from all others, like processes on 
interactive or backup nodes. This has been working just fine to pinpoint the 
source of occasional trouble.

Curiously, I don't think the manual spells out what happens when the variable 
referenced by jobid_var is unset, i.e., the above fallback logic from jobid_var 
to jobid_name.
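
The fallback described above can be modeled roughly as follows. This is only a 
sketch of the behavior observed on a 2.12 client, not the actual Lustre logic; 
effective_jobid is a hypothetical name, and the job ID and fallback strings are 
made-up examples.

```python
import os


def effective_jobid(jobid_var, fallback, environ=os.environ):
    """Return the value of the variable named by jobid_var, else the fallback.

    `fallback` stands in for the already-expanded jobid_name string
    (for example "%e.%u" expanded to "rsync.0").
    """
    value = environ.get(jobid_var)
    return value if value else fallback


# A process started by the scheduler has the variable set (hypothetical value).
print(effective_jobid("PBS_JOBID", "bash.1000", {"PBS_JOBID": "4242.sched"}))

# A process on an interactive or backup node has no such variable.
print(effective_jobid("PBS_JOBID", "rsync.0", {}))
```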


With best regards,
-- 
Michael Sternberg, Ph.D.
Principal Scientific Computing Administrator
Center for Nanoscale Materials
Argonne National Laboratory




> On Aug 12, 2022, at 03:37, Einar Næss Jensen <[email protected]> 
> wrote:
> logfiles on oss servers are full of these error messages:
> Invalid jobid size (37), expect(32)
> What does it mean?
> 
> we have set this:
> [root@mds-1 ~]# lctl get_param jobid_var jobid_name
> jobid_var=SLURM_JOB_ID
> jobid_name=%j:%u:%h
> 
> lustre version is 2.12.6(ddn)
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
