[slurm-dev] Re: How to spread jobs among nodes?

2014-05-08 Thread Bjørn-Helge Mevik
tributed across many nodes. Also see the SelectTypeParameters configuration parameter CR_LLN to use the least loaded nodes in every partition. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Bjørn-Helge Mevik
Just a short note about terminology. I believe "processor equivalents" (PE) is a commonly used term for this. It is at least what Maui and Moab use, if I recall correctly. The "resource*time" would then be PE seconds (or hours, or whatever). -- Regards, Bjørn-Helge Mevik, dr
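The PE idea can be made concrete. A minimal sketch of PE accounting, assuming the Maui/Moab-style definition (the largest fractional resource request, scaled back to CPU units); the function names and node figures are illustrative, not from any Slurm or Maui API:

```python
def processor_equivalents(req_cpus, req_mem_mb, node_cpus, node_mem_mb):
    """PE: charge for whichever resource the job consumes the largest
    fraction of, expressed in CPU units (assumed Maui/Moab-style)."""
    frac = max(req_cpus / node_cpus, req_mem_mb / node_mem_mb)
    return frac * node_cpus

def pe_seconds(pe, elapsed_seconds):
    """The 'resource*time' quantity: PE multiplied by wallclock time."""
    return pe * elapsed_seconds

# A 1-CPU job taking half a 16-core node's memory is charged as 8 CPUs:
print(processor_equivalents(1, 32000, 16, 64000))  # 8.0
print(pe_seconds(8.0, 3600))                       # 28800.0
```

A memory-heavy job is thus charged for the CPUs it effectively blocks, not just the ones it asked for.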

[slurm-dev] Customized error messages from job_submit.lua?

2014-08-07 Thread Bjørn-Helge Mevik
that are printed on the user's stderr? If so, how? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Customized error messages from job_submit.lua?

2014-08-07 Thread Bjørn-Helge Mevik
Thanks! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] make check fails with 14.03.6

2014-08-15 Thread Bjørn-Helge Mevik
agnu dejagnu-1.4.4-17.el6.noarch # rpm -q check check-0.9.8-1.1.el6.x86_64 Is there something else we are missing? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: make check fails with 14.03.6

2014-08-18 Thread Bjørn-Helge Mevik
writes: > As far as I can tell, it has been this way (broken) forever. It will be fixed > in 14.03.7. Thx! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Suggested fixes for slurm test suite

2014-08-26 Thread Bjørn-Helge Mevik
nfortunately, I don't know enough Expect (tcl?) to suggest how to implement that. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Bug in sgather

2014-08-26 Thread Bjørn-Helge Mevik
ient. It would also be nice if the node-global destinations could be configurable, instead of being hard-coded in the script (or at least be set at the top of the script). For instance, on our system, the node-global file systems are /work and /cluster, not /scratch and /home. -- Regards, B

[slurm-dev] UserCPU etc. for subprocesses not registered when a job times out.

2014-09-12 Thread Bjørn-Helge Mevik
COMPLETED 00:01:07 01:03.980 00:02.207 01:06.187
43        COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403
43.batch  COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403
i.e., time spent in subprocesses is reported. -- Regards, Bjørn-Helge Mevik, dr. scient, Department fo
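Comparing the UserCPU/SystemCPU/TotalCPU columns against Elapsed programmatically requires parsing the sacct time formats shown above. A small local helper for exactly these formats (a sketch, not part of Slurm):

```python
def sacct_seconds(field):
    """Convert an sacct time field to seconds.
    Handles the forms seen above: "MM:SS.mmm" and "[DD-]HH:MM:SS"."""
    days = 0
    if "-" in field:
        d, field = field.split("-", 1)
        days = int(d)
    parts = [float(p) for p in field.split(":")]
    if len(parts) == 2:                   # MM:SS.mmm
        hours, minutes, seconds = 0.0, parts[0], parts[1]
    else:                                 # HH:MM:SS
        hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds

# TotalCPU of the batch step vs. its Elapsed time:
print(sacct_seconds("01:07.403"))
print(sacct_seconds("00:01:08"))  # 68.0
```

With both values in seconds, a TotalCPU close to (or, for multi-threaded jobs, a multiple of) Elapsed confirms that subprocess time was gathered.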

[slurm-dev] Override memory limits with --exclusive?

2014-09-18 Thread Bjørn-Helge Mevik
like this? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Override memory limits with --exclusive?

2014-09-19 Thread Bjørn-Helge Mevik
Thanks for the tip! We actually already have a setup where "srun --ntasks=$SLURM_JOB_NUM_NODES /bin/true" is run at the start of every job, so we're definitely going to look into this. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] slurmctld crashes on testsuite in 14.03.[8--10]

2014-11-06 Thread Bjørn-Helge Mevik
_bitstr_bits(tmp_qos_bitstr) was still 26. Any help in figuring out what goes wrong (or how to fix it :) is appreciated! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] _slurm_cgroup_destroy message?

2014-11-18 Thread Bjørn-Helge Mevik
the .../uid_NNN directories are removed. Does anyone know what these messages mean? Should we just ignore them? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: _slurm_cgroup_destroy message?

2014-11-19 Thread Bjørn-Helge Mevik
ng the test suite for versions 14.03.8--14.03.10, we didn't upgrade to .10.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preemption, requeue and checkpointing?

2015-01-09 Thread Bjørn-Helge Mevik
I second this wish. :) -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Expanding TotalCPU to include child processes

2015-03-04 Thread Bjørn-Helge Mevik
clude CPU time of child processes.” In my experience, that description might not be accurate. It seems also child processes are included, as long as the job doesn't time out. Here is an email I wrote about it last year: From: Bjørn-Helge Mevik Subject: [slurm-dev] UserCPU etc. for subprocess

[slurm-dev] "Nested cgroup" messages

2015-03-17 Thread Bjørn-Helge Mevik
-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: "Nested cgroup" messages

2015-03-18 Thread Bjørn-Helge Mevik
Thanks! I'll try to apply that patch. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Bjørn-Helge Mevik
rds, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Bjørn-Helge Mevik
pi itself? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-17 Thread Bjørn-Helge Mevik
Thanks! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] (Custom) warnings from job_submit.lua?

2015-05-11 Thread Bjørn-Helge Mevik
.03.7, btw.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: (Custom) warnings from job_submit.lua?

2015-05-12 Thread Bjørn-Helge Mevik
Ok, thanks. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Off-topic: What accounting system do you use?

2015-06-24 Thread Bjørn-Helge Mevik
accounting. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
.03.7. Perhaps this has changed in later versions? Also, "nothing is ever easy": we want to account not CPU hours, but PE (processor equivalents) hours. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
Christopher Samuel writes: > http://karaage.readthedocs.org/en/latest/introduction.html Karaage looks interesting for managing projects and users. Can it manage usage limits? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
atabase makes Gold quite slow, so we have had to add quite a lot of error checking and handling in the prolog and epilog scripts. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Understand GrpCPUMins

2015-06-30 Thread Bjørn-Helge Mevik
LIMIT *** That usually means the job tried to run longer than its --time specification. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Issues with --switches option

2015-09-03 Thread Bjørn-Helge Mevik
h IB1 or all with IB2. Search for "Matching OR" in the sbatch man page for details. (We used this on our previous cluster, which had two different IB networks.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-10-05 Thread Bjørn-Helge Mevik
thorough, unfortunately, but according to https://lists.fedoraproject.org/pipermail/mingw/2012-January/004421.html the .la files are only needed in order to link against static libraries, and since Slurm doesn't provide any static libraries, I guess it would be safe for the slurm-devel rpm not to

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Bjørn-Helge Mevik
we activated checkpointing. When slurmctld started, the checkpointing plugin expected some extra data in the job states, which obviously wasn't there, and slurmctld decided the data was invalid and killed all jobs. (I don't know if this is still a problem.) -- Regards, Bjørn-Helge

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-14 Thread Bjørn-Helge Mevik
Thanks. Nice to know! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: [slurm-devel] update SLURM 2.6.7 to SLURM 15.0.8.4

2015-11-16 Thread Bjørn-Helge Mevik
with static libraries, which slurm does _not_ install. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preempting without account limits

2015-12-14 Thread Bjørn-Helge Mevik
s enough. We have the partition because our lowpri jobs are allowed to run on special nodes (like hugemem or accelerator nodes) that normal jobs are not allowed to use.) I hope this made sense to you. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: cgroups and memory accounting

2015-12-15 Thread Bjørn-Helge Mevik
e data between several processes, the shared space will be counted once for each process(!). Cgroups seems to count the shared data only once. So if a process is killed by oom instead of by slurm, it is probably not due to shared data. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Re

[slurm-dev] Re: cgroups and memory accounting

2015-12-18 Thread Bjørn-Helge Mevik
Felip Moll writes: > I will try JobAcctGatherParams to NoShared. > > This is an example of job step being killed. It's being killed by oom, but > it's invoked by cgroups: Since the job was killed by the OOM killer, NoShared will not help. It does not affect cgroups. -- Rega

[slurm-dev] Re: Preempting without account limits

2015-12-18 Thread Bjørn-Helge Mevik
"Wiegand, Paul" writes: > This worked. Thank you Bjørn-Helge. You're welcome! :) -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: cgroups and memory accounting

2015-12-18 Thread Bjørn-Helge Mevik
n a process needs more memory instead of killing the process. If I'm correct, oom will _not_ kill a job due to cached data. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Bug and suggested fix in testsuite test 14.10

2016-02-25 Thread Bjørn-Helge Mevik
Test 14.10 in the test suite (of slurm 15.08.8, at least) uses $sinfo -tidle -h -o%n to find idle nodes. This only works if NodeHostname == NodeName on the nodes. The following should work regardless of this: $scontrol show hostnames \$($sinfo -tidle -h -o%N) -- Regards, Bjørn-Helge

[slurm-dev] Re: Kill Signals Sent By SLURM

2016-02-26 Thread Bjørn-Helge Mevik
obs just before they time out. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Patch for health check during slurmd start

2016-03-03 Thread Bjørn-Helge Mevik
writes: > We are looking for comments and feedback on this proposed behavior [...] > +#define HEALTH_RETRY_DELAY 10 Have you thought about using the health_check_interval instead? Or make it a separate configurable option? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Re

[slurm-dev] Inconsistent reporting of errors in #SBATCH lines

2016-03-11 Thread Bjørn-Helge Mevik
$ sbatch empty-jobname.sm
sbatch: option requires an argument -- 'J'
Submitted batch job 14221261
$
A more consistent behaviour would have been nice. My suggestion is: report an error and fail to submit the job. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
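The suggested behaviour (report an error and refuse to submit) can be approximated locally with a pre-submission check in a site wrapper. A sketch; the set of checked options and the function name are assumptions for illustration, not sbatch behaviour:

```python
import re

def check_sbatch_lines(script_text):
    """Flag #SBATCH options that are missing their required argument,
    so a wrapper can refuse to submit instead of silently ignoring them.
    (Local sketch; the option list here is illustrative.)"""
    needs_arg = {"-J", "--job-name", "-t", "--time", "--mem"}
    errors = []
    for lineno, line in enumerate(script_text.splitlines(), 1):
        m = re.match(r"#SBATCH\s+(\S+)(?:\s+(\S+))?", line)
        if m and m.group(1) in needs_arg and m.group(2) is None:
            errors.append(f"line {lineno}: option {m.group(1)} requires an argument")
    return errors

script = "#!/bin/bash\n#SBATCH -J\n#SBATCH --time 01:00:00\n"
print(check_sbatch_lines(script))  # ['line 2: option -J requires an argument']
```

A wrapper would run this and exit non-zero before ever calling sbatch, giving the consistent "error and no submit" behaviour.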

[slurm-dev] What cluster provisioning system do you use?

2016-03-15 Thread Bjørn-Helge Mevik
provisioning tool? - A locally developed solution? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Regards Postgres Plugin for SLURM

2016-03-29 Thread Bjørn-Helge Mevik
+1 -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Slurm no longer optimized by default

2016-04-25 Thread Bjørn-Helge Mevik
I just noticed that as of 14.11.6, optimization is turned off (-O0) by default when building slurm. Is there any reason not to use --disable-debug when building slurm for a production cluster? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: using gdb to debug slurm-15.08?

2016-04-27 Thread Bjørn-Helge Mevik
md appear to be Just a note: I tried this (for a different reason), but found out it didn't have any effect (gather the output to a log file and look at the gcc lines). However, if I did -D '%with_cflags CFLAGS="-O0 -g3"' (i.e., removed the initial "_"), it had the

[slurm-dev] Re: QoS TRES limits

2016-08-02 Thread Bjørn-Helge Mevik
low" qos, and to get that to work, we've found that we must put the account's normal limits on a qos, not on the account itself. Usually this means that we have a qos for each account, and then a common "low" qos. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: The canonical way to write to user's output (stderr) log file on end of job

2016-08-30 Thread Bjørn-Helge Mevik
;, which prints out resource usage. As long as users remember to source the setup file, they get the usage statistics in the bottom of their stdout file. Not very elegant, but it works. -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Prolog script (maybe) question?

2016-09-15 Thread Bjørn-Helge Mevik
To me, this sounds like a job for a job submit plugin, for instance job_submit.lua. That way you could reject the job before it gets submitted into the queue. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: How to use the EpilogSlurmctld to print job statistics

2016-10-13 Thread Bjørn-Helge Mevik
to generate > my report. Does this approach make sense or are there better > alternatives. sacct can also give you the submit time, start time, end time and elapsed time. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Restrict users to see only jobs of their groups

2016-11-02 Thread Bjørn-Helge Mevik
There is a plugin under development, that will/might provide those features. It was presented at SLUG 16: http://slurm.schedmd.com/SLUG16/MCS.pdf -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Restrict users to see only jobs of their groups

2016-11-02 Thread Bjørn-Helge Mevik
apart from that, this is my understanding too. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] No longer possible to use scancel in PrologSlurmctld?

2016-11-21 Thread Bjørn-Helge Mevik
I'd guess it should have been possible to use scancel in PrologSlurmctld also in 15.08.12. Does anyone know if this is an intentional change (and SchedMD just forgot to update the docs) or a bug? (I haven't found anything relevant in the NEWS file or on bugs.schedmd.com.) -- Regards, B

[slurm-dev] Re: max submit tasks

2016-11-22 Thread Bjørn-Helge Mevik
access to more than one account, I can use 1000 cpus in each account. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: max submit tasks

2016-11-22 Thread Bjørn-Helge Mevik
Jordan Willis writes: >Thank you, >Can you confirm that this will take an update from SLURM 14.11.15 to >current? I never ran 14.11, but in 14.03, you can use GrpCPUs=1000 instead of GrpTRES=cpu=1000. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research

[slurm-dev] Re: Unrestricted use of a node

2016-12-05 Thread Bjørn-Helge Mevik
tell this Slurm. - OK, I can ask for 24 cores and 64 GB in a node, but then I do not get the chance to run on 12 cores/32 GB. For the memory part, you could specify --mem=0. That will allocate all of the memory on whichever node the job lands on. For the number of cores, I don't know.

[slurm-dev] Re: Unrestricted use of a node

2016-12-05 Thread Bjørn-Helge Mevik
cpus it is allowed to use.) This is on 15.08.12. YMMV. -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Submit job with maximum ntasks per node

2016-12-14 Thread Bjørn-Helge Mevik
Check out the thread on this list about a week ago, titled "Unrestricted use of a node". (In short, --exclusive with --mem=0 or --mem-per-cpu=0 might be more or less what you want.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-16 Thread Bjørn-Helge Mevik
size" [1] (JobAcctGatherParams=UsePss), which is what cgroup uses (I believe), and sounds like the best estimate to me. [1] https://en.wikipedia.org/wiki/Proportional_set_size -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
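The difference between the two accounting views can be shown with toy numbers: summing RSS over processes counts shared pages once per process, while PSS divides each shared page among the processes mapping it. The figures below are invented purely for illustration:

```python
# Toy model: two processes share 100 MB of data; each also has 50 MB private.
private_mb = [50, 50]
shared_mb = 100

# RSS-style sum: the shared region is counted once per process.
rss_total = sum(p + shared_mb for p in private_mb)

# PSS-style sum: each shared page is split among its mappers.
pss_total = sum(p + shared_mb / len(private_mb) for p in private_mb)

print(rss_total, pss_total)  # 300 200.0
```

This is why an RSS-based gatherer can report far more memory "used" than the node actually has, while a PSS-based (or cgroup-based) view stays close to reality.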

[slurm-dev] Re: Slurm mail domain?

2017-03-02 Thread Bjørn-Helge Mevik
omain config parameter was added in Slurm 17.02. A different option would be to configure your sendmail to accept domain-less mails (and perhaps add a default domain itself). -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature

[slurm-dev] Re: Job-Specific Working Directory on Local Scratch

2017-03-14 Thread Bjørn-Helge Mevik
file names to a dot file in $SCRATCH)
- The Epilog copies any registered files back to the job submit dir (it uses "su - $USER" when doing this).
- The Epilog deletes the directory.
-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Slurmdbd Perl api?

2017-04-03 Thread Bjørn-Helge Mevik
ke use Slurmdb qw(:all SLURMDB_ADD_USER); $what = SLURMDB_ADD_USER(); just gives the error "SLURMDB_ADD_USER is not a valid Slurmdb macro" -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-02 Thread Bjørn-Helge Mevik
u.se/~kent/python-hostlist/ by Kent Engström at NSC. It's simple to install this as an RPM package, see https://wiki.fysik.dtu.dk/niflheim/SLURM#expanding-host-lists For the simple case you show, you could just use
$ scontrol show hostnames a[095,097-098]
a095
a097
a098
-- Regards,

[slurm-dev] Re: How to get pids of a job

2017-05-16 Thread Bjørn-Helge Mevik
e pids. Not that I know of, but it should be possible to script. > And how to parse the nodelist like "cn[11033,11069],gn[1103-1120]" ? scontrol show hostnames cn[11033,11069],gn[1103-1120] -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
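When scontrol is not at hand, the bracket syntax can also be expanded in plain Python. A simplified re-implementation of what `scontrol show hostnames` does; it handles comma lists and zero-padded ranges but not nested brackets (a sketch, not Slurm's real parser):

```python
import re

def expand_hostlist(hostlist):
    """Expand "cn[11033,11069],gn[1103-1120]" into individual hostnames.
    Simplified: at most one bracket group per name, zero-padding kept."""
    hosts = []
    # Match "prefix" optionally followed by one "[...]" group,
    # skipping the commas that separate top-level entries.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", hostlist):
        m = re.match(r"(.*)\[([^\]]*)\]$", part)
        if not m:
            hosts.append(part)          # plain hostname, no brackets
            continue
        prefix, spec = m.groups()
        for item in spec.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                for n in range(int(lo), int(hi) + 1):
                    hosts.append(f"{prefix}{n:0{len(lo)}d}")
            else:
                hosts.append(prefix + item)
    return hosts

print(expand_hostlist("a[095,097-098]"))  # ['a095', 'a097', 'a098']
```

The python-hostlist package mentioned above does the same job more robustly, including nested and mixed expressions.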

[slurm-dev] Re: thoughts on task preemption

2017-05-23 Thread Bjørn-Helge Mevik
tup, the first option is preferable: just putting it on the queue and letting it wait until its turn. But of course, there are other setups where the second option would be best. Could you perhaps make it configurable, so a site can choose? -- Regards, Bjørn-Helge Mevik, dr. scient, Department

[slurm-dev] Re: Accounting: preventing scheduling after TRES limit reached (permanently)

2017-06-06 Thread Bjørn-Helge Mevik
ric usage. Then you can set FairshareWeight to 0 and use the Grp*Mins parameters to set hard limits. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: #SBATCH --time= not always overriding default?

2017-06-30 Thread Bjørn-Helge Mevik
dictable, both for the programmer and for the user. It is by design, because people often need to give arguments or options to their jobscript, e.g., sbatch --time=1-0:0:0 myjob.sh inputfile -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Prolog and sbatch

2017-07-02 Thread Bjørn-Helge Mevik
ch), but looks like this assumption is wrong? That is right, that is wrong. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Rolling maintenance jobs

2017-08-02 Thread Bjørn-Helge Mevik
feature from the node, and then request themselves to be requeued. Prior to submitting the jobs, we add the "fixme" feature to all nodes needing maintenance. (In reality, our setup is a little more complex, since it includes reinstalling the OS on the nodes, but the principle is the same

[slurm-dev] Re: Rolling maintenance jobs

2017-08-03 Thread Bjørn-Helge Mevik
will be looking at this feature again. Thanks for the tip! :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preemtion and signals

2017-10-09 Thread Bjørn-Helge Mevik
soon as the signal arrives. I got bitten by this behaviour trying to do exactly the same thing that you did. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preemtion and signals

2017-10-10 Thread Bjørn-Helge Mevik
rong with how my partitions are defined? That sounds unlikely, IMO. -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Bjørn-Helge Mevik
of small, distributed jobs running, and a long queue of pending jobs), I personally wouldn't want schedmd to sacrifice that for making updates of node lists easier. Especially since I haven't seen the problem JinSung Kang reports. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for R

[slurm-dev] Re: How to strictly limit the memory per CPU

2017-11-03 Thread Bjørn-Helge Mevik
job for the lua job submit plugin (job_submit.lua). It can check what users have specified, write out custom errors or change the settings of jobs. -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Questions about SlurmUser and file permissions

2012-03-22 Thread Bjørn-Helge Mevik
Moe Jette writes: > You findings are correct. I would like to leave the log file creation > in the same place (as soon as possible) instead of moving after > changing the user and group ID. The following patch creates the file > at the same time, but immediately changes the file owner. Rat

[slurm-dev] Re: chroot usage for jobs in slurm

2012-04-13 Thread Bjørn-Helge Mevik
uster for sensitive data, in which there should be no (or as few as possible) possibilities for information to leak between jobs. -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] List pending reboots?

2012-04-25 Thread Bjørn-Helge Mevik
her a given node has a reboot pending? -- Cheers, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: List pending reboots?

2012-05-02 Thread Bjørn-Helge Mevik
egards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] ETA for slurm 2.4 or 2.4 RC?

2012-05-04 Thread Bjørn-Helge Mevik
We are in the process of setting up a new cluster. It is supposed to go in production by the end of June, and we would prefer to have slurm 2.4 on it. Do you have any plans/ideas for when 2.4 (or at least release candidates for 2.4) will be out? -- Regards, Bjørn-Helge Mevik, dr. scient

[slurm-dev] Re: ETA for slurm 2.4 or 2.4 RC?

2012-05-07 Thread Bjørn-Helge Mevik
Moe Jette writes: > We hope to have a v2.4.0-rc1 in a couple of weeks and release 2.4 a > few weeks later. Very nice! -- Cheers, Bjørn-Helge Mevik

[slurm-dev] Gres: documentation discrepancy

2012-05-08 Thread Bjørn-Helge Mevik
llocated all of the generic resources that have allocated to the job." Which one is correct? (I'm voting on man srun. :) -- Cheers, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Is default for NodeAddr NodeName or NodeHostname?

2012-05-25 Thread Bjørn-Helge Mevik
Disk=1 Weight=1027 BootTime=2012-05-08T15:07:08 SlurmdStartTime=2012-05-25T10:30:10 (This is with 2.4.0-0.pre4.) (We are planning to use cx-y instead of compute-x-y (the rocks default) on our next cluster, to save some typing.) -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computin

[slurm-dev] Re: How to setup local disk as gres?

2012-07-02 Thread Bjørn-Helge Mevik
ThreadsPerCore=1 TmpDisk=0 Weight=666 BootTime=2012-06-13T16:20:49 SlurmdStartTime=2012-07-02T16:32:31 -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: How to setup local disk as gres?

2012-07-03 Thread Bjørn-Helge Mevik
G when they want 10 GB. :) Thanks for quick response! -- Cheers, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] problem with sstat -j jobid.batch when #nodes > 1

2012-07-03 Thread Bjørn-Helge Mevik
-03T17:11:02] [195.0] done with job
[2012-07-03T17:11:02] error: stat_jobacct for invalid job_id: 195
[2012-07-03T17:11:02] debug: _rpc_terminate_job, uid = 501
[2012-07-03T17:11:02] debug: task_slurmd_release_resources: 195
Is there something wrong here, or are we doing something wrong? --

[slurm-dev] Questions about the task/cgroup plugin

2012-09-04 Thread Bjørn-Helge Mevik
not turn off logging altogether)? (Any other comments and suggestions about the task/cgroup use are also welcome!) -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: Questions about the task/cgroup plugin

2012-09-05 Thread Bjørn-Helge Mevik
inated by the OOM killer. It is > not perfect, but works 90% of the time. I can send it to you if > you like. Yes, I'd very much like that! Jobs killed by the memory limit are quite common on our cluster, and users get confused if there is no message telling them why the job died. Than

[slurm-dev] Re: Questions about the task/cgroup plugin

2012-09-06 Thread Bjørn-Helge Mevik
" lua plugin to the slurm-spank-plugins > project on google code called oom-detect.lua. You can browse the > code here: Thanks! -- Cheers, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: Srun to use resources from salloc/sbatch? Another approach to array jobs?

2012-09-15 Thread Bjørn-Helge Mevik
available. Using srun within a fixed allocation limits your subjobs to the resources in that allocation. (Now that slurm jobs can grow and shrink, there might be a way to avoid this.) -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Excessive requeueing of jobs.

2012-10-12 Thread Bjørn-Helge Mevik
11T14:59:10 1:0 8 c11-13 (There is a node-failure in there, and the job failed when it finally got to run long enough.) Apart from a short period around 21:00 on the 10th, fewer than 7,000 of the ~10,000 cores were used. -- Bjørn-Helge Mevik, dr. scient, Research Computing Ser

[slurm-dev] Re: SLURM without shared home?

2012-11-20 Thread Bjørn-Helge Mevik
ries once, but it was a kludge. We ended up writing quite complex prolog and epilog scripts that copied files back and forth, and had to put quite strict limits on how and where to submit jobs. Just my 2¢. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: Slurm, RHEL6, cgroups and not constraining memory

2013-01-18 Thread Bjørn-Helge Mevik
t;) RAM. Try filling the allocated memory with some values, and you will probably see that after filling 4 GiB, the job is killed. -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: Slurm, RHEL6, cgroups and not constraining memory

2013-01-21 Thread Bjørn-Helge Mevik
Christopher Samuel writes: > On 18/01/13 19:53, Bjørn-Helge Mevik wrote: > >> I don't know if this is the reason in your case, but note that cgroup >> in slurm constrains _resident_ RAM, not _allocated_ ("virtual") RAM. > > Hmm, as a sysadmin that doesn&#x

[slurm-dev] Re: Slurm, RHEL6, cgroups and not constraining memory

2013-01-23 Thread Bjørn-Helge Mevik
here is no message about why in slurm-nnn.out. Then we get an RT ticket, and have to grep in /var/log/messages. -- Cheers, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: Problem with backfill and patch for solution

2013-03-01 Thread Bjørn-Helge Mevik
ill make backfill look at the whole queue eventually. Interesting. We will take a look at this. -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] How to emulate qsub's "-sync y"/"-Wblock=true"

2013-03-05 Thread Bjørn-Helge Mevik
ls the queue system until the job has finished. -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: How to emulate qsub's "-sync y"/"-Wblock=true"

2013-03-06 Thread Bjørn-Helge Mevik
en launch the main program, and then perhaps do some cleanup afterwards. Thus one wouldn't want the job script itself to be run in parallel. -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Strange value for CUDA_VISIBLE_DEVICES

2013-04-10 Thread Bjørn-Helge Mevik
res:2, CUDA_VISIBLE_DEVICES gets the value 0,1633906540 Is this correct? Are we doing something wrong? (This is slurm 2.4.3, running on Rocks 6.0 based on CentOS 6.2.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Strange value of CUDA_VISIBLE_DEVICES

2013-04-11 Thread Bjørn-Helge Mevik
rds, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Strange value of CUDA_VISIBLE_DEVICES

2013-04-16 Thread Bjørn-Helge Mevik
Gary Brown writes: > FYI, the value 1633906540 in hex is 61636f6c, which is ASCII "acol" and > usually points to some kind of buffer overrun bug. Thanks for the tip! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
