We saw a big spike of active Grid Engine jobs starting around
2017-08-01T00:00. I've been looking at the list of active jobs and
noticed that several tools had a lot of copies of the same job
running. Some tools are designed to run several copies of the same
job, all working from a shared queue of some sort, but often this is
a sign that something is wrong with the script.

Here's a fancy shell pipeline that will give you a list of all of your
tool's running jobs grouped by job name and sorted by start time:

  qstat -xml |
  tr '\n' ' ' |
  sed 's#<job_list[^>]*>#\n#g' |
  sed 's#<[^>]*>##g' |
  grep " " |
  column -t |
  awk 'BEGIN { OFS="\t" } {print $1, $3, $6, $5}' |
  sort -n -k 3|sort -s -k 2

You can use this to see if you have parallel jobs running and if so
when the "stuck" jobs started. It seems that there may have been some
database related events happening between 2017-07-31T23:00 and
2017-08-01T06:00 that left a bunch of jobs stuck in a bad state
internally.
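
If the list is long, a small helper can summarize it. This is only a
sketch: `count_duplicate_jobs` is a name made up for illustration, and
it assumes its input is the tab-separated output of the pipeline above
with the job name in the second column.

```shell
# Hypothetical helper, not part of Grid Engine or the Toolforge tools.
# Reads tab-separated lines on stdin, with the job name in column 2,
# and prints "count<TAB>name" for every name that appears more than
# once, highest count first.
count_duplicate_jobs() {
  awk -F'\t' '
    { seen[$2]++ }
    END {
      for (name in seen)
        if (seen[name] > 1)
          printf "%d\t%s\n", seen[name], name
    }
  ' | sort -rn
}
```

Append `| count_duplicate_jobs` to the end of the pipeline to use it.
No output means no job name is running more than once.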

To keep your cron-scheduled jobs from running in parallel, you can add
the `-once` flag to the job submission in your crontab. Either
`jsub -once ...` or `qcronsub ...` will do this for you. When the once
flag is active, jsub and qcronsub look for jobs that your tool is
already running. If there is an active job with the same name, the new
job will *not* be started and an error message will be logged. The
name is either provided explicitly with `-N ....` or generated
automatically from the command if `-N` is not used.
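
As a rough illustration, a crontab entry using the flag might look
like the one below. The schedule, the job name `update-stats`, and the
script path are all made up for the example; only `jsub`, `-once`, and
`-N` come from the description above.

```
# Hypothetical crontab entry: every 10 minutes, submit the job only if
# no job named "update-stats" is already active for this tool.
*/10 * * * * jsub -once -N update-stats /data/project/mytool/update-stats.sh
```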

(This should probably end up on wikitech in the help somewhere...)

Bryan
-- 
Bryan Davis              Wikimedia Foundation    <bd...@wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services          Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855