We saw a big spike in active Grid Engine jobs starting around 2017-08-01T00:00. I've been looking at the list of active jobs and noticed that several tools had many copies of the same job running. Some tools are designed to run several copies of the same job, all working from a shared queue of some sort, but often this is a sign that something is wrong with the script.

Here's a fancy shell pipeline that will give you a list of all of your tool's running jobs grouped by job name and sorted by start time:

    qstat -xml | tr '\n' ' ' | sed 's#<job_list[^>]*>#\n#g' | sed 's#<[^>]*>##g' | grep " " | column -t | awk 'BEGIN { OFS="\t" } {print $1, $3, $6, $5}' | sort -n -k 3 | sort -s -k 2

You can use this to see whether you have parallel jobs running and, if so, when the "stuck" jobs started. It seems that there may have been some database-related events between 2017-07-31T23:00 and 2017-08-01T06:00 that left a bunch of jobs stuck in a bad state internally.

To keep your cron-scheduled jobs from running in parallel, you can add the `-once` flag to the job submission command in your crontab. Either `jsub -once ...` or `qcronsub ...` will do this for you. When the once flag is active, jsub and qcronsub look for jobs that your tool is already running. If there is an active job with the same name, the new job will *not* be started and an error message will be logged. The name is either provided explicitly with `-N ....` or derived automatically from the command if `-N` is not used.

(This should probably end up on wikitech in the help somewhere...)

Bryan
--
Bryan Davis              Wikimedia Foundation    <bd...@wikimedia.org>
[[m:User:BDavis_(WMF)]]  Manager, Cloud Services          Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855
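
As a possible follow-up to the qstat pipeline above, here is a minimal sketch of a filter that lists only the job names with more than one running copy. The function name `duplicate_jobs` is hypothetical, and the sketch assumes the pipeline's output is four tab-separated columns with the job name in the second column (job number, job name, start time, state), which is what the awk step appears to produce. It is not part of jsub or qcronsub.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine or jsub.
# Reads tab-separated lines on stdin, assumed to be:
#   job number <TAB> job name <TAB> start time <TAB> state
# Prints "count<TAB>name" for every job name that appears more than once.
duplicate_jobs() {
    awk -F '\t' '
        { count[$2]++ }                  # tally jobs per name (column 2)
        END {
            for (name in count)
                if (count[name] > 1)     # keep only names with parallel copies
                    printf "%d\t%s\n", count[name], name
        }
    ' | sort -k 2                        # stable, readable order by job name
}
```

Piping the output of the qstat pipeline into `duplicate_jobs` should then print one line per job name that currently has parallel copies. Those are the jobs worth checking for a missing `-once` flag.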