Tim Landscheidt (2014-05-12 01:58):
...
How do I make sure I get a notification of job failure?
You can tell jsub, or more precisely the underlying qsub, to
send a notification on several occasions with the option
"-m" (cf. "man qsub"):

| [...]

|        -m b|e|a|s|n,...
|               Available  for  qsub,  qsh, qrsh, qlogin and
|               qalter only.

|               Defines or  redefines  under  which  circum‐
|               stances  mail is to be sent to the job owner
|               or to the users defined with the  -M  option
|               described  below.  The option arguments have
|               the following meaning:

|               `b'     Mail is sent at the beginning of the job.
|               `e'     Mail is sent at the end of the job.
|               `a'     Mail is sent when the job is aborted or
|                       rescheduled.
|               `s'     Mail is sent when the job is suspended.
|               `n'     No mail is sent.

|               Currently no mail is sent when a job is sus‐
|               pended.

| [...]

For example, "jsub -mem 10m -m a php" leads to:

| Job 821685 (php5) Aborted
|  Exit Status      = 139
|  Signal           = SEGV
|  User             = tools.wikilint
|  Queue            = [email protected]
|  Host             = tools-exec-07.eqiad.wmflabs
|  Start Time       = 05/11/2014 23:49:56
|  End Time         = 05/11/2014 23:49:56
|  CPU              = 00:00:00
|  Max vmem         = NA
| failed assumedly after job because:
| job 821685.1 died through signal SEGV (11)

NB: "Aborted" means aborted in the grid sense.  For example,
"jsub -mem 10m -m a false" will not generate a mail, even
though "false" exits with a non-zero status.  So you might
want to use "-m aes" and filter out notifications about
successful jobs in your mail client.
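
To make the difference between the two failures concrete: the exit status 139 in the mail above follows the usual shell convention of 128 plus the signal number (11 for SEGV), while "false" simply exits with status 1. The following is only a rough sketch; the helper name describe_status is hypothetical, and it says nothing about how the grid itself decides what counts as "aborted".

```shell
#!/bin/bash
# Hypothetical helper (not part of jsub or qsub): classify an exit status
# using the shell convention that a status above 128 means
# "terminated by signal (status - 128)".
describe_status() {
  local status=$1
  if [ "$status" -eq 0 ]; then
    echo "success"
  elif [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"
  else
    echo "exited with status $status"
  fi
}

# An ordinary failure, as produced by "false".
false || describe_status $?

# The exit status reported in the mail above.
describe_status 139
```

A job that ends like the first case is what "-m a" alone does not report, per the note above.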


Thanks. I've ended up mailing errors myself when the job fails without being aborted.

Below is my script in case anyone else wishes to use it.

#!/bin/bash

# Functions
# ----------------------------

function sendFailMail {
  # Mail the message passed in $1; %b expands escapes such as \n inside it.
  printf 'Subject: [job-fail] Something is wrong...\n\n%b\n' "$1" | /usr/sbin/exim -odf -i [email protected]
}

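function checkResult {
  # Call this directly after the command to check, while $? still holds its exit code.
  local result=$?
  if [ "$result" -ne 0 ]; then
    local message="[ERROR] exit code from command was non-zero: $result"
    echo "$message"
    # Quoted, so the whole message reaches sendFailMail as one argument.
    sendFailMail "$message"
  fi
}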

function checkLog {
  local logFile=$1
  local matchString="(Warning|Error|Notice):"
  local matches
  # Run grep once; it succeeds only if at least one line matches.
  if matches=$(grep -E "$matchString" "$logFile")
  then
    local message="Log contains errors or something strange..."
    message="$message\n\n$matches"
    # Quoted, so the line breaks between matching lines are preserved.
    echo -e "$message"
    sendFailMail "$message"
  fi
}

# Script
# ----------------------------

nowDate=$(date +"%Y-%m-%d %H:%M")
logFile="../dna_refresh-$(date +%Y-%m).php-out"

# Stop if the project directory is missing; the log path is relative to it.
cd /data/project/dna/public_html || exit 1
echo -e "\n\n-----------------------\n$nowDate\n- - - - - - - - - - - -\n" >> "$logFile"
php ./index.php &>> "$logFile"
#checkResult
checkLog "$logFile"
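
One thing to be aware of: the log file is per month and checkLog greps the whole file, so a single logged error would presumably trigger a mail on every later run in that month. A possible variation, sketched here with a hypothetical helper newLogMatches and a throwaway demo file, is to remember the line count before the run and only look at lines added afterwards.

```shell
#!/bin/bash
# Hypothetical helper: print the lines matching the pattern that were
# appended to the log after line number $2, i.e. by the current run only.
newLogMatches() {
  local logFile=$1 linesBefore=$2
  tail -n "+$((linesBefore + 1))" "$logFile" | grep -E "(Warning|Error|Notice):"
}

# Usage sketch with a temporary file standing in for the real log.
demoLog=$(mktemp)
echo "Error: left over from an earlier run" >> "$demoLog"
linesBefore=$(wc -l < "$demoLog")        # remember where this run starts
echo "Notice: something odd in this run" >> "$demoLog"
newLogMatches "$demoLog" "$linesBefore"   # reports the Notice line, not the old Error
rm -f "$demoLog"
```

In the script above, this would mean taking the line count just before the php call and passing it along to the check afterwards.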




_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
