#912: bibsched queue status report improvement
--------------------------+-----------------
  Reporter:  skaplun      |      Owner:
      Type:  enhancement  |     Status:  new
  Priority:  major        |  Milestone:
 Component:  BibSched     |    Version:
Resolution:               |   Keywords:
--------------------------+-----------------

Comment (by simko):

 Here are some old musings of mine on this topic, taken from a few sent
 emails, in case they are useful for drafting the specs:

 {{{
 * Wed, 21 Apr 2010 15:46:22 +0200

 I'd say AFFECTED = MANUAL + some interactive non-daemon tasks that have
 been waiting for execution for more than CFG_BIBSCHED_AFFECTED_THRESHOLD
 minutes.  By default we can set it to 10 minutes or so.  (And yes, ERROR
 would naturally lead to AFFECTED.)

 For example, for INSPIRE, there is only one bibupload job per day coming
 from SPIRES.  So the queue can stay all day long in the MANUAL mode, and
 still everything is OK on the server side, the system is not affected.

 (Provided that the records were indexed and webcoll'ed.  So, if we want
 to be very precise in our system health reporting, then we should also
 check MAX(bibrec.modification_date) and compare it with bibindex's
 global index's last updated timestamp and webcoll's last updated
 timestamp.  (The ranking timestamps are not crucial.))
 }}}
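
 A minimal sketch of the AFFECTED rule described in this email, to help
 with drafting the specs.  The WaitingTask class, its field names, and both
 function names are hypothetical and are not part of the bibsched code;
 only CFG_BIBSCHED_AFFECTED_THRESHOLD and its 10-minute default come from
 the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Setting proposed in the email; 10 minutes is the suggested default.
CFG_BIBSCHED_AFFECTED_THRESHOLD = 10  # minutes


@dataclass
class WaitingTask:
    """Hypothetical stand-in for one waiting task in the bibsched queue."""
    name: str
    interactive: bool
    daemon: bool
    waiting_since: datetime


def is_affected(queue_mode, tasks, now, has_errors=False):
    """Return True if the queue should be reported as AFFECTED."""
    # "ERROR would naturally lead to AFFECTED."
    if has_errors:
        return True
    if queue_mode != "MANUAL":
        return False
    limit = timedelta(minutes=CFG_BIBSCHED_AFFECTED_THRESHOLD)
    # MANUAL mode plus an interactive non-daemon task waiting too long.
    return any(
        t.interactive and not t.daemon and now - t.waiting_since > limit
        for t in tasks
    )


def timestamps_ok(last_record_modification, last_index_update,
                  last_webcoll_update):
    """Check that indexing and webcoll ran after the latest record change.

    The three arguments stand for MAX(bibrec.modification_date), the
    global index's last update, and webcoll's last update.
    """
    return (last_index_update >= last_record_modification
            and last_webcoll_update >= last_record_modification)
```

 With this sketch, a queue left in MANUAL mode with nothing waiting (the
 INSPIRE case above) is not reported as AFFECTED; it only becomes AFFECTED
 once an interactive task has waited longer than the threshold.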

 and:

 {{{
 * Fri, 23 Apr 2010 16:15:16 +0200

 1) Hmm, I think it may be good to report two values: queue status as
 AUTOMATIC or MANUAL, and health status as NORMAL, STRESSED, or AFFECTED.
 So that we can print several combinations (e.g. queue status MANUAL,
 health status NORMAL).  This may help to avoid any misunderstanding
 about the reported values.

 2) Alternatively, if we keep only one output value, then this may be
 perfectly enough, but we should rather use somewhat progressive values
 to express queue health status in one term, to avoid misunderstandings.
 For example:

   * NORMAL (auto mode, few jobs waiting, load under threshold)
   * STRESSED (auto mode, many jobs waiting, load above threshold)
   * STOPPED (manual mode, but no long waiting jobs, and all timestamps OK)
   * AFFECTED (manual mode, but some waiting jobs, or timestamps problem)
   * ERROR (manual mode, some non-ack-ed tasks)

 In both cases, we should document the meanings in the admin guide.
 }}}
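
 The single-value variant from point 2 could be sketched roughly as
 follows.  The function name and its boolean parameters are hypothetical.
 The sketch also makes two assumptions that the list above leaves open:
 in auto mode either condition (many jobs waiting, or load above
 threshold) is enough for STRESSED, and non-ack-ed tasks are only
 examined in manual mode.

```python
def queue_health(auto_mode, many_jobs_waiting=False,
                 load_above_threshold=False, long_waiting_jobs=False,
                 timestamps_ok=True, unacked_errors=False):
    """Map the queue state to one of the five progressive status terms."""
    if auto_mode:
        # Assumption: either condition alone is enough for STRESSED.
        if many_jobs_waiting or load_above_threshold:
            return "STRESSED"
        return "NORMAL"
    # Manual mode from here on; the most severe condition wins.
    if unacked_errors:
        return "ERROR"
    if long_waiting_jobs or not timestamps_ok:
        return "AFFECTED"
    return "STOPPED"


# The two-value report from point 1 can be derived from the same inputs.
# Folding STOPPED into NORMAL and ERROR into AFFECTED is an assumption.
def queue_report(auto_mode, **conditions):
    """Return (queue status, health status) as in point 1."""
    health = queue_health(auto_mode, **conditions)
    two_value = {"STOPPED": "NORMAL", "ERROR": "AFFECTED"}
    return ("AUTOMATIC" if auto_mode else "MANUAL",
            two_value.get(health, health))
```

 For instance, queue_report(False) gives ("MANUAL", "NORMAL"), which is
 the combination mentioned in point 1.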

 The use of MOTD would be a perfect addition.  However, if there are real
 errors, and an admin is working on them and sets up a MOTD to inform other
 admins of the situation, then I think the bibsched queue health status
 should still be reported as ERROR or AFFECTED, since this is what the real
 status of the problem is.  So some care has to be taken with the various
 queue status/health combinations.  Perhaps we may need a new MOTD status
 to express voluntary queue interventions.

-- 
Ticket URL: <http://invenio-software.org/ticket/912#comment:1>
Invenio <http://invenio-software.org>
