#912: bibsched queue status report improvement
--------------------------+-----------------
Reporter: skaplun | Owner:
Type: enhancement | Status: new
Priority: major | Milestone:
Component: BibSched | Version:
Resolution: | Keywords:
--------------------------+-----------------
Comment (by simko):
Here are some old musings of mine on this topic, taken from a few emails
I sent, in case they are useful for drafting the specs:
{{{
* Wed, 21 Apr 2010 15:46:22 +0200
I'd say AFFECTED = MANUAL + some interactive non-daemon tasks that have
been waiting for execution for more than CFG_BIBSCHED_AFFECTED_THRESHOLD
minutes. By default this could be set to 10 minutes or so. (And yes,
ERROR would naturally lead to AFFECTED.)
For example, for INSPIRE, there is only one bibupload job per day coming
from SPIRES. So the queue can stay in MANUAL mode all day long while
everything is still OK on the server side; the system is not affected
(provided that the records were indexed and webcoll'ed). So, if we want
to be very precise in our system health reporting, then we should also
check MAX(bibrec.modification_date) and compare it with the last-updated
timestamps of bibindex's global index and of webcoll. (The ranking
timestamps are not crucial.)
}}}
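The AFFECTED rule from the first email can be sketched as a small Python check. This is only a sketch under stated assumptions, not BibSched's actual code: the `Task` record, its `is_daemon` and `queued_at` fields, the status strings, and the function names are all hypothetical, and treating any ERROR task as AFFECTED is just one reading of "ERROR would naturally lead to AFFECTED".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

# Hypothetical default, in minutes ("10 minutes or so" in the email above).
CFG_BIBSCHED_AFFECTED_THRESHOLD = 10


@dataclass
class Task:
    """Hypothetical, simplified view of a row in the BibSched task queue."""
    name: str
    is_daemon: bool       # assumed flag for periodic, daemon-like tasks
    status: str           # e.g. "WAITING", "RUNNING", "ERROR"
    queued_at: datetime   # assumed time at which the task became due


def timestamps_ok(max_record_modification: datetime,
                  bibindex_global_updated: datetime,
                  webcoll_updated: datetime) -> bool:
    """True if the global index and webcoll both ran after the last record change."""
    return (bibindex_global_updated >= max_record_modification
            and webcoll_updated >= max_record_modification)


def is_affected(queue_mode: str, tasks: Iterable[Task], now: datetime,
                indexes_up_to_date: bool = True,
                threshold_minutes: int = CFG_BIBSCHED_AFFECTED_THRESHOLD) -> bool:
    """AFFECTED = MANUAL + non-daemon tasks waiting longer than the threshold."""
    tasks = list(tasks)
    # Assumption: any task in ERROR makes the queue AFFECTED.
    if any(t.status == "ERROR" for t in tasks):
        return True
    if queue_mode != "MANUAL":
        return False
    limit = timedelta(minutes=threshold_minutes)
    long_waiting = any(
        not t.is_daemon and t.status == "WAITING" and now - t.queued_at > limit
        for t in tasks
    )
    # In MANUAL mode, stale index/webcoll timestamps also count as affected.
    return long_waiting or not indexes_up_to_date
```

Passing the result of `timestamps_ok(...)` as `indexes_up_to_date` combines the waiting-job rule with the stricter timestamp check, so a MANUAL queue holding only the daily bibupload job is reported as affected only once records are waiting unindexed.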
and, from a later email:
{{{
* Fri, 23 Apr 2010 16:15:16 +0200
1) Hmm, I think it may be good to report two values: queue status as
AUTOMATIC or MANUAL, and health status as NORMAL, STRESSED, or AFFECTED.
That way we can print several combinations (e.g. queue status MANUAL,
health status NORMAL), which may help to avoid any misunderstanding
about the reported values.
2) Alternatively, keeping only one output value may be perfectly enough,
but then we should use progressively graded values that express the
queue health status in one term, to avoid misunderstandings.
For example:
* NORMAL (auto mode, few jobs waiting, load under threshold)
* STRESSED (auto mode, many jobs waiting, load above threshold)
* STOPPED (manual mode, but no long waiting jobs, and all timestamps OK)
* AFFECTED (manual mode, but some waiting jobs, or timestamps problem)
* ERROR (manual mode, some non-ack-ed tasks)
In both cases, we should document the meanings in the admin guide.
}}}
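Both reporting options from the second email can be sketched side by side in Python. This is an illustrative sketch only: the enum and function names, the `MAX_WAITING_JOBS` and `MAX_LOAD` thresholds, treating "many jobs waiting" OR "load above threshold" as STRESSED, and mapping ERROR onto AFFECTED in the two-value form are all assumptions, not decisions taken in this ticket.

```python
from enum import Enum
from typing import Tuple

# Hypothetical thresholds; real values would need to be configurable.
MAX_WAITING_JOBS = 20
MAX_LOAD = 4.0


class QueueStatus(Enum):
    AUTOMATIC = "AUTOMATIC"
    MANUAL = "MANUAL"


class Health(Enum):
    NORMAL = "NORMAL"
    STRESSED = "STRESSED"
    AFFECTED = "AFFECTED"


class Overall(Enum):
    """Option 2: one progressive value; a higher number means worse."""
    NORMAL = 0
    STRESSED = 1
    STOPPED = 2
    AFFECTED = 3
    ERROR = 4


def overall_status(mode: QueueStatus, n_waiting: int, load: float,
                   long_waiting: bool, timestamps_ok: bool,
                   unacked_errors: bool) -> Overall:
    # Non-acknowledged errors dominate everything else.
    if unacked_errors:
        return Overall.ERROR
    if mode is QueueStatus.AUTOMATIC:
        # Assumption: many waiting jobs OR high load is enough for STRESSED.
        if n_waiting > MAX_WAITING_JOBS or load > MAX_LOAD:
            return Overall.STRESSED
        return Overall.NORMAL
    # MANUAL mode: stopped cleanly unless jobs pile up or timestamps lag.
    if long_waiting or not timestamps_ok:
        return Overall.AFFECTED
    return Overall.STOPPED


def two_value_report(mode: QueueStatus, overall: Overall) -> Tuple[QueueStatus, Health]:
    """Option 1: report queue status and health status separately."""
    health = {
        Overall.NORMAL: Health.NORMAL,
        Overall.STRESSED: Health.STRESSED,
        Overall.STOPPED: Health.NORMAL,    # e.g. queue MANUAL, health NORMAL
        Overall.AFFECTED: Health.AFFECTED,
        Overall.ERROR: Health.AFFECTED,    # assumption: no separate ERROR health
    }[overall]
    return mode, health
```

Deriving option 1 from option 2 keeps a single source of truth, so the admin guide only has to document one classification plus the mapping.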
The use of MOTD would be a perfect addition. However, if there are real
errors, and an admin is working on them and sets up a MOTD to inform other
admins of the situation, then I think the bibsched queue health status
should still be reported as ERROR or AFFECTED, since that is the real
status of the problem. So some care has to be taken with the various
queue status/health combinations.
Perhaps we need a new MOTD status to express voluntary queue
interventions.
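The rule that a MOTD must never mask a real problem could look roughly like the Python sketch below. It is hypothetical throughout: the `report_line` function, the output format, and the "INTERVENTION" MOTD status (standing in for the new status suggested above) are illustrative assumptions. The MOTD text is only appended, and the intervention label is shown only when the queue is cleanly STOPPED.

```python
from typing import Optional

# Progressive health values, as proposed in the second email.
SEVERITY = ["NORMAL", "STRESSED", "STOPPED", "AFFECTED", "ERROR"]


def report_line(health: str, motd: Optional[str] = None,
                motd_status: Optional[str] = None) -> str:
    """Format the status line; the MOTD never downgrades ERROR or AFFECTED."""
    if health not in SEVERITY:
        raise ValueError(f"unknown health status: {health}")
    line = f"BibSched status: {health}"
    # Hypothetical "INTERVENTION" MOTD status: it only annotates a clean stop.
    if health == "STOPPED" and motd_status == "INTERVENTION":
        line += " (voluntary intervention)"
    if motd:
        line += f" | MOTD: {motd}"
    return line
```

For example, `report_line("ERROR", "Fixing bibupload", "INTERVENTION")` still starts with `BibSched status: ERROR`, so other admins see the real state next to the explanatory message.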
--
Ticket URL: <http://invenio-software.org/ticket/912#comment:1>
Invenio <http://invenio-software.org>