#856: BibSched: tasks not halting the queue on failure
------------------------+-----------------------
Reporter: jlavik | Owner:
Type: enhancement | Status: assigned
Priority: critical | Component: BibSched
Version: | Resolution:
Keywords: |
------------------------+-----------------------
Changes (by simko):
* status: in_merge => assigned
Comment:
This feature is clearly very good to have. However, instead of
treating errors from some tasks as blocking the queue and errors from
other tasks as non-blocking, I think it is better to look at this
problem from an error-specific point of view rather than a
task-specific one. This is because certain types of errors may need
queue blockage and may occur for any bibtask, say when the DB is down.
Borrowing terminology from the Lisp world, errors can be roughly
classified into two types: "fatal errors" that would stop the queue,
and "continuable errors" that would not stop the queue for other
fellow tasks.
Say that BibIndex cannot index certain records due to a UTF-8 bug. It
would then emit a continuable error (CERROR), which would not prevent
other waiting tasks such as BibRank from being launched by the
BibSched daemon. But when BibIndex is woken up next time, it should
refuse to run anew, because last time it ended up in the CERROR state.
Say that BibIndex cannot index certain records because the DB is down
or the disk is full. It would then emit a fatal error (ERROR) that
should cause the queue to stop. There should be no need for the
BibSched daemon to also wake up BibRank and other waiting tasks, only
to discover that they crash in their turn.
Thus, if a continuable error occurs, only tasks of the same nature
would refuse to continue, while others can go. If a fatal error
occurs, everything stops.
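To make the rule above concrete, here is a minimal sketch of the
decision the BibSched daemon would take before launching tasks. It is
not the actual BibSched code: the Status enum, the runnable_tasks
function and the (task name, status) queue layout are all hypothetical
names chosen for illustration. Only the ERROR and CERROR state names
come from this proposal.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status set; only ERROR and CERROR come from the proposal.
    WAITING = "WAITING"
    DONE = "DONE"
    CERROR = "CERROR"  # continuable: blocks only tasks of the same kind
    ERROR = "ERROR"    # fatal: blocks the whole queue


def runnable_tasks(queue):
    """Return names of waiting tasks that may be launched.

    `queue` is a list of (task_name, Status) pairs, oldest first.
    """
    # A single fatal error anywhere in the queue stops everything.
    if any(status is Status.ERROR for _, status in queue):
        return []
    # A continuable error blocks only later runs of the same task.
    blocked = {name for name, status in queue if status is Status.CERROR}
    return [
        name
        for name, status in queue
        if status is Status.WAITING and name not in blocked
    ]
```

With a queue holding a BibIndex run in CERROR, a waiting BibIndex and
a waiting BibRank, only BibRank would be launched. With any task in
ERROR, nothing would be launched.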
Applying this point of view to the current code base, we would need
to introduce a new continuable error type CERROR, while our current
ERROR would remain the fatal error. Most of our bibtasks could then be
changed to emit CERROR almost everywhere, except in places such as
BibUpload and friends. So, instead of white-listing certain tasks
such as refextract for continuable errors, as this patch does, I think
we could be more aggressive and start changing ERROR into CERROR for
most tasks, kind of like black-listing certain error situations while
white-listing most others. We can progressively change all the tasks
to distinguish between emitting ERROR and CERROR, as time permits. My
main concern here is that we should start by treating these cases on a
per-error basis, not on a per-task basis.
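A per-error classification along these lines could look roughly like
the sketch below. It assumes a small black list of fatal situations
(DB down, disk full) and treats everything else as continuable. The
DatabaseDown exception, the FATAL_ERRNOS set and the final_status
helper are hypothetical names, not existing bibtask APIs.

```python
import errno

# Hypothetical black list of OS-level conditions that should halt the queue.
FATAL_ERRNOS = {errno.ENOSPC}  # "No space left on device", i.e. disk full


class DatabaseDown(Exception):
    """Hypothetical stand-in for a DB connection failure."""


def final_status(exc):
    """Map an exception raised inside a bibtask to ERROR or CERROR."""
    if isinstance(exc, DatabaseDown):
        return "ERROR"
    if isinstance(exc, OSError) and exc.errno in FATAL_ERRNOS:
        return "ERROR"
    # Everything not black-listed is continuable by default.
    return "CERROR"
```

Here a UTF-8 decoding failure would map to CERROR, while a full disk
or an unreachable DB would map to ERROR, regardless of which bibtask
raised it.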
Please tell me what you think.
--
Ticket URL: <http://invenio-software.org/ticket/856#comment:5>
Invenio <http://invenio-software.org>