#856: BibSched: tasks not halting the queue on failure
------------------------+-----------------------
Reporter:  jlavik       |       Owner:
    Type:  enhancement  |      Status:  assigned
Priority:  critical     |   Component:  BibSched
 Version:               |  Resolution:
Keywords:               |
------------------------+-----------------------
Changes (by simko):

 * status:  in_merge => assigned


Comment:

 This feature is clearly good to have.  However, instead of treating
 errors from some tasks as blocking the queue and errors from other
 tasks as non-blocking, I think it is better to look at this problem
 from an error-specific point of view rather than a task-specific
 one.  This is because certain types of errors may require blocking
 the queue and may occur in any bibtask, say when the DB is down.

 Borrowing terminology from the Lisp world, errors can be roughly
 classified into two types: "fatal errors" that would stop the whole
 queue, and "continuable errors" that would not stop the queue for
 other tasks.

 Say that BibIndex cannot index certain records due to a UTF-8 bug.
 It would then emit a continuable error (CERROR), which would not
 prevent other waiting tasks such as BibRank from being launched by
 the BibSched daemon.  But when BibIndex is woken up next time, it
 should refuse to run again, because its last run ended in the CERROR
 state.

 Say that BibIndex cannot index certain records because the DB is down
 or the disk is full.  It would then emit a fatal error (ERROR) that
 should cause the queue to stop.  There is no point in the BibSched
 daemon also waking up BibRank and other waiting tasks, only for them
 to crash in turn.

 Thus, if a continuable error occurs, only tasks of the same nature
 would refuse to continue, while others can go.  If a fatal error
 occurs, everything stops.
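
 The rule above could be sketched roughly as follows.  This is only an
 illustration, not BibSched code: the function `may_run`, the status
 dictionary and the `DONE` status name are assumptions made up for the
 example, and CERROR is the status proposed in this comment.

```python
# Illustrative status names; CERROR is the proposed continuable error.
ERROR = "ERROR"
CERROR = "CERROR"
DONE = "DONE"


def may_run(task_name, last_status_by_task):
    """Decide whether the scheduler may launch task_name.

    last_status_by_task maps a task name (e.g. "bibindex") to the
    status of its most recent run.  Hypothetical helper, not part of
    the existing code base.
    """
    # A fatal error in any task halts the whole queue.
    if ERROR in last_status_by_task.values():
        return False
    # A continuable error blocks only the task of the same kind.
    return last_status_by_task.get(task_name) != CERROR
```

 With `{"bibindex": CERROR, "bibrank": DONE}`, BibRank may still be
 launched while BibIndex refuses to run; as soon as any task is in
 ERROR, nothing is launched.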

 Applying this point of view to the current code base, we would need
 to introduce a new continuable error type, CERROR, while our current
 ERROR would remain the fatal error.  Most of our bibtasks could then
 be changed to emit CERROR almost everywhere, except in places such as
 BibUpload and friends.  So, instead of white-listing certain tasks
 such as refextract for continuable errors, as this patch does, I
 think we could be more aggressive and start changing ERROR into
 CERROR for most tasks: in effect, black-listing certain error
 situations while white-listing most others.  We can progressively
 change all the tasks to distinguish between emitting ERROR and
 CERROR, as time permits.  My main concern is that we should start by
 treating these cases on a per-error basis, not on a per-task basis.
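
 To make the per-error idea concrete, here is a minimal sketch of a
 classifier that black-lists a few fatal situations and treats
 everything else as continuable.  The names `classify` and
 `DatabaseDownError` are hypothetical, and detecting "disk full" via
 `errno.ENOSPC` is an assumption chosen for the example; the real set
 of fatal situations would have to be decided task by task.

```python
import errno


class DatabaseDownError(Exception):
    """Hypothetical exception standing in for a lost DB connection."""


def classify(exc):
    """Map an exception to the status a task would emit.

    Black-listed situations yield the fatal "ERROR"; everything else
    yields the continuable "CERROR".  Illustrative only.
    """
    # DB down: no other task can do useful work either.
    if isinstance(exc, DatabaseDownError):
        return "ERROR"
    # Disk full: assumed to surface as OSError with errno ENOSPC.
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        return "ERROR"
    # Anything else (e.g. a UTF-8 decoding problem) is task-local.
    return "CERROR"
```

 Any task could call such a function in its top-level exception
 handler, so the fatal/continuable decision depends on the error and
 not on which task raised it.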

 Please tell me what you think.

-- 
Ticket URL: <http://invenio-software.org/ticket/856#comment:5>
Invenio <http://invenio-software.org>
