#853: oaiharvest: better handling of remote OAI sources timing out
-------------------------+-----------------
 Reporter:  jcaffaro     |      Owner:
     Type:  enhancement  |     Status:  new
 Priority:  major        |  Milestone:
Component:  BibHarvest   |    Version:
 Keywords:  oaiharvest   |
-------------------------+-----------------
 Task #511 improved the handling of exceptions thrown when remote sources
 are not available. It also added ''retries'' so that harvesting can still
 succeed when remote sources reply with error messages. However, sources
 can still time out without being retried:
 {{{
 2011-11-23 18:01:26 --> source arXiv is going to be updated
 2011-11-23 18:02:06 --> an error occurred while harvesting from source
 arXiv:
 An error occured when trying to read response from export.arxiv.org: timed
 out
 }}}

 The timeout is probably reported by {{{socket.error}}} (for example at
 
source:modules/bibharvest/lib/oai_harvest_getter.py@55a26d516ec820a5905d9a506419e4215e05573a#L291
 as well as on lines 276/285?).

 It would be nice if timeouts were handled like HTTP errors (i.e. if they
 triggered further attempts to harvest).
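
 A minimal sketch of the requested behaviour, not the actual code of
 oai_harvest_getter.py: the helper name {{{fetch_with_retries}}} and its
 parameters are hypothetical, and {{{fetch}}} stands in for whatever code
 performs the HTTP request. It relies on {{{socket.timeout}}} being a
 subclass of {{{socket.error}}}, so timeouts and other socket errors are
 retried in the same way.

```python
import socket
import time


def fetch_with_retries(fetch, attempts=3, wait=10, sleep=time.sleep):
    """Call fetch() and retry when it fails with a socket-level error.

    fetch is any zero-argument callable performing the request; it is a
    hypothetical stand-in for the request code in the harvester.
    """
    for i in range(1, attempts + 1):
        try:
            return fetch()
        except socket.error:
            # socket.timeout is a subclass of socket.error, so a timed-out
            # read ends up here as well.
            if i < attempts:
                sleep(wait)
                continue
            # Last attempt failed: let the caller see the error.
            raise
```

 The {{{sleep}}} argument is only there so that the waiting can be replaced
 in tests; the wait of 10 seconds mirrors the value suggested in the
 comments of this ticket.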

 Some comments:
  * It might (?) be as simple as adding some statement like '{{{if i <
 attempt: time.sleep(10); continue}}}' before the '{{{raise}}}' statement
 on line 292.
  * Setting a higher timeout with '{{{socket.settimeout(..)}}}' might help.
 One should be careful with side effects, such as the one described at
 
https://twiki.cern.ch/twiki/bin/view/CDS/PythonGotchas#3_2_Incompatibility_between_SSL,
 so the timeout should be reset after it has been changed. Note that a
 '{{{timeout}}}' parameter was added to the HTTPConnection/HTTPSConnection
 classes in Python 2.6 (probably calling '{{{socket.settimeout(..)}}}'
 behind the scenes).
  * A more substantial refactoring could have the oaiharvest task
 re-submitted once several attempts have failed. The behaviour could be as
 follows: once the maximum number of attempts is reached, if the task runs
 periodically (with the '-s' option), terminate it gracefully (without
 updating the 'lastrun' field) and reschedule it to run 5 minutes later.
 If not handled properly, this would make the daily execution time of the
 task drift slowly. One could think of alternative options to get the
 harvesting postponed/retried. Some might also prefer to keep the current
 behaviour (failing task) or to simply wait for the next regularly
 scheduled execution of the task (for example the next day).
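
 The timeout comment above can be sketched in two ways. This is an
 illustration only: it assumes that the process-wide setting meant in the
 comment is {{{socket.setdefaulttimeout()}}}, and the names
 {{{default_socket_timeout}}} and {{{open_connection}}} are hypothetical.
 The first helper changes the default timeout and always restores the
 previous value; the second uses the per-connection {{{timeout}}} parameter
 available since Python 2.6 and leaves the global default untouched.

```python
import contextlib
import socket

try:
    from http.client import HTTPConnection  # Python 3
except ImportError:
    from httplib import HTTPConnection  # Python 2.6+


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily change the process-wide default socket timeout."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore the old value even if the harvesting code raised, so
        # that unrelated connections are not affected afterwards.
        socket.setdefaulttimeout(previous)


def open_connection(host, seconds):
    """Create a connection object with its own timeout (no global change).

    No network traffic happens here; the socket is only opened when a
    request is actually sent.
    """
    return HTTPConnection(host, timeout=seconds)
```

 The per-connection variant is probably the safer choice where it is
 available, since nothing has to be reset afterwards.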

-- 
Ticket URL: <http://invenio-software.org/ticket/853>
Invenio <http://invenio-software.org>
