#786: refextract: introduce daemon operation mode
--------------------------+----------------------
Reporter: simko | Owner: chayward
Type: enhancement | Status: closed
Priority: major | Milestone:
Component: RefExtract | Version:
Resolution: fixed | Keywords:
--------------------------+----------------------
Changes (by Christopher Hayward <christopher.james.hayward@…>):
* status: in_merge => closed
* resolution: => fixed
Comment:
In [85a4d09546e533f494573f8bb5a64e04e57912be]:
{{{
#!CommitTicketReference repository="" revision="85a4d09546e533f494573f8bb5a64e04e57912be"
refextract: introduce daemon operation mode
* Convert Refextract into a form that allows extraction tasks to be
submitted via Bibtask to Bibsched, while preserving the independent
nature of Refextract. (By default, running Refextract causes it to be
scheduled, but when given a fulltext input [using -f, --fulltext], it
runs in the original standalone mode.)
* Change the method of providing fulltext documents for extraction, so as
to differentiate between running in standalone mode and running as a
scheduled task: -f and --fulltext are now used to denote each single
fulltext document.
* Add two intermediate files: 'refextract_cli.py' and
'refextract_daemon.py' which will handle the execution mode of
Refextract, and the submission of a Refextract task to Bibsched.
* Provide the ability for Refextract to run on specific collections and
records, using the flags -c --collection and -i --recid.
* Provide the ability for Refextract to construct a new scheduled
extraction job, using a predefined 'job configuration file', by
specifying the name of the job to run (using -e, --extraction-job).
Each job corresponds to a matching named job file under /etc/bibedit,
holding the parameters for the job.
* Add functionality to interact with a new db table called xtrJOB, which
holds the id, name, and last_updated information for each job that has
been run (specified using -e, or --extraction-job). Use the last_updated
info to compare against the modification_date of each record; only newly
updated files are scheduled to have their references re-extracted.
* Include an extraction job file (refextract-job-preprints) to act as an
example template.
* Include in refextract_config a list of acceptable job parameters that
may be specified inside a Refextract job description file.
* Change the '-s' flag for controlling the appearance of journal standard
reference form to '-p' so as not to interfere with the sleep cli option
for Bibsched.
* Update Makefile.am to reflect the addition of refextract_daemon and
refextract_cli files, and also the presence of the template extraction
job file.
* Set the refextract-specific bibtask_config.py default values for
recids and collections to empty lists. These are filled with the
location of fulltext documents when starting Refextract inside Bibsched.
* Handle all error messages regardless of the mode that Refextract is
running in. (Short error messages are shown inside the Bibsched interface
under the 'progress' column, and all are sent to the Bibsched log when
Refextract is scheduled. Stdout or stderr are used when running
Refextract as standalone. Stdout and stderr are also used when no xml
file has been specified to hold the extracted references.)
* Update the oai_harvest_daemon to call Refextract using the new fulltext
flag (-f, --fulltext).
* Include the default author kb location, used on ImportError.
* Display an error message and halt in the situation where a user
specifies an extraction-job to run, alongside other cli options or a
path to a fulltext document from which to extract, and other
daemon-specific flags (--collection, --extraction-job).
* Display the full directory in the error message when an extraction-job
config file has not been found.
* Show in --help the three main modes in which to run Refextract.
* Inside extraction-job files, accept either an absolute path or a base
name when referencing report number and journal name knowledge bases.
In the situation where the absolute path is omitted, the daemon falls
back to the Invenio 'etc' directory.
* Add the xtrJOB table description to tabcreate.
(closes #786)
}}}
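
The last_updated comparison described in the commit message above can be sketched roughly as follows. This is only an illustration, not the code from the commit: the function name, the (recid, modification_date) pair shape, the strict "newer than" comparison, and the choice to select every record when a job has never run are all assumptions.

```python
from datetime import datetime


def select_records_to_extract(records, job_last_updated):
    """Return the recids that should have their references re-extracted.

    records: iterable of (recid, modification_date) pairs (hypothetical shape).
    job_last_updated: datetime of the job's previous run, or None if the
        job has no stored last_updated value yet (assumed: select everything).
    """
    if job_last_updated is None:
        # Assumption: a job that has never run processes every record.
        return [recid for recid, _ in records]
    # Keep only records modified strictly after the job's previous run.
    return [recid for recid, modified in records if modified > job_last_updated]


# Hypothetical sample data, for illustration only.
sample_records = [
    (1, datetime(2011, 1, 10)),
    (2, datetime(2011, 3, 5)),
    (3, datetime(2011, 3, 20)),
]
```

With a previous run dated 1 March 2011, only records 2 and 3 in the sample data would be selected; with no previous run, all three would be.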
--
Ticket URL: </ticket/786#comment:4>
Invenio <http://invenio-software.org>