#671: BibUpload: optional use of bibxxx tables
-------------------------+---------------------
Reporter: simko | Owner: bthiell
Type: enhancement | Status: new
Priority: major | Milestone:
Component: BibUpload | Version:
Keywords: |
-------------------------+---------------------
During bibupload, the incoming record is broken up by MARC tag
into many bibxxx tables (`bib10x`, `bib11x`, etc.), which results in
many SQL queries being run by bibupload. The advantage of doing so is
that end users can then simply search in any MARC
tag. The disadvantage is that the uploading step takes time,
and that we are preparing indexes that may not even be used by
the end users at all, since they typically search in logical field
indexes, say `firstauthor:ellis`, not in physical MARC tags, say
`100__a:/ellis/`.
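To make the fan-out concrete, here is a minimal sketch of how the fields of one record spread over several bibxxx tables. It assumes the table name is derived from the first two digits of the MARC tag, as the `bib10x` and `bib11x` examples suggest; the helper names and the sample fields are hypothetical and are not taken from bibupload itself.

```python
from collections import defaultdict


def bibxxx_table_for_tag(tag):
    """Return the bibxxx table for a MARC tag, e.g. '100__a' -> 'bib10x'.

    Assumption: the table is named after the first two digits of the tag.
    """
    return "bib%sx" % tag[:2]


def group_fields_by_table(fields):
    """Group (tag, value) pairs by the bibxxx table they would land in."""
    grouped = defaultdict(list)
    for tag, value in fields:
        grouped[bibxxx_table_for_tag(tag)].append((tag, value))
    return dict(grouped)


# Illustrative fields only, not a real record.
record_fields = [
    ("100__a", "Ellis, J"),
    ("110__a", "CERN"),
    ("245__a", "A title"),
    ("700__a", "Smith, A"),
]
tables = group_fields_by_table(record_fields)
for table in sorted(tables):
    print(table, tables[table])
```

Each table touched means at least one write, so a record with many distinct tags produces many queries.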
In certain situations, it would be better not to create these indexes
at upload time, but to defer handling them to indexing time,
especially when using an external indexer such as Solr for the record
metadata.
For this, it would be good to introduce a new configuration option
called, say, `CFG_BIBUPLOAD_USE_BIBXXX` that would be True by default
but that could optionally be set to False on a per-site basis. When
set to False, stage 4 of bibupload (the filling of the bibxxx tables)
would not be executed.
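A minimal sketch of how such a switch could gate stage 4 follows. Only the option name `CFG_BIBUPLOAD_USE_BIBXXX` comes from this ticket; the function `upload_record`, the `fill_bibxxx` callback and the stage labels are hypothetical placeholders, not the actual bibupload code.

```python
# Proposed option; True by default, overridable per site.
CFG_BIBUPLOAD_USE_BIBXXX = True


def upload_record(record, use_bibxxx=None, fill_bibxxx=None):
    """Run a simplified upload and return the names of the stages executed.

    `fill_bibxxx` stands in for stage 4 (hypothetical callback).
    """
    if use_bibxxx is None:
        use_bibxxx = CFG_BIBUPLOAD_USE_BIBXXX
    stages_run = ["store_marcxml"]  # the MARCXML is always stored
    if use_bibxxx:
        if fill_bibxxx is not None:
            fill_bibxxx(record)
        stages_run.append("fill_bibxxx")  # stage 4
    return stages_run


print(upload_record({"recid": 1}))                    # stage 4 runs
print(upload_record({"recid": 1}, use_bibxxx=False))  # stage 4 skipped
```

Reading the option at call time rather than at import time would let a site change it without touching the upload code.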
This would result in bibupload speed-ups that can be illustrated by
the following example taken from an INSPIRE-sized database (1M
records):
* example record CERN-TH-6002-91 from INSPIRE TEST (record ID 315385)
* timings to replace it, stage 4 enabled:
{{{
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.006    0.006    4.112    4.112  bibupload.py:162(bibupload)
   256    0.003    0.000    4.095    0.016  dbquery.py:141(run_sql)
     1    0.001    0.001    2.632    2.632  bibupload.py:1550(update_database_with_metadata)
   109    0.001    0.000    2.605    0.024  bibupload.py:822(insert_record_bibxxx)
     1    0.000    0.000    1.255    1.255  bibupload.py:1780(delete_bibrec_bibxxx)
}}}
* timings to replace it, stage 4 disabled:
{{{
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.000    0.000    0.020    0.020  bibupload.py:162(bibupload)
    37    0.001    0.000    0.017    0.000  dbquery.py:141(run_sql)
     1    0.000    0.000    0.006    0.006  bibupload.py:1780(delete_bibrec_bibxxx)
}}}
As can be seen, the upload is roughly 200 times faster (0.020 s
versus 4.112 s cumulative), since we are not pre-creating those huge
and possibly unused bibxxx indexes.
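The size of the gain can be read directly off the two profiles above. The short calculation below only restates the `cumtime` and `ncalls` figures for `bibupload` and `run_sql`; the variable names are ours.

```python
# Figures copied from the two profiles (cumtime in seconds).
with_stage4 = {"cumtime": 4.112, "run_sql_calls": 256}
without_stage4 = {"cumtime": 0.020, "run_sql_calls": 37}

speedup = with_stage4["cumtime"] / without_stage4["cumtime"]
queries_saved = with_stage4["run_sql_calls"] - without_stage4["run_sql_calls"]

print("speed-up factor: %.0f" % speedup)
print("SQL queries saved: %d" % queries_saved)
```

This is a single measurement on one record, so the factor should be taken as indicative only.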
**Important note:** while it is simple to introduce such a
`CFG_BIBUPLOAD_USE_BIBXXX` variable for the record uploading process,
the variable must also be propagated to other Invenio modules such as
the searcher and indexer, which should then read record metadata from
the pre-stored MARCXML format (see table `bibfmt`) rather than from
the `bibxxx` tables. When `bibxxx` tables are not in use, other
Invenio modules can no longer rely on their existence, so this task
is really bigger than it may seem. The `CFG_BIBUPLOAD_USE_BIBXXX`
setting should therefore be progressively propagated to all the
Invenio modules that take the existence of `bibxxx` for granted,
starting with the most important ones (indexer, searcher, editor,
check for deleted records, etc.).
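As an illustration of what reading from MARCXML instead of `bibxxx` could look like, here is a minimal sketch that extracts the values of a physical tag such as `100__a` from a MARCXML string. It assumes un-namespaced MARCXML and uses a made-up sample record; the function name `get_field_values` is hypothetical, and fetching the stored MARCXML from `bibfmt` is left out.

```python
import xml.etree.ElementTree as ET

# Made-up sample record for illustration only.
SAMPLE_MARCXML = """<record>
  <controlfield tag="001">1</controlfield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Ellis, J</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="a">Smith, A</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="a">Doe, B</subfield>
  </datafield>
</record>"""


def _norm_indicator(ind):
    """Map a missing or blank indicator to '_', as in '100__a'."""
    return "_" if ind is None or ind.strip() == "" else ind


def get_field_values(marcxml, tag):
    """Return all subfield values for a six-character tag like '100__a'."""
    code, ind1, ind2, subcode = tag[:3], tag[3], tag[4], tag[5]
    values = []
    for datafield in ET.fromstring(marcxml).iter("datafield"):
        if datafield.get("tag") != code:
            continue
        if _norm_indicator(datafield.get("ind1")) != ind1:
            continue
        if _norm_indicator(datafield.get("ind2")) != ind2:
            continue
        for subfield in datafield.findall("subfield"):
            if subfield.get("code") == subcode:
                values.append(subfield.text or "")
    return values


print(get_field_values(SAMPLE_MARCXML, "100__a"))
print(get_field_values(SAMPLE_MARCXML, "700__a"))
```

A module that today queries `bibxxx` could fall back to a helper of this kind when `CFG_BIBUPLOAD_USE_BIBXXX` is False, at the cost of parsing the record on each access.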
--
Ticket URL: <https://invenio-software.org/ticket/671>
Invenio <http://invenio-software.org>