#671: BibUpload: optional use of bibxxx tables
-------------------------+---------------------
 Reporter:  simko        |      Owner:  bthiell
     Type:  enhancement  |     Status:  new
 Priority:  major        |  Milestone:
Component:  BibUpload    |    Version:
 Keywords:               |
-------------------------+---------------------
 During bibupload, the incoming record is broken up according to MARC tags
 into many bibxxx tables (`bib10x`, `bib11x`, etc.), which results in
 many SQL queries being run by bibupload.  The advantage of doing so is
 that end users can then simply search in any MARC tag.  The
 disadvantage is that the uploading step takes time, and that we are
 preparing indexes that may never be used by the end users at all.
 (They typically search in logical field indexes, say
 `firstauthor:ellis`, not in physical MARC tags, say `100__a:/ellis/`.)

 In certain situations, it would be better not to create these indexes
 at upload time, but to defer handling them until indexing time.
 (Especially when using an external indexer such as Solr for the record
 metadata.)

 For this, it would be good to introduce a new configuration option
 called, say, `CFG_BIBUPLOAD_USE_BIBXXX` that would be True by default
 but that could optionally be set to False on a per-site basis.  When
 set to False, stage 4 of bibupload (= filling of the bibxxx tables)
 would not be executed.
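
 The gating described above could look roughly like the following
 minimal sketch.  Only the name `CFG_BIBUPLOAD_USE_BIBXXX` comes from
 this ticket; `upload_record` and `fill_bibxxx_tables` are hypothetical
 stand-ins and are not the real bibupload functions.

```python
# Proposed option from this ticket; True keeps today's behaviour.
CFG_BIBUPLOAD_USE_BIBXXX = True


def fill_bibxxx_tables(record, executed):
    """Hypothetical stand-in for stage 4 (filling of the bibxxx tables)."""
    executed.append("stage 4: fill bibxxx tables")


def upload_record(record, use_bibxxx=None):
    """Hypothetical, heavily simplified upload flow.

    Returns the list of steps that were executed, so that the effect of
    the flag can be observed without a database.
    """
    if use_bibxxx is None:
        # Fall back to the site-wide setting.
        use_bibxxx = CFG_BIBUPLOAD_USE_BIBXXX
    executed = ["store MARCXML"]
    if use_bibxxx:
        fill_bibxxx_tables(record, executed)
    return executed


if __name__ == "__main__":
    print(upload_record({}, use_bibxxx=True))
    print(upload_record({}, use_bibxxx=False))
```

 Passing the flag as an argument here is only a convenience for trying
 the sketch out; in practice the value would presumably be read from
 the site configuration.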

 This would result in bibupload speed-ups that can be illustrated by
 the following example taken from an INSPIRE-sized database (1M
 records):

 * example record CERN-TH-6002-91 from INSPIRE TEST (record ID 315385)

 * timings to replace it, stage 4 enabled:

 {{{
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.006    0.006    4.112    4.112 bibupload.py:162(bibupload)
       256    0.003    0.000    4.095    0.016 dbquery.py:141(run_sql)
         1    0.001    0.001    2.632    2.632 bibupload.py:1550(update_database_with_metadata)
       109    0.001    0.000    2.605    0.024 bibupload.py:822(insert_record_bibxxx)
         1    0.000    0.000    1.255    1.255 bibupload.py:1780(delete_bibrec_bibxxx)
 }}}

 * timings to replace it, stage 4 disabled:

 {{{
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    0.020    0.020 bibupload.py:162(bibupload)
        37    0.001    0.000    0.017    0.000 dbquery.py:141(run_sql)
         1    0.000    0.000    0.006    0.006 bibupload.py:1780(delete_bibrec_bibxxx)
 }}}

 As can be seen, the upload is faster by about two orders of magnitude
 (4.112 s versus 0.020 s cumulative time in `bibupload`), since we are
 not pre-creating those huge and possibly unused bibxxx indexes.
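
 The size of the speed-up can be read off the two profiles directly.
 The short calculation below uses only the cumulative times and
 `run_sql` call counts quoted in the listings; the variable names are
 illustrative.

```python
import math

# Figures copied from the two profiles in this ticket.
cumtime_enabled = 4.112   # seconds, bibupload() with stage 4 enabled
cumtime_disabled = 0.020  # seconds, bibupload() with stage 4 disabled
sql_calls_enabled = 256   # run_sql calls with stage 4 enabled
sql_calls_disabled = 37   # run_sql calls with stage 4 disabled

speedup = cumtime_enabled / cumtime_disabled
orders_of_magnitude = math.log10(speedup)
sql_ratio = sql_calls_enabled / sql_calls_disabled

if __name__ == "__main__":
    print("speed-up factor: %.0f" % speedup)
    print("orders of magnitude: %.1f" % orders_of_magnitude)
    print("run_sql call ratio: %.1f" % sql_ratio)
```

 This is a single-record measurement, so it indicates the scale of the
 gain and is not a benchmark.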

 **Important note:** while it is simple to introduce such a
 `CFG_BIBUPLOAD_USE_BIBXXX` variable for the record uploading process,
 the variable must also be propagated to other Invenio modules such as
 the searcher and indexer, which should then read record metadata from
 the pre-stored MARCXML format (see table `bibfmt`) rather than from
 the `bibxxx` tables.  When `bibxxx` tables are not in use, other
 Invenio modules can no longer rely on their existence.  So this task
 is really bigger than it may seem.  The setting of
 `CFG_BIBUPLOAD_USE_BIBXXX` should therefore be progressively
 propagated to all the Invenio modules that take the existence of
 `bibxxx` for granted, starting with the most important modules
 (indexer, searcher, editor, check for deleted records, etc.).
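
 To illustrate what reading metadata from the stored MARCXML could mean
 for a dependent module, here is a minimal sketch that extracts the
 values of one MARC tag from a MARCXML string using only the Python
 standard library.  The function name `get_field_values`, the sample
 record, and the assumption that the MARCXML has no XML namespace are
 all illustrative; fetching the MARCXML from `bibfmt` is left out.

```python
import xml.etree.ElementTree as ET

# Illustrative record, not real data from any Invenio instance.
SAMPLE_MARCXML = """<record>
  <controlfield tag="001">1</controlfield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Ellis, J</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Example title</subfield>
  </datafield>
</record>"""


def get_field_values(marcxml, tag, ind1=" ", ind2=" ", code="a"):
    """Return all subfield values matching tag/indicators/code.

    Hypothetical helper: assumes un-namespaced MARCXML passed in as a
    string, with blank indicators written as a single space.
    """
    root = ET.fromstring(marcxml)
    values = []
    for datafield in root.iter("datafield"):
        if (datafield.get("tag") == tag
                and datafield.get("ind1", " ") == ind1
                and datafield.get("ind2", " ") == ind2):
            for subfield in datafield.findall("subfield"):
                if subfield.get("code") == code:
                    values.append(subfield.text or "")
    return values


if __name__ == "__main__":
    # Equivalent of looking up the physical tag 100__a.
    print(get_field_values(SAMPLE_MARCXML, "100"))
```

 A module adapted this way would parse the record on demand and would
 not need to query the `bibxxx` tables, at the cost of XML parsing on
 each access.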

-- 
Ticket URL: <https://invenio-software.org/ticket/671>
Invenio <http://invenio-software.org>
