#864: BibUpload: optimise append/correct mode
-------------------------+----------------------
 Reporter:  simko        |      Owner:  vvenkatr
     Type:  enhancement  |     Status:  new
 Priority:  major        |  Milestone:  v1.1
Component:  BibUpload    |    Version:
 Keywords:               |
-------------------------+----------------------
 **Problem description**

 Historically, Invenio records were primarily catalogued in another
 library system from which the records were periodically synchronised
 back into Invenio.  Therefore, BibUpload was originally built with
 //replace// mode in mind, in which the current record in Invenio gets
 completely replaced by the new record version coming in the upload job.
 As a consequence, the //append// and //correct// modes of BibUpload are
 not very well optimised performance-wise.

 The goal of this ticket is to take advantage of the smarter record version
 verifier from ticket:816 in order to optimise append/correct mode.

 **Sample profiling data**

 Here is a concrete example illustrating the inefficiency of the
 append/correct modes of the bibupload operation.

 A sample file uploaded to INSPIRE consisted solely of changes to
 collection identifiers, i.e. the incoming records contained only 970 and
 980 tags:

 {{{
 $ less /afs/cern.ch/project/inspire/confcoll_test.xml
 <collection>
 <record>
   <datafield tag="970" ind1=" " ind2=" "><subfield code="a">SPIRES-7847793</subfield></datafield>
   <datafield tag="980" ind1=" " ind2=" "><subfield code="a">ConferencePaper</subfield></datafield>
   <datafield tag="980" ind1=" " ind2=" "><subfield code="a">Citeable</subfield></datafield>
   <datafield tag="980" ind1=" " ind2=" "><subfield code="a">CORE</subfield></datafield>
 </record>
 <record>
   <datafield tag="970" ind1=" " ind2=" "><subfield code="a">SPIRES-7847807</subfield></datafield>
   <datafield tag="980" ind1=" " ind2=" "><subfield code="a">ConferencePaper</subfield></datafield>
   <datafield tag="980" ind1=" " ind2=" "><subfield code="a">Citeable</subfield></datafield>
   <datafield tag="980" ind1=" " ind2=" "><subfield code="a">CORE</subfield></datafield>
 </record>
 [...]
 }}}

 The upload was done in //correct// mode and took a bit more than one
 second per record, which would clearly not be acceptable should a job
 touch 1M records:

 {{{
 PCUDSSW1506> less /opt/cds-invenio/var/log/bibsched_task_19327.log
 2011-12-08 14:28:57 --> Task #19327 started.
 2011-12-08 14:28:57 --> Input file '/tmp/confcoll_test.xml', input mode 'correct'.
 2011-12-08 14:29:13 --> Record 775166 DONE
 2011-12-08 14:29:13 --> Record 775167 DONE
 2011-12-08 14:29:13 --> Record 775168 DONE
 [...]
 2011-12-08 14:35:19 --> Record 792476 DONE
 2011-12-08 14:35:20 --> Record 792482 DONE
 2011-12-08 14:35:22 --> Record 792490 DONE
 2011-12-08 14:35:22 --> Task stats: 303 input records, 303 updated, 0 inserted, 0 errors, 0 inserted to holding pen.  Time 386.18 sec.
 2011-12-08 14:35:22 --> Task #19327 finished. [DONE]
 }}}

 Here is the profiling data:

 {{{
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.002    0.002  385.160  385.160 bibtask.py:709(_task_run)
         1    0.038    0.038  385.127  385.127 bibupload.py:2037(task_run_core)
       303    0.065    0.000  384.588    1.269 bibupload.py:125(bibupload)
    153403    1.997    0.000  337.916    0.002 dbquery.py:151(run_sql)
    153403    2.956    0.000  331.396    0.002 cursors.py:129(execute)
    153403    0.543    0.000  325.573    0.002 cursors.py:311(_query)
    153403  319.412    0.002  323.033    0.002 cursors.py:273(_do_query)
       303    0.557    0.002  187.090    0.617 bibupload.py:1617(update_database_with_metadata)
     43601    0.350    0.000  117.853    0.003 bibupload.py:890(insert_record_bibrec_bibxxx)
       303    0.057    0.000  103.269    0.341 bibupload.py:1847(delete_bibrec_bibxxx)
       239    0.296    0.001   72.473    0.303 bibupload.py:910(synchronize_8564)
     43601    1.406    0.000   67.998    0.002 bibupload.py:845(insert_record_bibxxx)
      1084    0.005    0.000   63.459    0.059 bibdocfile.py:559(__init__)
      1084    0.104    0.000   63.453    0.059 bibdocfile.py:666(build_bibdoc_list)
     12398    0.362    0.000   62.350    0.005 bibdocfile.py:1391(__init__)
     13242    1.317    0.000   44.293    0.003 bibdocfile.py:2411(_build_file_list)
       845    0.102    0.000   41.395    0.049 bibupload.py:923(merge_marc_into_bibdocfile)
       239    0.007    0.000   30.741    0.129 bibupload.py:990(get_bibdocfile_managed_info)
     13242    0.043    0.000   17.706    0.001 bibdocfile.py:3189(__init__)
     13242   17.603    0.001   17.663    0.001 bibdocfile.py:3222(load)
       606    0.166    0.000   10.233    0.017 bibupload.py:1552(update_bibfmt_format)
     13242    9.384    0.001    9.384    0.001 posixpath.py:168(exists)
       844    0.010    0.000    8.485    0.010 bibdocfile.py:2074(set_description)
     24797    1.522    0.000    7.490    0.000 bibdocfile.py:2631(__init__)
       844    0.005    0.000    6.876    0.008 bibdocfile.py:1624(touch)
       606    0.008    0.000    6.291    0.010 search_engine.py:3949(get_record)
     12398    0.099    0.000    5.853    0.000 bibdocfile.py:2505(_build_related_file_list)
    153403    1.848    0.000    3.412    0.000 cursors.py:107(_do_get_result)
       102    0.080    0.001    3.208    0.031 bibupload.py:823(insert_bibfmt)
    153403    0.400    0.000    2.932    0.000 cursors.py:57(__del__)
       303    0.006    0.000    2.706    0.009 bibupload.py:1596(archive_marcxml_for_history)
    153403    0.460    0.000    2.533    0.000 cursors.py:62(close)
    153403    0.536    0.000    2.494    0.000 connections.py:217(cursor)
       204    0.007    0.000    2.250    0.011 search_engine.py:3961(print_record)
     24797    2.112    0.000    2.112    0.000 posixpath.py:137(getsize)
    153403    0.850    0.000    2.073    0.000 cursors.py:87(nextset)
    153403    0.602    0.000    1.997    0.000 cursors.py:316(_post_get_result)
    153403    1.958    0.000    1.958    0.000 cursors.py:40(__init__)
    148132    0.850    0.000    1.503    0.000 connections.py:236(literal)
    153403    0.678    0.000    1.394    0.000 cursors.py:282(_fetch_row)
    227277    0.852    0.000    1.387    0.000 cursors.py:338(fetchall)
    153403    1.143    0.000    1.322    0.000 cursors.py:309(_get_result)
     49594    0.554    0.000    1.287    0.000 urlutils.py:375(create_url)
 [...]
 }}}

 One can see that, even though we are modifying only information stored
 in MARC tag 980 -- i.e. we would basically have to modify only the
 `bibfmt` table (due to the new MARCXML) and the `bibrec`,
 `bibrec_bib98x`, and `bib98x` tables (due to the new 980 values) -- the
 system spends a lot of time in functions like `delete_bibrec_bibxxx()`
 and `synchronize_8564()`.
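
 For clarity, the `bibxxx` naming convention assumed above works roughly
 as follows (a hypothetical helper, shown only to illustrate the
 tag-to-table mapping; it is not part of the current code):

 {{{
 def get_bibxxx_table_names(tag):
     """Return the (bibxxx, bibrec_bibxxx) table pair storing a MARC tag.

     Illustrative only: e.g. tag '980' is stored in table bib98x and
     linked to records via bibrec_bib98x.
     """
     prefix = tag[0:2]  # e.g. '980' -> '98'
     return ("bib%sx" % prefix, "bibrec_bib%sx" % prefix)
 }}}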

 The problem is that `delete_bibrec_bibxxx()` is currently called in the
 same way for any input file, so it destroys all `bibxxx` information for
 the record and has to reconstruct everything from scratch.  (This is
 suitable for //replace// mode, but not for //correct// mode, since we
 are unnecessarily touching MARC tags that do not change in the given
 input file.)

 **Solution**

 The basic idea is to change the workflow so that only tables that need to
 be updated are updated.

 One possible solution is to keep the current logic of the bibupload
 "steps" but, as a first internal step of bibuploading a record, run a
 check to see which fields will be changed, and store the result in an
 internal variable, say `affected_tags`.  This variable will then be
 passed along wherever needed.  E.g. `delete_bibrec_bibxxx()` will have a
 new optional argument named `affected_tags` saying which tags will be
 affected and hence which `bibxxx` tables are to be pruned.  (By default,
 everything should be pruned, to keep the current behaviour.)  The same
 optional argument `affected_tags` should be carried over to the later
 uploading steps, e.g. when reconstructing the `bibxxx` tables.  (This is
 similar to how the value of `now` is carried over in ticket:816 in order
 to ensure the same record revision timestamps everywhere.)
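
 A minimal sketch of such a check, assuming records are held in the usual
 bibrecord dictionary structure; the helper name `find_affected_tags()`
 and the exact comparison granularity are illustrative only:

 {{{
 def find_affected_tags(original_record, incoming_record):
     """Return the set of MARC tags whose content differs between the
     record currently stored in Invenio and the incoming record.

     Records are assumed to be in the bibrecord dictionary format,
     i.e. {tag: [(subfields, ind1, ind2, controlfield_value,
                  field_position), ...]}.
     """
     affected_tags = set()
     for tag in set(original_record) | set(incoming_record):
         # Compare field lists while ignoring the volatile
         # field-position counter in the last tuple element:
         old_fields = [field[:4] for field in original_record.get(tag, [])]
         new_fields = [field[:4] for field in incoming_record.get(tag, [])]
         if old_fields != new_fields:
             affected_tags.add(tag)
     return affected_tags
 }}}

 `delete_bibrec_bibxxx()` could then honour the new optional argument
 roughly as follows (again only a sketch; the real function also has to
 handle controlfields, reserved tags, etc.):

 {{{
 from invenio.dbquery import run_sql

 def delete_bibrec_bibxxx(record, id_bibrec, affected_tags=None):
     """Delete the bibrec_bibxxx links of the given record.

     If affected_tags is None, prune everything (the hitherto
     behaviour, suitable for replace mode); otherwise prune only the
     tables corresponding to the affected tags.
     """
     tags = record.keys() if affected_tags is None else affected_tags
     # Tags sharing the first two digits live in the same table pair,
     # e.g. 980 values sit in bib98x, linked via bibrec_bib98x:
     for table_suffix in set(tag[0:2] for tag in tags):
         run_sql("DELETE FROM bibrec_bib%sx WHERE id_bibrec=%%s"
                 % table_suffix, (id_bibrec,))
 }}}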

 Alternatively, we could change the logic of the bibupload "steps" so as
 to perform them in a different order, but this would require more
 changes to the core code.

 This will gain a lot of time for input files like the one mentioned
 above, since only a small number of `bibxxx` tables will need to be
 touched, leading to far fewer SQL queries to perform, etc.

 **P.S.**

 After the introduction of the smarter record-revision-aware uploader
 from ticket:816, all upload jobs coming in //replace// mode that bear
 the record revision version in MARC tag 005 will be internally converted
 to //correct// mode first.  The motivation for this conversion was to
 optimise possible conflict resolution when multiple cataloguers work on
 different parts of the same record simultaneously.  As a by-product,
 these upload jobs will also be processed faster, so this issue will gain
 even higher importance and urgency once ticket:816 is put into
 production.

 As usual, an extensive test case suite should be written to cover all the
 various use cases, including FFT updates.
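
 For instance, a first unit test along these lines could look as follows
 (a sketch only, reusing the hypothetical `find_affected_tags()` helper
 sketched above):

 {{{
 import unittest

 class AffectedTagsTest(unittest.TestCase):
     """Illustrative test for the hypothetical find_affected_tags()."""

     def test_only_changed_tag_is_reported(self):
         # Two records differing only in their 980 field values:
         old_rec = {'970': [([('a', 'SPIRES-7847793')], ' ', ' ', '', 1)],
                    '980': [([('a', 'ConferencePaper')], ' ', ' ', '', 2)]}
         new_rec = {'970': [([('a', 'SPIRES-7847793')], ' ', ' ', '', 1)],
                    '980': [([('a', 'CORE')], ' ', ' ', '', 2)]}
         self.assertEqual(find_affected_tags(old_rec, new_rec),
                          set(['980']))

 if __name__ == '__main__':
     unittest.main()
 }}}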

-- 
Ticket URL: <http://invenio-software.org/ticket/864>
Invenio <http://invenio-software.org>
