#864: BibUpload: optimise append/correct mode
-------------------------+----------------------
Reporter: simko | Owner: vvenkatr
Type: enhancement | Status: new
Priority: major | Milestone: v1.1
Component: BibUpload | Version:
Keywords: |
-------------------------+----------------------
**Problem description**
Historically, Invenio records were primarily catalogued in another library
system from which they were periodically synchronised back into Invenio.
Therefore, BibUpload was originally built with //replace// mode in mind,
in which the current record in Invenio gets completely replaced by the new
record version coming in the upload job. As a result, the //append// and
//correct// modes of BibUpload are not very optimised performance-wise.
The goal of this ticket is to take advantage of the smarter record version
verifier from ticket:816 in order to optimise append/correct mode.
**Sample profiling data**
Here is a concrete example illustrating the inefficiency of the
append/correct modes of a bibupload operation.
A sample file uploaded to INSPIRE consisted of changes solely to
collection identifiers, i.e. the incoming records contained only 970 and
980 tags:
{{{
$ less /afs/cern.ch/project/inspire/confcoll_test.xml
<collection>
  <record>
    <datafield tag="970" ind1=" " ind2=" "><subfield code="a">SPIRES-7847793</subfield></datafield>
    <datafield tag="980" ind1=" " ind2=" "><subfield code="a">ConferencePaper</subfield></datafield>
    <datafield tag="980" ind1=" " ind2=" "><subfield code="a">Citeable</subfield></datafield>
    <datafield tag="980" ind1=" " ind2=" "><subfield code="a">CORE</subfield></datafield>
  </record>
  <record>
    <datafield tag="970" ind1=" " ind2=" "><subfield code="a">SPIRES-7847807</subfield></datafield>
    <datafield tag="980" ind1=" " ind2=" "><subfield code="a">ConferencePaper</subfield></datafield>
    <datafield tag="980" ind1=" " ind2=" "><subfield code="a">Citeable</subfield></datafield>
    <datafield tag="980" ind1=" " ind2=" "><subfield code="a">CORE</subfield></datafield>
  </record>
[...]
}}}
The upload was done in //correct// mode and took a bit more than one
second per record, which would clearly not be acceptable should a job
touch 1M records:
{{{
PCUDSSW1506> less /opt/cds-invenio/var/log/bibsched_task_19327.log
2011-12-08 14:28:57 --> Task #19327 started.
2011-12-08 14:28:57 --> Input file '/tmp/confcoll_test.xml', input mode 'correct'.
2011-12-08 14:29:13 --> Record 775166 DONE
2011-12-08 14:29:13 --> Record 775167 DONE
2011-12-08 14:29:13 --> Record 775168 DONE
[...]
2011-12-08 14:35:19 --> Record 792476 DONE
2011-12-08 14:35:20 --> Record 792482 DONE
2011-12-08 14:35:22 --> Record 792490 DONE
2011-12-08 14:35:22 --> Task stats: 303 input records, 303 updated, 0 inserted, 0 errors, 0 inserted to holding pen. Time 386.18 sec.
2011-12-08 14:35:22 --> Task #19327 finished. [DONE]
}}}
Here is the profiling data:
{{{
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1    0.002    0.002  385.160  385.160  bibtask.py:709(_task_run)
      1    0.038    0.038  385.127  385.127  bibupload.py:2037(task_run_core)
    303    0.065    0.000  384.588    1.269  bibupload.py:125(bibupload)
 153403    1.997    0.000  337.916    0.002  dbquery.py:151(run_sql)
 153403    2.956    0.000  331.396    0.002  cursors.py:129(execute)
 153403    0.543    0.000  325.573    0.002  cursors.py:311(_query)
 153403  319.412    0.002  323.033    0.002  cursors.py:273(_do_query)
    303    0.557    0.002  187.090    0.617  bibupload.py:1617(update_database_with_metadata)
  43601    0.350    0.000  117.853    0.003  bibupload.py:890(insert_record_bibrec_bibxxx)
    303    0.057    0.000  103.269    0.341  bibupload.py:1847(delete_bibrec_bibxxx)
    239    0.296    0.001   72.473    0.303  bibupload.py:910(synchronize_8564)
  43601    1.406    0.000   67.998    0.002  bibupload.py:845(insert_record_bibxxx)
   1084    0.005    0.000   63.459    0.059  bibdocfile.py:559(__init__)
   1084    0.104    0.000   63.453    0.059  bibdocfile.py:666(build_bibdoc_list)
  12398    0.362    0.000   62.350    0.005  bibdocfile.py:1391(__init__)
  13242    1.317    0.000   44.293    0.003  bibdocfile.py:2411(_build_file_list)
    845    0.102    0.000   41.395    0.049  bibupload.py:923(merge_marc_into_bibdocfile)
    239    0.007    0.000   30.741    0.129  bibupload.py:990(get_bibdocfile_managed_info)
  13242    0.043    0.000   17.706    0.001  bibdocfile.py:3189(__init__)
  13242   17.603    0.001   17.663    0.001  bibdocfile.py:3222(load)
    606    0.166    0.000   10.233    0.017  bibupload.py:1552(update_bibfmt_format)
  13242    9.384    0.001    9.384    0.001  posixpath.py:168(exists)
    844    0.010    0.000    8.485    0.010  bibdocfile.py:2074(set_description)
  24797    1.522    0.000    7.490    0.000  bibdocfile.py:2631(__init__)
    844    0.005    0.000    6.876    0.008  bibdocfile.py:1624(touch)
    606    0.008    0.000    6.291    0.010  search_engine.py:3949(get_record)
  12398    0.099    0.000    5.853    0.000  bibdocfile.py:2505(_build_related_file_list)
 153403    1.848    0.000    3.412    0.000  cursors.py:107(_do_get_result)
    102    0.080    0.001    3.208    0.031  bibupload.py:823(insert_bibfmt)
 153403    0.400    0.000    2.932    0.000  cursors.py:57(__del__)
    303    0.006    0.000    2.706    0.009  bibupload.py:1596(archive_marcxml_for_history)
 153403    0.460    0.000    2.533    0.000  cursors.py:62(close)
 153403    0.536    0.000    2.494    0.000  connections.py:217(cursor)
    204    0.007    0.000    2.250    0.011  search_engine.py:3961(print_record)
  24797    2.112    0.000    2.112    0.000  posixpath.py:137(getsize)
 153403    0.850    0.000    2.073    0.000  cursors.py:87(nextset)
 153403    0.602    0.000    1.997    0.000  cursors.py:316(_post_get_result)
 153403    1.958    0.000    1.958    0.000  cursors.py:40(__init__)
 148132    0.850    0.000    1.503    0.000  connections.py:236(literal)
 153403    0.678    0.000    1.394    0.000  cursors.py:282(_fetch_row)
 227277    0.852    0.000    1.387    0.000  cursors.py:338(fetchall)
 153403    1.143    0.000    1.322    0.000  cursors.py:309(_get_result)
  49594    0.554    0.000    1.287    0.000  urlutils.py:375(create_url)
[...]
}}}
One can see that even though we are modifying only information stored in
MARC tag 980 -- i.e. we would basically have to modify only the `bibfmt`
table (due to the new MARCXML) and the `bibrec`, `bibrec_bib98x` and
`bib98x` tables (due to the new 980 values) -- the system spends a lot of
time in functions like `delete_bibrec_bibxxx()` and `synchronize_8564()`.
The problem is that `delete_bibrec_bibxxx()` is currently called in the
same way for any input file, so it destroys all `bibxxx` information and
has to reconstruct everything from scratch. (This is suitable for
//replace// mode, but not for //correct// mode, since we unnecessarily
touch MARC tags that do not change in the given input file.)
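To make the intended scope concrete: `bibxxx` table names follow the first
two digits of the MARC tag, so for the sample input above only two table
pairs would need to be touched. A small illustrative helper (hypothetical,
not existing bibupload code) demonstrates the mapping:
{{{
#!python
def bibxxx_tables_for(tags):
    """Return the (bibxxx, bibrec_bibxxx) table pairs storing the given
    MARC tags; e.g. tag '980' lives in bib98x and bibrec_bib98x."""
    return sorted(set(("bib%sx" % tag[:2], "bibrec_bib%sx" % tag[:2])
                      for tag in tags))

# For the sample input above, only two table pairs are concerned:
print bibxxx_tables_for(["970", "980"])
# [('bib97x', 'bibrec_bib97x'), ('bib98x', 'bibrec_bib98x')]
}}}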
**Solution**
The basic idea is to change the workflow so that only the tables that need
to be updated are updated.
One possible solution is to keep the current logic of the bibupload
"steps" but, as a first internal step of bibuploading a record, to run a
check to see which fields will be changed, and to store the result in an
internal variable, say `affected_tags`. This variable will then be passed
around as needed. E.g. `delete_bibrec_bibxxx()` will get a new optional
argument named `affected_tags` saying which tags will be affected and
hence which `bibxxx` tables are to be pruned. (By default, everything
should be pruned, to keep the hitherto behaviour.) The same optional
argument `affected_tags` should be carried over to the later uploading
steps, e.g. when reconstructing the `bibxxx` tables. (This is similar to
how the meaning of `now` is carried over in order to ensure the same
record revision timestamps everywhere in ticket:816.)
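For illustration, here is a minimal sketch of this idea. It is hedged:
`find_affected_tags()` is a hypothetical helper, the `delete_bibrec_bibxxx()`
signature shown is illustrative rather than the actual one in bibupload.py,
and records are simplified tag-keyed dictionaries rather than full
bibrecord structures:
{{{
#!python
from invenio.dbquery import run_sql

def find_affected_tags(original_record, incoming_record):
    """Return the set of MARC tags whose content differs between the
    record currently in the database and the incoming record.
    (Sketch: records are assumed to be dictionaries keyed by tag.)"""
    all_tags = set(original_record.keys()) | set(incoming_record.keys())
    return set(tag for tag in all_tags
               if original_record.get(tag) != incoming_record.get(tag))

def delete_bibrec_bibxxx(record, id_bibrec, affected_tags=None):
    """Prune the bibrec_bibxxx links of the given record, but only in
    the tables storing tags listed in affected_tags.  When affected_tags
    is None, prune everything, keeping the hitherto behaviour."""
    if affected_tags is None:
        affected_tags = record.keys()  # old behaviour: touch all tables
    # MARC tag "980" is stored in bib98x/bibrec_bib98x, etc., so only
    # the table pairs matching the first two digits of each affected
    # tag need to be pruned:
    for digits in set(tag[:2] for tag in affected_tags):
        run_sql("DELETE FROM bibrec_bib%sx WHERE id_bibrec=%%s" % digits,
                (id_bibrec,))
}}}
The same `affected_tags` value would then be reused by the reconstruction
steps, so that only the pruned tables get repopulated.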
Alternatively, we could change the logic of the bibupload "steps" so as to
perform them in a different order, but this would require more changes to
the core code.
This will save a lot of time for input files like the one mentioned above,
since only a small number of `bibxxx` tables will need to be touched,
leading to far fewer SQL queries to perform, etc.
**P.S.**
After introduction of smarter record-revision-aware uploader from
ticket:816, all upload jobs coming in //replace// mode that bear the
record revision version in MARC tag 005 will be internally converted to
//correct// mode first. The motivation for this conversion was to optimise
possible conflict resolution when multiple cataloguers work on different
parts of the same record simultaneously. But, as a by product, these
upload jobs will also get processed faster, so this issue will have an
even higher importance and urgency, once ticket:816 is put to production.
As usual, an extensive test case suite should be written to cover all the
various use cases, including FFT updates.
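For instance, here is a hedged sketch of one such test, exercising the
hypothetical `find_affected_tags()` helper from the sketch above on the
same simplified record structures:
{{{
#!python
import unittest
# Hypothetical import: find_affected_tags() as sketched in the Solution
# section above, not an existing bibupload function.
from invenio.bibupload import find_affected_tags

class AffectedTagsTest(unittest.TestCase):
    """Sketch only: the real suite should also cover append mode, FFT
    updates and the default prune-everything behaviour."""

    def test_only_changed_tags_are_reported(self):
        original = {'100': ['Ellis, J'],
                    '980': ['ConferencePaper', 'Citeable']}
        incoming = {'100': ['Ellis, J'],
                    '980': ['ConferencePaper', 'Citeable', 'CORE']}
        # Only tag 980 differs, so only bib98x tables should be pruned:
        self.assertEqual(find_affected_tags(original, incoming),
                         set(['980']))

if __name__ == '__main__':
    unittest.main()
}}}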
--
Ticket URL: <http://invenio-software.org/ticket/864>
Invenio <http://invenio-software.org>