Hello Henning,

>>> If it's only about serving the bibliographic data I'd go
>>> for OAI-PMH as it makes a smaller footprint and scales
>>> way better than dumping. Just set your master to expose
>>> the collection as OAI-PMH and harvest it from your slave
>>> periodically. Depending on the nature of your project
>>> you'd most likely want to have the OAI-Server anyway for
>>> visibility reasons.
>>
>> Thank you for the idea; since I’m not familiar with OAI,
>> I didn’t consider that possibility.
>
> Usually, it should offer the most flexibility, as it works through
> the HTTP protocol and all the stuff is quite high-level. Say you
> move on and drop MySQL when going to the next-next version; OAI
> would still work. We even migrated one of our instances from a
> proprietary solution to MARC using OAI exports.
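
(For reference, Invenio serves OAI-PMH at /oai2d, so incremental
harvesting from the slave boils down to periodic requests along the
lines of

    http://your-master/oai2d?verb=ListRecords&metadataPrefix=marcxml&from=2013-11-01

where your-master and the date are placeholders.)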

I do think, like Alexander, that it is better to go via dumping and
reloading MARC records rather than using lower-level MySQL tools.

However, OAI has been, at least to me, somewhat involved, especially
when it comes to understanding the corner cases and their
consequences.  Moreover, in your case, where there is exactly one
master and one slave (or, maybe in the future, more than one slave),
you may find simpler solutions easier to understand and use.

The Invenio search API
(http://invenio-demo.cern.ch/help/hacking/search-engine-api) is great
in many respects: it offers many possibilities and it mirrors the URL
search parameters.  Among other things, it allows you to get the new
and modified records from the database.  At UAB I do a daily dump of
the bibliographic records to an SQLite database, for offline batch
postprocessing.  For this, I just get the new or modified records and
update the database.  I have made a minimal modification to the
script, taking out some internal stuff, but it passes a quick test.
You may find it useful as an example of how to extract new, modified
and deleted records from your master site and upload them (via
bibupload -ri) to the slave site(s).  Excuse that it may not be as
polished as it should be; again, I use it for internal purposes.
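
For instance, getting the records modified since a given date mirrors
the corresponding /search URL (the date here is made up):

    from invenio.search_engine import perform_request_search

    # the same query as /search?dt=m&d1y=2013&d1m=11&d1d=1&of=id
    recids = perform_request_search(dt='m', d1y=2013, d1m=11, d1d=1)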

I understand that in newer Invenio releases it is possible to
bibupload records in text MARC ('tm') format.  In the version of the
script I attach I haven't been able to test that, as in 1.1.1 it
cannot be done, so in the print_record call I use the 'xm' parameter;
you can probably change it to 'tm'.
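
If you want to try it, the change would be a single line in update_db
(untested on my side, as I said):

    marc = print_record(recid, 'tm')  # instead of 'xm'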

Hope it helps,

Ferran
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Time-stamp: <2013.11.25 09:36:56 marcdump.py [email protected]>

from __future__ import print_function, division

import os
import sys
import time
import sqlite3

sys.path.append(os.path.expanduser('~/lib/python/'))
from invenio.search_engine import perform_request_search, \
    search_pattern, print_record

max_records = 0  # for testing: if non-zero, stop after this many records

def create_db(dbname):
    sql = '''
create table records (
       recid integer primary key,
      record varchar
);'''
    db = sqlite3.connect(dbname)
    db.execute(sql)
    db.close()


def update_db(db, since):
    sql = '''replace into records values (?, ?);'''
    gmtime = time.gmtime(since)
    (year, month, day) = gmtime[:3]
    recids = perform_request_search(dt='m', d1y=year, d1m=month, d1d=day)
    recids.reverse()
    n = 0
    for recid in recids:
        n += 1
        if max_records and n > max_records:
            break
        marc = print_record(recid, 'xm')  # or 'tm' in newer releases
        time.sleep(0.001)  # brief pause between records
        values = (recid, unicode(marc, 'utf-8'))
        db.execute(sql, values)
    db.commit()

    # In Invenio, deleted records keep their recid and are marked with
    # 980__c DELETED; replacing them on the slave propagates the deletion.
    deleted_record_fmt = '''
<record>
  <controlfield tag="001">%(recid)s</controlfield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="c">DELETED</subfield>
  </datafield>
</record>'''
    deleted_recids = search_pattern(p='deleted', f='980').tolist()
    for recid in deleted_recids:
        marc = deleted_record_fmt % {'recid': recid}
        values = (recid, unicode(marc, 'utf-8'))
        db.execute(sql, values)
    db.commit()


def dump_db(db):
    sql = '''
    select recid, record
      from records
  order by recid desc;'''
    cursor = db.cursor()
    for row in cursor.execute(sql):
        print(row[1].encode('utf-8'))
        print()


def main():
    if len(sys.argv) < 2:
        print('usage: %s database.db' % (sys.argv[0]))
        sys.exit(1)
    dbname = sys.argv[1]
    if os.path.isfile(dbname):
        mtime = os.path.getmtime(dbname)
    else:
        create_db(dbname)
        mtime = 0
    db = sqlite3.connect(dbname)
    update_db(db, mtime) # TODO: split in two operations depending on params
    dump_db(db)
    db.close()


if __name__ == '__main__':
    main()
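
To use it, something along these lines should do (file and host names
are made up; note that, as the TODO above says, for now it dumps the
whole accumulated set, not only the latest changes):

    # on the master, e.g. daily from cron
    python marcdump.py ~/dumps/records.db > /tmp/records.xml
    scp /tmp/records.xml slave:/tmp/records.xml
    # on the slave
    bibupload -ri /tmp/records.xml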
