Hi,

I'm going to try to talk my way through the current implementation of materialized views in Evergreen, specifically the one for simple record extracts that hangs off of metabib.full_rec, and then posit an alternative. In my head, it should result in no change in general performance, and some improvement in performance in some cases (the marc2bre/direct_ingest/pg_loader dance).

Currently on an insert or update to metabib.full_rec, zzz_update_materialized_simple_record_tgr is called, which materializes the changes from the full record to the simple record extracts. This makes importing a very large set or sets of bib data incredibly slow.
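For reference, the shape of that arrangement is roughly this (a sketch only -- the actual trigger and function definitions live in Evergreen's schema and differ in detail; the procedure name below is illustrative):

```sql
-- Sketch of the current arrangement; the real Evergreen definitions differ.
-- Every row-level change to the full record fires the materialization
-- trigger, which re-derives the simple record extract for that row.
CREATE TRIGGER zzz_update_materialized_simple_record_tgr
    AFTER INSERT OR UPDATE ON metabib.full_rec
    FOR EACH ROW EXECUTE PROCEDURE refresh_simple_record();  -- name illustrative
```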

The solution as discussed briefly on this list yesterday, and in more detail last month (http://list.georgialibraries.org/pipermail/open-ils-dev/2008-July/003265.html), is to use disable_materialized_simple_record_trigger() and enable_materialized_simple_record_trigger(), which:

- remove the trigger
- truncate the materialized view table (that is, empty it), refresh the data, and replace the trigger

respectively.
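In practice the bulk-load dance looks something like this (I'm assuming these functions live in the reporter schema; adjust the schema qualification if your install differs):

```sql
-- Disable per-row materialization before a large import...
SELECT reporter.disable_materialized_simple_record_trigger();

-- ... run the marc2bre / direct_ingest / pg_loader imports here ...

-- ... then truncate, rebuild, and reinstate the trigger in one go.
SELECT reporter.enable_materialized_simple_record_trigger();
```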

For continually growing datasets, however, this means the amount of data to rematerialize on reactivating the view is proportional not to the size of the dataset just added, but to the entire dataset, including everything previously materialized. For very large datasets, the computational cost of that rebuild could be significant; in addition, the functionality backed by this data is unavailable from the moment the table is truncated until the rebuild finishes (*).

My proposal is to make the materialization process a little lazier. Or to quote Montgomery Scott, chief engineer of the Enterprise, to lock the rematerialization subroutine directly into the transporter's pattern buffer.

First, let's add a staleness bit/boolean to metabib.full_rec. The staleness bit gets set to true if the row is inserted or updated while changes to the simple record view are not being materialized -- that is, while disable_materialized_simple_record_trigger() is in effect. On re-enabling the trigger, instead of truncating the entire table and rebuilding from scratch, rebuild only those entries that have the staleness bit set.
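A rough sketch of what that could look like -- all names here (the column, the flagging function, and the per-row rebuild helper) are mine for illustration, not committed code:

```sql
-- Hypothetical sketch of the lazy rebuild; all names are illustrative.
ALTER TABLE metabib.full_rec
    ADD COLUMN stale BOOL NOT NULL DEFAULT FALSE;

-- While materialization is disabled, swap in a lightweight trigger
-- function that only flags the row instead of rebuilding the extract:
CREATE OR REPLACE FUNCTION mark_full_rec_stale() RETURNS TRIGGER AS $$
BEGIN
    NEW.stale := TRUE;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- On re-enable, rematerialize only the flagged rows, then clear the flags.
-- (rebuild_simple_record() stands in for whatever per-row refresh logic
-- the real enable function uses.)
SELECT rebuild_simple_record(id) FROM metabib.full_rec WHERE stale;
UPDATE metabib.full_rec SET stale = FALSE WHERE stale;
```

The win is that the re-enable step touches only the rows changed during the import window, rather than the whole table.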

If we're feeling particularly irreverent, we could call the column in the database that says whether something is waiting to be rematerialized 'in_pattern_buffer'.

OK, so I just re-enabled the trigger on a dataset of about 920,000 records, and it took only 15 minutes -- I had imagined it would take longer. Nonetheless, I am philosophically opposed to doing work I don't think I have to, so I'm putting this idea out to gather moss.

Thoughts?

~B

(*) I guess the data would stay available if you wrapped the whole thing in one big transaction, but the whole reason I'm thinking about this is that I disabled the trigger outside a transaction, because I had multiple imports to do. I'm not quite sure how transaction costs work in PostgreSQL. Do I lose any performance by having a single "gigantically enormous" transaction instead of a bunch of "just pretty big" ones?

======================================
Brandon W. Uhlman, Systems Consultant
Public Library Services Branch
Ministry of Education
Government of British Columbia
605 Robson Street, 5th Floor
Vancouver, BC  V6B 5J3

Phone: (604) 660-2972
E-mail: [EMAIL PROTECTED]
        [EMAIL PROTECTED]
