On Thu, Aug 7, 2008 at 7:09 PM, Brandon W. Uhlman <[EMAIL PROTECTED]> wrote:

> Hi,

[snip]

> My proposal is to make the materialization process a little lazier. Or, to
> quote Montgomery Scott, chief engineer of the Enterprise, to lock the
> rematerialization subroutine directly into the transporter's pattern buffer.
>
> First, let's add a staleness bit/boolean to metabib.full_rec. The staleness
> bit gets set to true if the row is updated or inserted while changes to the
> simple record view are not being materialized -- that is, because
> disable_materialized_simple_record_trigger() has been called. On re-enabling
> the trigger, instead of truncating the entire table and rebuilding from
> scratch, rebuild only those entries which have the staleness bit set.
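[For concreteness, here is a rough sketch of the shape the proposal above would take. The column name `in_pattern_buffer` is borrowed from later in this thread; everything else (the trigger-body fragment, the `rebuild_simple_record()` helper) is purely illustrative and not Evergreen's actual schema.]

```sql
-- Hypothetical sketch only; names are illustrative, not the real schema.
ALTER TABLE metabib.full_rec
    ADD COLUMN in_pattern_buffer BOOL NOT NULL DEFAULT FALSE;

-- Inside the row-level trigger function, while materialization is
-- disabled, mark the row stale instead of refreshing the mat-view:
--   NEW.in_pattern_buffer := TRUE;
--   RETURN NEW;

-- On re-enable, rebuild only the stale rows instead of TRUNCATE + full
-- rebuild (rebuild_simple_record() is a hypothetical helper):
--   SELECT rebuild_simple_record(record)
--     FROM metabib.full_rec WHERE in_pattern_buffer;
--   UPDATE metabib.full_rec
--      SET in_pattern_buffer = FALSE
--    WHERE in_pattern_buffer;
```

Note that the final UPDATE is exactly the second write per row that the reply below objects to.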
This concerns me because it will (at least) double the number of writes
that have to occur to metabib.full_rec: once for the initial insert/update,
and once more to clear the flag. It's already the biggest table in there.

For those situations where the number of records is small enough that
disabling the trigger is not warranted (I recently imported about 60k
records in a couple of minutes with the trigger in place -- yes, I know, it
would have been 30s or so, plus refreshing of the mat-view, without the
trigger -- and that was acceptable in my mind), this will cause a
slow-bleed bloat of the table or, when autovacuum is in use, a very sparse
and unordered population of the table, which will end up making access
slower because more disk pages will have to be read for any record-ordered
queries (such as, it turns out, population of the mat-view, among certain
other things).

At the other end of the spectrum, you have the ginormous-insert case, which
is already going to take a while, presents the same bloating issue
(arguably worse, because it invalidates a huge block in the middle of the
table, preventing tail truncation at vacuum time), and is not going to
significantly speed up the regeneration. In fact, it will end up slowing
regeneration down, because of the updates back to metabib.full_rec that
would have to unset the flag (and cause the "big hole in the middle"
bloat).

> If we're feeling particularly irreverent, we could call the column in the
> database that says whether something is waiting to be rematerialized
> 'in_pattern_buffer'.
>
> Ok, so I just re-enabled the trigger on a dataset of about 920,000 records,
> and it only took 15 minutes. I imagined it would take longer. Nonetheless, I
> am philosophically opposed to doing work I don't think I have to, so I'm
> putting this idea out to gather moss.

Heh ... well, I understand what you mean, but with Dan's recent addition to
automatically use the de/enable stored procs you don't have to do anything
extra.
I'd estimate that direct_ingest alone took about 6 CPU-hours on that
dataset, so IMO the simpler (and less maintenance-causing) setup we have
now is the better option. It's a tradeoff between the rare extra time spent
waiting for the mat-view to rebuild and incurring a double-write penalty
for /every/ insert and update to the metabib.full_rec table, including
during normal cataloging.

FWIW, and as another data point, I recently did a test import of 1.3M
records with a hand-modified parallel_pg_loader output file that included
the dis/enable calls inside the transaction (along with some wrapping DROP
INDEX and CREATE INDEX calls), and it took about 30 minutes, IIRC. How
often would one import such a dataset? Well, for me it's pretty often,
actually (because, well, that's a lot of what ESI does ;) ), but for a
single production installation, relatively rarely.

> Thoughts?
>
> ~B
>
> (*) I guess it would if you wrapped the whole thing in one big transaction,
> but the whole reason I'm thinking of this is that I disabled the trigger
> outside a transaction, because I had multiple imports to do. I'm not quite
> sure how transactions work in PostgreSQL. Do I cost myself any performance
> by having a single "gigantically enormous" transaction, instead of a bunch
> of "just pretty big" ones?

As long as there are fewer than a billion command IDs inside the
transaction (not impossible if you were importing, say, 50-100M records at
once), it's less overall overhead, and there's less chance of running into
a problem with data half-loaded that would require dropping everything. The
worst case would be a ROLLBACK followed by a VACUUM FULL.

--
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [EMAIL PROTECTED]
 | web: http://www.esilibrary.com
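[Editor's sketch of the single-transaction loader shape described above, since Brandon asked how the pieces fit together. The dis/enable function names come from this thread; the index name and the elided bulk-load statements are placeholders, not the actual parallel_pg_loader output.]

```sql
-- Sketch of a hand-modified parallel_pg_loader script, per the thread.
-- Index name is a placeholder; the real script drops/recreates the
-- actual indexes on the tables being loaded.
BEGIN;

SELECT disable_materialized_simple_record_trigger();
DROP INDEX metabib.some_full_rec_idx;          -- placeholder name

-- ... bulk COPY / INSERT statements emitted by parallel_pg_loader ...

CREATE INDEX some_full_rec_idx
    ON metabib.full_rec (record);              -- placeholder definition
SELECT enable_materialized_simple_record_trigger();

COMMIT;
```

Keeping everything in one transaction means a failure mid-load rolls the whole import back cleanly, at the cost noted above: a very large ROLLBACK may warrant a VACUUM FULL afterward.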
