Thanks, guys! This is all really helpful. I have the update method written, so now I'm just refining the tests and adding some utility methods to turn Ruby pseudo-objects back into JSON objects.
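
For anyone curious, here is a rough sketch of the kind of utility method I mean. The helper name and the OpenStruct-style pseudo-object are just for illustration, not actual Open Library client code:

require 'json'
require 'ostruct'

# Rough sketch: turn a Ruby pseudo-object back into a JSON string.
def pseudo_object_to_json(obj)
  hash =
    if obj.respond_to?(:to_h)
      obj.to_h
    else
      # Fall back to reading instance variables off a plain Ruby object.
      obj.instance_variables.each_with_object({}) do |var, h|
        h[var.to_s.delete('@')] = obj.instance_variable_get(var)
      end
    end
  JSON.generate(hash)
end

book = OpenStruct.new(key: '/books/OL123M', title: 'Example Title',
                      type: { 'key' => '/type/edition' })
puts pseudo_object_to_json(book)
# => {"key":"/books/OL123M","title":"Example Title","type":{"key":"/type/edition"}}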
On Fri, Mar 15, 2013 at 7:43 AM, Lee Passey <[email protected]> wrote:
> On 3/14/2013 7:17 PM, John Shutt wrote:
> > From Ben's bot, I think I have the answer to my main question: you
> > need to send back /complete/ Open Library objects when saving, not
> > just partial objects with the modified fields. Is that correct?
>
> It might be helpful to understand how the OL archive is actually
> implemented.
>
> While OL data is technically stored in a relational database,
> practically it is not. The JSON object serialization that you get as
> the result of a query is what is actually stored in the database. When
> a "record" is updated, no modifications are actually made to the
> existing record; instead, a new record is created with the new data
> serialized as a JSON object and stored as a BLOB (more accurately a
> TLOB) in a single field of the database record. The new record has the
> same OLID but a new time/date stamp, so if you collect all the records
> with the same OLID you can determine the "current" record by looking
> at the timestamp.
>
> As a consequence of this design, there is no defined database
> schema--or perhaps it is more accurate to say that each and every
> record has its own schema, which may or may not be similar to the
> schema of some other record. When OL decides to change the data stored
> for any particular record, the JSON object reflects that change, but
> there is no modification to any previously stored object. Thus, the OL
> archive is full of all sorts of deprecated data, and some newer
> records contain data that older records do not. This is not a problem
> if your only goal is to present "one web page per book," but it does
> make reuse of the data problematic for anything other than a single
> presentation for human viewing.
>
> This also explains why searching for changes can fail if performed too
> soon after an update: the design requires an indexing method external
> to the DBMS implementation. OL uses SOLR for this purpose. To
> completely reindex the archive you must read each record in the
> archive, parse the JSON object to create name/value pairs, then add
> each of these values to the stand-alone index. My experiments a few
> years ago demonstrated that on older hardware this process required a
> couple of days to complete. Of course, the process can be optimized by
> doing incremental updates, where only those records are indexed which
> are new since the last time the indexing software was run; but this
> could also lead to false positives when the "current" record no longer
> contains a term that the index had previously recorded.
>
> My experiments also demonstrated that archive performance was just as
> good, if not better, when the JSON TLOBs were simply stored as files
> in the file system instead of as records in a database.
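
A minimal sketch of the versioning model Lee describes, in case it helps anyone else picture it: all rows sharing an OLID are versions of one object, and the newest timestamp wins. The Row struct and its field names are assumptions for illustration, not the actual OL column names.

require 'json'
require 'time'

# Every save appends a new row with the same OLID and a newer timestamp,
# so the "current" record is simply the newest one.
Row = Struct.new(:olid, :created, :json)

def current_record(rows, olid)
  rows.select { |r| r.olid == olid }
      .max_by { |r| Time.parse(r.created) }
end

rows = [
  Row.new('OL123M', '2013-03-01T10:00:00Z', '{"title":"Old Title"}'),
  Row.new('OL123M', '2013-03-14T19:17:00Z', '{"title":"New Title"}')
]
puts JSON.parse(current_record(rows, 'OL123M').json)['title']  # => New Title

This is also why a save has to carry the complete object: each new row fully replaces the previous serialization rather than patching individual fields.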
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
