On 3/14/2013 7:17 PM, John Shutt wrote:
> From Ben's bot, I think I have the answer to my main question: You need
> to send back /complete/ Open Library objects when saving, not just
> partial objects with the modified fields. Is that correct?

It might be helpful to understand how the OL archive is actually 
implemented.

While OL data is technically stored in a relational database, in 
practice it is not used relationally. The JSON object serialization 
that you get as the result of a query is what is actually stored in 
the database. When a "record" is updated, no modification is made to 
the existing record; instead, a new record is created, with the new 
data serialized as a JSON object and stored as a BLOB (more 
accurately, a TLOB) in a single field of the database record. The new 
record has the same OLID but a new time/date stamp, so if you collect 
all the records with the same OLID you can determine the "current" 
record by looking at the timestamps.
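
To make that concrete, here is a minimal sketch of the scheme (this is 
an illustration, not OL's actual code; the table and column names are 
invented):

    import json, sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE thing (
        olid    TEXT,   -- shared by all revisions of one object
        created TEXT,   -- time/date stamp of this revision
        data    TEXT    -- the whole record, serialized as JSON
    )""")

    def save(olid, record):
        # An "update" never touches existing rows; it appends a new one.
        db.execute("INSERT INTO thing VALUES (?, datetime('now'), ?)",
                   (olid, json.dumps(record)))

    def current(olid):
        # The current record is simply the row with the latest timestamp.
        row = db.execute("""SELECT data FROM thing WHERE olid = ?
                            ORDER BY created DESC LIMIT 1""",
                         (olid,)).fetchone()
        return json.loads(row[0]) if row else None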

As a consequence of this design, there is no defined database 
schema--or perhaps it is more accurate to say that each record has its 
own schema, which may or may not resemble the schema of any other 
record. When OL decides to change what data is stored for a given kind 
of record, new JSON objects reflect that change, but no previously 
stored object is modified. Thus the OL archive is full of all sorts of 
deprecated data, and some newer records contain data that older 
records do not. This is not a problem if your only goal is to present 
"one web page per book," but it does make the data problematic to 
reuse for anything other than a single presentation intended for human 
viewing.
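
In practice, that means any code that reuses the data has to treat 
every field as optional. A trivial example (the field names here are 
just for illustration):

    def isbns(record):
        # Older records may carry 'isbn_10', newer ones 'isbn_13',
        # some both, some neither -- no field can be assumed present.
        return record.get("isbn_10", []) + record.get("isbn_13", [])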

This also explains why searching for changes can fail if performed too 
soon after an update: the design requires an indexing mechanism 
external to the DBMS. OL uses SOLR for this purpose. To completely 
reindex the archive you must read every record in the archive, parse 
its JSON object into name/value pairs, then add each of those values 
to the stand-alone index. My experiments a few years ago showed that 
on older hardware this process took a couple of days to complete. Of 
course, the process can be optimized with incremental updates, 
indexing only those records added since the last time the indexing 
software was run; but this can also lead to false positives, when the 
"current" record no longer contains a term that the index recorded 
from an earlier revision.
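
A sketch of what such an incremental pass looks like (the archive and 
index objects here are hypothetical stand-ins, not OL's actual 
indexing code):

    import json

    def flatten(obj, prefix=""):
        # Reduce a (possibly nested) JSON object to name/value pairs
        # suitable for a stand-alone index.
        for name, value in obj.items():
            key = prefix + name
            if isinstance(value, dict):
                yield from flatten(value, key + ".")
            elif isinstance(value, list):
                for item in value:
                    if isinstance(item, dict):
                        yield from flatten(item, key + ".")
                    else:
                        yield key, item
            else:
                yield key, value

    def incremental_reindex(archive, index, last_run):
        for olid, created, data in archive.records_since(last_run):
            for name, value in flatten(json.loads(data)):
                index.add(olid, name, value)
        # Note what is missing: nothing removes the entries contributed
        # by an OLID's older revisions, so a term dropped from the
        # current record can still match -- the false positives above.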

My experiments also demonstrated that archive performance was just as 
good, if not better, when the JSON TLOBs were simply stored as files 
in the file system instead of as records in a database.
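
A sketch of such a file-per-revision layout (an illustration of the 
idea, not necessarily how my tests were arranged):

    import glob, json, os, time

    def save(root, olid, record):
        # One directory per OLID, one JSON file per revision, named
        # by a zero-padded timestamp so the newest sorts last.
        path = os.path.join(root, olid)
        os.makedirs(path, exist_ok=True)
        name = "%018.6f.json" % time.time()
        with open(os.path.join(path, name), "w") as f:
            json.dump(record, f)

    def current(root, olid):
        files = sorted(glob.glob(os.path.join(root, olid, "*.json")))
        if not files:
            return None
        with open(files[-1]) as f:
            return json.load(f)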