On Wed, 29 Jun 2011, Piotr Praczyk wrote:
> If we wanted to have all meta-data in MARC records, and records for
> all entities, we would not need any major extensions to the BibDoc
> infrastructure, on which I am working since a while. We would also
> not need anything like partial METS support.

I think nobody proposed storing all the object metadata in MARC. File objects live naturally outside of the MARC domain, and it is better done this way. For cataloguing purposes, some of a file object's metadata may indeed be stored in MARC, e.g. to ease its maintenance, its exporting, etc. But some of this information will not live in MARC at all, rather with the file object itself. It would not be good to force this kind of information into MARC. It is precisely our goal not to put such information into MARC when it does not make sense, but rather to simply make use of it, e.g. to enable easy indexing of file object information stored in the moreinfo, to enable reaching this information for output purposes, etc.

If file objects are attached to records, then this emulates a bit what we are already doing with full-text files and plot contexts. Not many changes to the overall storage architecture would be required here, except for generalising the framework a little bit. Notably, (i) moving the moreinfo file property store into more record-object and object-object relationships that you have already started looking into; and (ii) having easily configured indexing of any derived information from any place, be it a DB table, an attached file property, related files, and whatnot, which Sam mentioned under the concept of `derived fields'. (The latter generalisation is something we have wanted to do for a long time, but have not got to yet.)

One could think: so far so good, we can cover these needs relatively easily. The novelty here lies in the concept of a file object that is not attached to any MARC record. What we discussed IRL some time ago is that these can live in the bibdoc infrastructure anyway, after extending our input-output channels a bit. From the point of view of the searcher, indexer, displayer and the like, these would be attached to a bibrec entry anyway, even though this entry would be basically empty.

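To make the idea a bit more concrete, here is a minimal sketch in Python. All names in it (`FileObject`, `Record`, `index_terms`, the `file.` prefix) are hypothetical and only for illustration; they are not the actual bibdoc or bibrec API. It only shows how an indexer could treat a record with MARC and an essentially empty record with just an attached file object in the same way.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileObject:
    """Hypothetical stand-in for a bibdoc together with its moreinfo."""
    name: str
    moreinfo: Dict[str, str] = field(default_factory=dict)


@dataclass
class Record:
    """Hypothetical record entry; marc stays empty for standalone objects."""
    recid: int
    marc: Dict[str, str] = field(default_factory=dict)
    files: List[FileObject] = field(default_factory=list)


def index_terms(record: Record) -> Dict[str, List[str]]:
    """Collect indexable values from MARC and from attached file metadata."""
    terms: Dict[str, List[str]] = {}
    for tag, value in record.marc.items():
        terms.setdefault(tag, []).append(value)
    for fobj in record.files:
        # File-level information is indexed directly, without copying it to MARC.
        for key, value in fobj.moreinfo.items():
            terms.setdefault("file." + key, []).append(value)
    return terms


# A file object not described by any MARC: the record behind it is empty.
plot = FileObject("fig1.png", {"caption": "Invariant mass distribution"})
standalone = Record(recid=123, files=[plot])
print(index_terms(standalone))
```

The same `index_terms` call works unchanged for a record that does carry MARC fields, which is the point of keeping a single record space.
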
This would enable us to reuse the other Invenio infrastructure WRT search goodies, collection goodies, curation goodies, display goodies, etc. So we would still have a recID, but it would not have any MARC stuff behind it at all; only the `attached' file object itself. Effectively, `recID' plays the role of `objID' in this case. One could even muse about coming up with an alternative name for `/record', if wanted. Thanks to Peter's work on `/abs', we have good rudiments in this direction.

Note that with the above-mentioned extensions, we would still have a single record space, basically. And if curation/indexing is not needed, then some stuff can live standalone, hidden in the bibdoc part of the infrastructure. This approach is advantageous in order to reuse all the existing Invenio goodies.

The splitting of record space that you alluded to was probably touching our sharding discussion, i.e. if the number of plots or other objects exceeds reasonable limits (say 10M), then these can be hosted on separate instances that would still be aggregated for the front end via a hosted-collections-like facility. From this point of view, it is not necessary to think about distinguishing r123 vs o123 and stuff, because `interesting' objects would still have a `virtual recID', so to speak, even though there would be no MARC attached.

Anyway, this was a quick reaction to this topic, recapping some of the things we mused about IRL in the past. I'll come to the other METS-related thread in the coming days, so I may send some more targeted thoughts about some more concrete points later.

Best regards
--
Tibor Simko

