On Wed, 29 Jun 2011, Piotr Praczyk wrote:
> If we wanted to have all meta-data in MARC records, and records for
> all entities, we would not need any major extensions to the BibDoc
> infrastructure, on which I am working since a while.  We would also
> not need anything like partial METS support.

I think nobody proposed storing all the object metadata in MARC.  File
objects live naturally outside of the MARC domain and it is better done
this way.  For cataloguing purposes, some of the file object's metadata
information may be stored in MARC indeed; e.g. to ease its maintenance,
its exporting, etc.  But some of this information will not live in MARC
at all, but rather with the file object itself.  It would not be good to force
this kind of information into MARC.

It is precisely our goal not to force such information into MARC when it
does not make sense, but rather to simply make use of it, e.g. enable
indexing of file object information stored in the moreinfo in an easy
way, e.g. enable reaching this information for output purposes, etc.  If
file objects are attached to records, then this emulates a bit what we
are already doing with full-text files and plot contexts.  Not many
changes to the overall storage architecture would be required here
indeed, except for generalising the framework a little bit.  Notably,
(i) moving the moreinfo file property store into more record-object and
object-object relationships that you have already started looking into;
and (ii) having easily-configured indexing of any derived information
from any place, be it DB table, attached file property, related files,
and whatnot, that Sam mentioned under the concept of `derived fields'.
(The latter generalisation is something we have wanted to do for a long
time, but have not got to it yet.)
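
To make point (ii) a bit more tangible, here is a minimal sketch of what
easily-configured `derived fields' could look like.  It is only an
illustration: the names MARC, FILE_PROPERTIES, from_marc,
from_file_property, DERIVED_FIELDS and values_to_index are all made up
for this example and are not existing Invenio APIs, and the two
dictionaries merely stand in for the MARC store and the file property
store.

```python
# Hypothetical sketch: none of these names are existing Invenio APIs.
from typing import Callable, Dict, List

# Toy stand-ins for two of the places where information can live.
MARC = {1: {"245__a": "Higgs search results"}}
FILE_PROPERTIES = {1: {"caption": "Invariant mass plot", "format": "png"}}

# An extractor takes a recID and returns the values to be indexed.
Extractor = Callable[[int], List[str]]


def from_marc(tag: str) -> Extractor:
    """Build an extractor reading one MARC tag of the record."""
    def extract(recid: int) -> List[str]:
        fields = MARC.get(recid, {})
        return [fields[tag]] if tag in fields else []
    return extract


def from_file_property(key: str) -> Extractor:
    """Build an extractor reading one property of the attached file."""
    def extract(recid: int) -> List[str]:
        props = FILE_PROPERTIES.get(recid, {})
        return [props[key]] if key in props else []
    return extract


# The configuration: index name -> sources the index is derived from.
DERIVED_FIELDS: Dict[str, List[Extractor]] = {
    "title": [from_marc("245__a")],
    "caption": [from_file_property("caption")],
}


def values_to_index(recid: int) -> Dict[str, List[str]]:
    """Collect, per index, the values found in all configured sources."""
    return {
        name: [value for extract in extractors for value in extract(recid)]
        for name, extractors in DERIVED_FIELDS.items()
    }
```

The indexer only sees the configuration, so adding another source (a DB
table, related files, and so on) would presumably mean writing one more
extractor rather than touching the indexer itself.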

One could think, so far so good, we can cover these needs relatively
easily.  The novelty here lies in the concept of the file object that is
not attached to any MARC record.  What we have discussed IRL some time
ago is that these can live in bibdoc infrastructure anyway, after
extending our input-output channels a bit.  From the point of view of
the searcher, indexer, displayer and the like, these would be attached
to a bibrec entry anyway, even though this entry would be basically
empty.  This would enable us to reuse the other Invenio infrastructure
WRT search goodies and collection goodies and curation goodies and
display goodies etc.  So we would still have a recID, but it would not
have any MARC stuff behind it at all; only `attached' file object
itself.  Effectively, `recID' plays the role of `objID' in this case.
One could even muse about coming up with an alternative name for `/record',
if wanted.  Thanks to Peter's work on `/abs', we have good rudiments in
this direction.
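
As a rough illustration of the idea, the sketch below models a single
record space in which a standalone file object gets a recID with no MARC
behind it.  The classes Record and RecordSpace and their methods are
hypothetical names invented for this example; they do not correspond to
the actual bibrec/bibdoc code.

```python
# Hypothetical model of a single record space; not actual Invenio code.
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Iterable, List, Optional


@dataclass
class Record:
    """One entry in the record space."""
    recid: int
    marc: Optional[Dict[str, str]] = None  # None: no MARC behind this recID
    files: List[str] = field(default_factory=list)


class RecordSpace:
    """Toy registry handing out recIDs from one shared sequence."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.records: Dict[int, Record] = {}

    def _add(self, marc: Optional[Dict[str, str]],
             files: Iterable[str]) -> int:
        recid = next(self._ids)
        self.records[recid] = Record(recid, marc, list(files))
        return recid

    def add_bibliographic(self, marc: Dict[str, str],
                          files: Iterable[str] = ()) -> int:
        """Ordinary record: MARC plus any attached files."""
        return self._add(marc, files)

    def add_standalone_object(self, filename: str) -> int:
        """Basically empty record: the recID plays the role of an objID."""
        return self._add(None, [filename])

    def search_files(self, needle: str) -> List[int]:
        """Search by file name; it does not matter whether MARC exists."""
        return sorted(
            rec.recid for rec in self.records.values()
            if any(needle in name for name in rec.files)
        )
```

In this toy model both kinds of entries come back from the same search
call, which is the point: the searcher, indexer and displayer would not
need to know whether there is any MARC behind a given recID.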

Note that with the above-mentioned extensions, we would still have a
single record space, basically.  And if curation/indexing is not needed,
then some stuff can live standalone hidden in the bibdoc part of the
infrastructure.  This approach is advantageous in order to reuse all the
existing Invenio goodies.  

The splitting of record space that you alluded to was probably touching
our sharding discussion, i.e. if the number of plots or other objects
exceeds reasonable limits (say 10M), then these can be hosted on
separate instances that would still be aggregated via a
hosted-collections-like facility for the front end.  From this point of
view, it is not
necessary to think about distinguishing r123 vs o123 and stuff, because
`interesting' objects would still have a `virtual recID', so to speak,
even though there would be no MARC attached.

Anyway, this was a quick reaction to this topic, recapping some of the
things we mused about IRL in the past.  I'll come to the other
METS-related thread in the coming days, so I may send some more targeted
thoughts about some more concrete points later.

Best regards
-- 
Tibor Simko
