Dear Cristian, Piotr and Sam,

Thanks a lot for this very interesting discussion.
I just want to let you know that the BlogForever EU project's Invenio workshop is taking place these days, and support for METS within Invenio (import/export, as you describe it) is clearly a feature that would be very beneficial for that project as well.

So, in any case, METS support within Invenio is in the pipeline :-)
It would be perfect to design it so that it fits Piotr's figure management project.

Cheers,
  JY






On 06/16/2011 08:31 PM, Cristian Bacchi wrote:
Hello !
I'm happy to share this interest in digital standards, and I fully understand that your ultimate concern is planning the development effort. So I reply to your questions with the sole aim of offering ideas for your concrete study of the Invenio data model.

On Tue, Jun 14, 2011 at 8:08 PM, Piotr Praczyk <piotr.prac...@cern.ch> wrote:

    This is not the use case of figures from scientific publications
    (If I understand correctly, looks rather like digitalisation of
    entire documents), though seems to be relevant for Invenio/Inspire
    in general. Looks like a nice benchmark of the underlying
    data-structures.

OK, I understand. And I agree.

    > Speaking in concrete terms: in my experience, almost every time I saw
    > - descriptive metadata (like MARC) managed on one side, with specific
    >   (multiple) identifiers (also modifiable identifiers, in
    >   collaborative systems),
    > - digital repositories on the other side, with specific (stable!)
    >   identifiers for digital objects and their component files,
    > - and, in the middle, digital metadata (like METS), which guarantees
    >   the connections (regardless of the physical file storage).

    I think, I did not understand this part.

I used this example only to argue that (in the field of book digitization) descriptive metadata can usually change continuously, while digital metadata remains stable. (That's why we benefit from a separation between standards like MARC and METS on the two sides.)

    What are the cases of modifiable identifiers inside MARC? Titles
    of documents + authors?
It's the "extreme case" I have to deal with in my Invenio tests :-(
It happens (in a collaborative library network) when two MARC records describe the same publication, perhaps coming from two different libraries that each described their own copies of the book. The two records can be merged (say, "B" is chosen and absorbs the copies of "A"), so that in the export from that system I receive the new record ("B plus A's copies") with a reference to the old record to be replaced ("A"). In this case the Invenio "representation" of the MARC record can change (titles, authors, and also the system identifier), but it keeps its internal Invenio ID, because the publication is the same (so I have to preserve the information added in Invenio, such as user comments or digitizations).
In my opinion, this does not affect the Invenio data model at all; it only affects the Invenio import procedures. Personally, I worked at the BibConvert level. But Samuele recently explained (mailing list, 2011/03/31, "RFC: bibupload --merge for WebSubmit") that the merging procedure can be carried out under human control using the new BibMerge web interface.
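To make the merge case above concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: `RecordStore`, `upsert` and the `replaces` argument are hypothetical names, not Invenio, BibConvert or BibMerge APIs. The idea is simply that external system identifiers are looked up in an index, while the internal ID (and whatever is attached to it) stays put.

```python
class RecordStore:
    """Hypothetical in-memory store; not an Invenio API."""

    def __init__(self):
        self._next_id = 1
        self.by_internal_id = {}   # internal ID -> {"marc": ..., "comments": [...]}
        self.external_index = {}   # external system ID -> internal ID

    def upsert(self, external_id, marc, replaces=None):
        """Insert or update a record coming from an external system.

        `replaces` is the external ID of a record that the source system
        merged into this one (record "A" in the example above).
        """
        internal_id = self.external_index.get(external_id)
        if internal_id is None and replaces is not None:
            # The surviving record is new to us, but the replaced one is
            # known: reuse its internal ID so attached data is preserved.
            internal_id = self.external_index.get(replaces)
        if internal_id is None:
            internal_id = self._next_id
            self._next_id += 1
            self.by_internal_id[internal_id] = {"marc": marc, "comments": []}
        else:
            # Only the descriptive metadata changes; comments stay attached.
            self.by_internal_id[internal_id]["marc"] = marc
        self.external_index[external_id] = internal_id
        if replaces is not None:
            # The old external ID keeps resolving to the surviving record.
            self.external_index[replaces] = internal_id
        return internal_id


store = RecordStore()
a = store.upsert("A", {"245": "Some title"})
store.by_internal_id[a]["comments"].append("a user comment")
b = store.upsert("B", {"245": "Some title (merged)"}, replaces="A")
print(a == b, store.by_internal_id[b]["comments"])
```

The sketch covers only the case where one of the two records was already loaded. If both "A" and "B" already had their own internal IDs, their comments and digitizations would also need to be merged, which is left out here.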

    Exact file paths in the file system (as we happen to still have in
    some places in Inspire ?)

No no no: in my small case, please consider Invenio as a service provider into which multiple data flows arrive, and where each record receives a permanent Invenio identifier and permanent pointers to its digitizations. (I hope this answers the question.)

    By link between two do you mean a document identifying the same
    document with both at the same time ?

I simply mean this:
- MARC could point to METS (e.g. using the 856 field for a link to a METS file of the same record).
- But it's better if METS points to or embeds MARC, and points to the files, describing their features (MD5, format, dimensions, URL/URN/URI, ...), their document structure, access rights, etc.
This is from the simple viewpoint of export (and, potentially, import).

From the viewpoint of the data managed internally (created, modified, or only indexed internally), I know it's a different subject. I like Samuele's expression: "it would be nice to support METS in importing and exporting, (by storing aside when importing anything that is not understood, so that it can be re-exported)".
I interpret that in this way: Invenio could
- accept a (configurable?) selection of METS profiles for import (after validation?), storage (as an XML blob?), and export;
- and understand a (configurable?) selection of METS elements (extracted from the blob with something like XPath or XQuery, and stored in Invenio tables?) for its internal management (file pointers, simple document structure, and relations among files [versions, formats, pages]).
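A rough sketch of this "keep the blob, extract what is understood" idea, using only Python's standard `xml.etree.ElementTree` (which supports a limited XPath subset). The function name `import_mets`, the returned dictionary layout and the sample document are my own assumptions for illustration; nothing here is an existing Invenio interface, and no profile validation is attempted.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by METS and XLink.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Tiny made-up METS fragment (placeholder checksum and URL).
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="master">
      <mets:file ID="F1" MIMETYPE="image/tiff" CHECKSUM="abc" CHECKSUMTYPE="MD5">
        <mets:FLocat LOCTYPE="URL" xlink:href="http://example.org/f1.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def import_mets(xml_text):
    """Store the whole document and extract only the file pointers."""
    root = ET.fromstring(xml_text)
    files = []
    for f in root.findall(".//mets:file", NS):
        loc = f.find("mets:FLocat", NS)
        href = loc.get("{%s}href" % NS["xlink"]) if loc is not None else None
        files.append({
            "id": f.get("ID"),
            "mimetype": f.get("MIMETYPE"),
            "checksum": f.get("CHECKSUM"),
            "href": href,
        })
    # The untouched blob is kept so that anything not understood
    # can still be re-exported later.
    return {"blob": xml_text, "files": files}


record = import_mets(SAMPLE)
print(record["files"])
```

Export would then simply hand back `record["blob"]`, while the internal tables only ever see the extracted `files` entries.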

    What is physical storage for You ? From the physical storage I
    wanted to abstract exactly by providing such links /object/DOCID

I'm using "physical storage" in the general sense of disks / servers / storage centers / external digitization services, or any other solution through which the referenced files are accessed.
What I'm trying to propose is this: Invenio could guarantee
- persistence of identifiers and logical file-pointers (like your /object/DOCID) within metadata; and
- configurable interpretation of the logical pointers.
Example: if a collection of images [/nnnn/CollX_DocY] grows so much that I decide to move all its files to a bigger storage system, I'll be able to redirect all those pointers to the new system without changing the metadata, just by "re-configuring" the resolver for that collection's pointers. (I know it's a trivial problem that can also be solved at a lower level than the Invenio installation, but I wanted to share a major concern I have seen in digitization projects.)
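The re-configuration idea can be sketched in a few lines of Python. Everything here is assumed for the sake of the example: the pointer form `/object/CollX_DocY`, the per-collection configuration dictionary, the `resolve` function and the example.org URLs are all hypothetical.

```python
def resolve(pointer, config):
    """Turn a logical pointer into a physical URL using per-collection config.

    Assumes pointers look like "/object/<collection>_<document>".
    """
    prefix = "/object/"
    if not pointer.startswith(prefix):
        raise ValueError("not a logical object pointer: %r" % pointer)
    doc_id = pointer[len(prefix):]
    collection, _, _ = doc_id.partition("_")
    base = config[collection]  # KeyError if the collection is not configured
    return base.rstrip("/") + "/" + doc_id


# The metadata only ever stores the logical pointer.
pointer = "/object/CollX_DocY"

old_config = {"CollX": "http://old-storage.example.org/images"}
new_config = {"CollX": "http://big-storage.example.org/archive/"}

print(resolve(pointer, old_config))
print(resolve(pointer, new_config))
```

Moving the collection changes only the configuration entry; the pointer stored in the metadata stays the same.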

    I think, the word "version" creates confusion here as version in
    this sense is format in Invenio.
    The version which I was talking about is a number telling, how
    many times object was modified. Maybe revision is a better word.

Thank you: I really had misunderstood. And of course I agree with you about the importance of revisions too. Indeed, I think that feature could also be supported in METS: please take a look at this record from the Library of Congress (http://lcweb2.loc.gov/diglib/ihas/loc.natlib.gottlieb.09601/contactsheet.html), which contains 2 revisions (actually called "versions") of the same picture, each in 2 formats (tiff and jpg). [.. I'm still looking for an official METS example with many pages (sections / figures), in many formats, with many revisions...]

    I was rather thinking about providing a link between for example
    1st revision of the full text (whichever format) and 3rd revision
    of a figure. Assigning data to connection between particular
    revisions will be important from the point of view of processing
    of figures.

Very interesting! If I understand correctly, you want to support the complete workflow on a document, with possible connections between any stage of any component part (rather than only the complete picture of the final stage). Well, I really think that METS supports that, and it's only a matter of conventional usage of its basic elements. Referring only to the basic schema of METS (http://sunsite3.berkeley.edu/mets/diagram/ and http://www.loc.gov/standards/mets/docs/mets.v1-9.html), we find that:
- the <file> elements (with all their individual specifications of _formats_, description and technical data) can be organized in any number of nested <fileGrp> elements (the schema's fileGrpType);
- at every level, the USE attribute can record information about intended usage (master, reference, thumbnails, ...);
- and _revisions_ (of each file format) can be recorded within dedicated <fileGrp> elements using the VERSDATE attribute ("An optional dateTime attribute specifying the date this version/fileGrp of the digital object was created").
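As a small illustration of that convention, the following Python sketch builds a file section where one outer <fileGrp> stands for a figure and each nested <fileGrp> carries a VERSDATE for one revision. The IDs, dates and the `build_file_sec` helper are invented for the example, and the output is not validated against the METS schema or any profile (a real <file> would normally also carry an <FLocat>).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return "{%s}%s" % (METS_NS, tag)


def build_file_sec(revisions):
    """Build a fileSec from [(versdate, [(file_id, use, mimetype), ...]), ...]."""
    file_sec = ET.Element(q("fileSec"))
    # Outer group: one figure (the ID is a made-up example).
    figure = ET.SubElement(file_sec, q("fileGrp"), {"ID": "FIG1"})
    for versdate, files in revisions:
        # One nested group per revision, dated with VERSDATE.
        rev = ET.SubElement(figure, q("fileGrp"), {"VERSDATE": versdate})
        for file_id, use, mimetype in files:
            ET.SubElement(rev, q("file"),
                          {"ID": file_id, "USE": use, "MIMETYPE": mimetype})
    return file_sec


file_sec = build_file_sec([
    ("2011-06-01T00:00:00", [("F1", "master", "image/tiff"),
                             ("F2", "reference", "image/jpeg")]),
    ("2011-06-14T00:00:00", [("F3", "master", "image/tiff")]),
])
print(ET.tostring(file_sec, encoding="unicode"))
```

A link between, say, the 1st revision of the full text and the 3rd revision of a figure would then have to refer to the IDs of these nested groups or files; how best to express that link is a separate question that this sketch does not address.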

    Looking from my perspective, I think it would be nice to repeat
    the example in custom XML I proposed few mails ago and see if it
    can be easily reproduced..

You are right. Some concrete examples (let me start only with official ones):
- The LoC record above has a METS file which contains all the most important information (http://lcweb2.loc.gov/diglib/ihas/loc.natlib.gottlieb.09601/mets.xml): please look at the last 35 lines (and at the wrapped MARC description, at lines 84-291).
- The METS+PREMIS profile export of the same record (http://www.loc.gov/standards/premis/louis-2-0.xml) presents additional information about all the events performed for storage (lines 472-639): "validation, ingestion, migration".

[..Maybe a simple re-use of these files, inserting your data, could be a valid starting example?]

Thanks very much for your attention (..I don't know whether I'm annoying the mailing-list, so feel free to ask me directly whatever you think I know)

Cheers
Cristian
