Dear Cristian, Piotr and Sam,
Thanks a lot for this very interesting discussion.
I just want to let you know that the BlogForever EU Project Invenio
workshop is taking place these days, and support for METS within
Invenio (import/export as you describe it) is clearly a feature that
would be very beneficial for this project as well.
So, in any case, support for METS within Invenio is in the pipeline :-)
It would be perfect to design it so that it also fits Piotr's figure
management project.
Cheers,
JY
On 06/16/2011 08:31 PM, Cristian Bacchi wrote:
Hello !
I'm happy to share this interest in digital standards, though I
fully understand that your ultimate concern is planning the development
effort.
So I reply to your questions with the sole aim of offering ideas for
your concrete study of the Invenio data model.
On Tue, Jun 14, 2011 at 8:08 PM, Piotr Praczyk <piotr.prac...@cern.ch
<mailto:piotr.prac...@cern.ch>> wrote:
This is not the use case of figures from scientific publications
(if I understand correctly, it looks rather like digitisation of
entire documents), though it seems to be relevant for Invenio/Inspire
in general. It looks like a nice benchmark of the underlying
data structures.
OK, I understand. And I agree.
>Speaking concretely: in almost every case I have seen, there are
>- descriptive-metadata (like MARC) managed on one side with
specific (multiple) identifiers (also modifiable identifiers, in
collaborative systems),
>- digital-repositories, on the other side, with specific
(stable!!) identifiers for digital-objects and their component files,
>- and, in the middle, digital-metadata (like METS), which
guarantee the connections (regardless of the physical file storage).
I think I did not understand this part.
I used this example only to argue that (in the field of book
digitization) descriptive-metadata can usually change continuously,
while digital-metadata remain stable. (That's why we benefit from a
separation between standards like MARC and METS on the two sides.)
What are the cases of modifiable identifiers inside MARC ? Titles
of documents + authors ?
It's the "extreme case" I have to deal with in my Invenio tests :-(
It happens (in a collaborative library network) when two MARC records
describe the same publication, maybe coming from two different
libraries which described their own book copies. The two records can be
merged (say: "B" is chosen and includes the copies of "A"), so that in
the export from that system I receive the new record ("B plus A
copies"), with a reference to the old record to be replaced (->"A").
In this case, the Invenio "representation" of the MARC record can
change (titles, authors and also the system identifier), but it keeps
its internal Invenio ID, because the publication is the same (so that
I have to preserve information added in Invenio, like user comments or
digitization).
In my opinion, this doesn't affect the Invenio data model at all;
it only affects the Invenio import procedures. Personally, I worked
at the level of BibConvert. But Samuele recently (mailing-list,
2011/03/31, "RFC: bibupload --merge for WebSubmit") explained that the
merging procedure can be done under human control using the new
BibMerge web interface.
Exact file paths in the file system (as we happen to still have in
some places in Inspire ?)
No no no: in my little case, please consider Invenio as a
service provider into which multiple data flows arrive, and where each
record receives a permanent Invenio identifier and permanent pointers
to its digitizations. (I hope this answers your question.)
By a link between the two, do you mean a document identifying the same
document with both at the same time?
I simply mean this:
- MARC could point to METS (e.g. using the 856 field for a link to a METS
file of the same record). But it's better if
- METS points to or embeds MARC, and points to the files, describing
their features (md5, format, dimension, URL/URN/URI, ...), their
document structure, access rights, etc.
This is from the simple viewpoint of exports (and, potentially, imports).
From the viewpoint of the data managed internally (internally
created/modified/only_indexed), I know it's a different subject: I
like Samuele's expression "/it would be nice to support METS in
importing and exporting, (by storing aside when importing anything
that is not understood, so that it can be re-exported)/".
I interpret that in this way: Invenio could
- accept a (configurable?) selection of METS profiles, for import
(after validation?), storage (as an XML blob?), and export;
- and understand a (configurable?) selection of METS elements
(extracted from the blob with something like XPath or XQuery, and stored
in Invenio tables?), for its internal management (file pointers,
simple document structure and relations among files [versions, formats, pages]).
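
To make the second point a bit more concrete, here is a minimal sketch in
Python of pulling file pointers out of a stored METS blob with XPath-like
queries. Everything in it is an assumption made for illustration: the
sample blob (which is only a fragment, not a complete valid METS
document), the /object/DOC1/... paths, and the function name
extract_file_pointers are invented; nothing here is an existing Invenio API.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by METS and XLink.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical METS fragment, invented for illustration only.
METS_BLOB = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                          xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="master">
      <mets:file ID="F1" MIMETYPE="image/tiff"
                 CHECKSUM="d41d8cd98f00b204e9800998ecf8427e"
                 CHECKSUMTYPE="MD5">
        <mets:FLocat LOCTYPE="URL" xlink:href="/object/DOC1/page1.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="reference">
      <mets:file ID="F2" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL" xlink:href="/object/DOC1/page1.jpg"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def extract_file_pointers(mets_xml):
    """Return one dict per <file>, ready to be stored in a table.

    The blob itself is left untouched, so it can be re-exported as is.
    """
    root = ET.fromstring(mets_xml)
    rows = []
    # ".//" also reaches nested <fileGrp> elements.
    for grp in root.findall(".//mets:fileGrp", NS):
        # Only direct <file> children, so nested groups are not counted twice.
        for f in grp.findall("mets:file", NS):
            flocat = f.find("mets:FLocat", NS)
            href = None
            if flocat is not None:
                href = flocat.get("{%s}href" % NS["xlink"])
            rows.append({
                "id": f.get("ID"),
                "use": grp.get("USE"),
                "mimetype": f.get("MIMETYPE"),
                "checksum": f.get("CHECKSUM"),  # None when absent
                "href": href,
            })
    return rows


if __name__ == "__main__":
    for row in extract_file_pointers(METS_BLOB):
        print(row)
```

Elements that the extractor does not look at simply stay in the blob,
which is the "store aside and re-export" idea.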
What is physical storage for you? I wanted to abstract away from the
physical storage exactly by providing such links: /object/DOCID
I'm using "physical storage" in the general sense of disks / servers
/ storage-center / external-service-for-digitization / or any other
solution through which the referred files are accessed.
What I'm trying to propose is this: Invenio could guarantee
- persistence of identifiers and logical file-pointers (like your
/object/DOCID) within metadata; and
- configurable interpretation of the logical pointers.
Example: if a collection of images [/nnnn/CollX_DocY] grows so much
that I decide to move all its files to a larger storage system, I'll
be able to redirect all those pointers towards the new system without
changing the metadata, only by a "re-configuration" of the resolver
for that collection's pointers.
(I know it's a trivial problem that can also be solved at a lower
level than the Invenio installation, but I wanted to share a big
concern I have seen in digitization projects.)
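
A minimal sketch of such a configurable resolver, in Python. It is only
an illustration under assumptions: the reading of "/nnnn/CollX_DocY" as
"collection name, underscore, document name", the example.org base URLs,
and the names RESOLVER_CONFIG and resolve are all hypothetical.

```python
# Hypothetical per-collection configuration: collection name -> base URL
# of the storage system that currently holds its files.
RESOLVER_CONFIG = {
    "CollX": "http://old-storage.example.org/images",
}


def resolve(pointer, config):
    """Turn a logical pointer such as '/nnnn/CollX_DocY' into a URL.

    Assumption: the last path segment is '<collection>_<document>'.
    The pointer stored in the metadata is never modified.
    """
    name = pointer.rstrip("/").rsplit("/", 1)[-1]
    collection, _, doc = name.partition("_")
    if collection not in config:
        raise LookupError("no resolver configured for %r" % collection)
    return "%s/%s" % (config[collection].rstrip("/"), doc)


if __name__ == "__main__":
    pointer = "/0001/CollX_DocY"
    print(resolve(pointer, RESOLVER_CONFIG))

    # The collection moves to a bigger system: only the configuration
    # changes, the pointer in the metadata stays the same.
    RESOLVER_CONFIG["CollX"] = "http://big-storage.example.org/archive"
    print(resolve(pointer, RESOLVER_CONFIG))
```

Moving a collection then means editing one configuration entry, while
every record keeps its original logical pointer.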
I think the word "version" creates confusion here, as a version in
this sense is a format in Invenio.
The version I was talking about is a number telling how
many times the object was modified. Maybe revision is a better word.
Thank you: I really had misunderstood. And of course I agree with you
about the importance of revisions too.
Indeed, I think that feature could also be supported in METS:
please take a look at this record of the Library of Congress
(http://lcweb2.loc.gov/diglib/ihas/loc.natlib.gottlieb.09601/contactsheet.html),
which contains 2 revisions (actually called "versions") of the same
picture, both in 2 formats (tiff and jpg).
[.. I'm still looking for an official METS example with many pages
(sections / figures), in many formats, with many revisions...]
I was rather thinking about providing a link between, for example,
the 1st revision of the full text (whichever format) and the 3rd
revision of a figure. Assigning data to the connection between
particular revisions will be important from the point of view of
processing of figures.
Very interesting! If I understand correctly, you want to support the
complete workflow on a document, with possible connections between any
stage of any component part (rather than only the complete state of the
final stage).
Well, I really think that METS supports that, and it's only a matter
of conventional usage of its basic elements.
Referring only to the basic schema of METS
(http://sunsite3.berkeley.edu/mets/diagram/ and
http://www.loc.gov/standards/mets/docs/mets.v1-9.html) we find that:
- the <file> elements (with all their individual specifications about
_formats_, description and technical data) can be organized in
any number of nested <fileGrp> elements (complex type fileGrpType
in the schema).
- At every level the USE attribute can record information about
usage (master, reference, thumbnails, ...).
- And _revisions_ (of each file format) can be recorded within
dedicated <fileGrp> elements using the VERSDATE attribute ("/An
optional dateTime attribute specifying the date this version/fileGrp
of the digital object was created/").
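
As a rough sketch of that convention (one outer <fileGrp> per revision
carrying VERSDATE, one nested <fileGrp> per usage carrying USE), here is
a small Python example that builds such a <fileSec>. It is only one
possible convention, not something mandated by METS; the dates, file IDs
and the helper names q and build_file_sec are invented for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return "{%s}%s" % (METS_NS, tag)


# Hypothetical data: two revisions of one picture, each in two formats.
REVISIONS = [
    ("2011-05-01T00:00:00", [("master", "R1_TIFF", "image/tiff"),
                             ("reference", "R1_JPG", "image/jpeg")]),
    ("2011-06-01T00:00:00", [("master", "R2_TIFF", "image/tiff"),
                             ("reference", "R2_JPG", "image/jpeg")]),
]


def build_file_sec(revisions):
    """Build a <fileSec>: one <fileGrp VERSDATE=...> per revision,
    each containing one nested <fileGrp USE=...> per usage."""
    file_sec = ET.Element(q("fileSec"))
    for versdate, files in revisions:
        rev_grp = ET.SubElement(file_sec, q("fileGrp"),
                                {"VERSDATE": versdate})
        for use, file_id, mimetype in files:
            use_grp = ET.SubElement(rev_grp, q("fileGrp"), {"USE": use})
            ET.SubElement(use_grp, q("file"),
                          {"ID": file_id, "MIMETYPE": mimetype})
    return file_sec


if __name__ == "__main__":
    print(ET.tostring(build_file_sec(REVISIONS), encoding="unicode"))
```

Since every <file> gets its own ID, a link between a given revision of
the full text and a given revision of a figure could in principle refer
to those IDs.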
From my perspective, I think it would be nice to repeat
the example in custom XML I proposed a few mails ago and see if it
can be easily reproduced.
You are right. Here are some concrete examples (let me start with
official examples only):
- The above LoC record has a METS file which contains all the most
important information
(http://lcweb2.loc.gov/diglib/ihas/loc.natlib.gottlieb.09601/mets.xml): please
look at the last 35 lines (and the wrapped MARC description, at lines 84-291).
- The METS+PREMIS_profile export of the same record
(http://www.loc.gov/standards/premis/louis-2-0.xml) presents
additional information on all the events performed for storage
(lines 472-639): "validation, ingestion, migration".
[..Maybe a simple re-use of these files, inserting your data, could be
a valid starting example?]
Thanks very much for your attention (..I don't know whether I'm
annoying the mailing-list, so feel free to ask me directly whatever
you think I know)
Cheers
Cristian