Hi Sam and Piotr,

Your exchange is extremely interesting! Would you like some feedback from a different field of application? I am not speaking with any reference to the work for the deadlines you mention, only from the perspective of possible compliance with METS. So, selecting some important points from your mails...

On Sat, Jun 11, 2011 at 2:03 PM, Piotr Praczyk <[email protected]> wrote:

> > What in the end is the use case of standalone documents? How
> > can you later search for them? I guess they make sense only if they are
> > at the same time referenced by at least another document or by a record,
> > isn't it?
>
> not necessarily. the use cases I know (probably there are more about which
> Salvatore and Suenje have knowledge) are :
> 1) (the most obvious for me) - the case of standalone plots showing an
> important phenomenon. The access to them shall be provided by the figures
> search.

I propose the document digitization use case as a benchmark. Let's think about:
- one document with its metadata (say, a MARC or EAD record)
- with many image files (or video/audio files...) for the different pages (or streaming parts...), with many different file versions per page (such as the uncompressed master version for the archive, the high-level compressed version for local access, the low-level compressed version for online access, the thumbnail, etc.).

For that kind of use case, METS is the most widely adopted standard for digital metadata. (You probably don't need me to say that; anyhow, I found an old but prestigious study that confirms this: *Implementing Preservation Repositories for Digital Materials: Current Practices and Emerging Trends in the Cultural Heritage Community. OCLC/RLG PREMIS Working Group, 2004.* www.oclc.org/research/projects/pmwg/surveyreport.pdf)

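To make the digitization use case above a bit more concrete, here is a minimal sketch in Python (standard library only) of how the file versions of a scanned document could be grouped in a METS `<fileSec>`. The element and attribute names (`fileSec`, `fileGrp`, `file`, `FLocat`, `USE`, `LOCTYPE`) come from the METS schema; the version names, the MIME types, the object ID `book42` and the `urn:example:` identifiers are invented for illustration and are not taken from Invenio or from any registered profile.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)

# Hypothetical file versions kept for every scanned page (assumption).
VERSIONS = {
    "master": "image/tiff",     # uncompressed archive copy
    "local": "image/jpeg",      # compressed copy for local access
    "online": "image/jpeg",     # compressed copy for online access
    "thumbnail": "image/jpeg",
}


def build_file_sec(object_id, page_count):
    """Return a <fileSec> holding one <fileGrp> per file version."""
    file_sec = ET.Element(f"{{{METS}}}fileSec")
    for use, mimetype in VERSIONS.items():
        group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", {"USE": use})
        for page in range(1, page_count + 1):
            file_id = f"{use}-{page:04d}"
            file_el = ET.SubElement(
                group, f"{{{METS}}}file", {"ID": file_id, "MIMETYPE": mimetype}
            )
            # Logical pointer only: no physical path is stored in the metadata.
            ET.SubElement(
                file_el,
                f"{{{METS}}}FLocat",
                {
                    "LOCTYPE": "URN",
                    f"{{{XLINK}}}href": f"urn:example:{object_id}:{file_id}",
                },
            )
    return file_sec


file_sec = build_file_sec("book42", page_count=3)
print(ET.tostring(file_sec, encoding="unicode"))
```

Each `<fileGrp>` then lists all the pages of one version, so picking "the thumbnails" or "the master images" of a document is a matter of selecting one group.
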
Please also consider (as a confirmation of how widely adopted this choice is) that the Italian Ministry standards for digitization metadata have been mapped to METS both for libraries (http://www.iccu.sbn.it/opencms/opencms/it/main/standard/pagina_372.html) and for museums (http://www.culturaitalia.it/pico/museiditalia/profiles/mets/MuseiDItalia_METS_profile.html).

> 2) The case of data preservation towards which Inspire is turning - useful
> files of experimental data (usually they will be attached to papers but not
> necessarily)

Talking about preservation: METS is not only focused on administrative and technical metadata (already aimed at supporting storage management), but is also compatible with the PREMIS standard, which is specifically centered on preservation strategies (http://www.loc.gov/standards/premis/louis-2-0.xml).

> >> FFT should be left for fulltext upload where it serves the purpose
> >> perfectly and should be understood as syntactic sugar providing abbreviated
> >> form of a more general upload.
> >
> > It also serves well the case of many documents in many formats attached
> > to the same records. I hope all these use cases will still be supported
> > through FFTs
>
> The idea was to provide new mechanism not modifying the existing
> capabilities of Bibupload... just stopping to use FFT for objects as
> Figures.

> > * provide a web handler to access bibdocfiles regardless of them
> > being owned by a record (as the
> > current /record/123/files/foo.pdf will no longer work for non
> > fulltext) (BTW what about restriction/authorization? What if
> > bibdoc is referenced both by a public and a restricted record?
> > Should we go for the strongest restriction mode?)
>
> It can still work, but in a slightly more distant future we might want to
> provide /object/123 along with /record/12

Here (with this ad hoc selection of quotes) I'd like to pass on a major concern of the system analysts I worked with on digital libraries: the complete independence of the storage file system from the file pointers kept in the metadata. Digital metadata like METS are often used as a layer (in addition to descriptive metadata like MARC) to store file pointers managed through some resolver (based on algorithms like yours). That way, when we are dealing with many thousands of files and many terabytes, EVERY relocation, replacement or refresh of the storage system is possible without worrying about the logical (even permanent) pointers to the files.

(From the METS tutorial: *The LOCTYPE attribute specifies the type of locator contained in body of the element; valid values for LOCTYPE include 'URN,' 'URL,' 'PURL,' 'HANDLE,' 'DOI,' and 'OTHER.'*)

In concrete terms: in my experience I almost always saw
- descriptive metadata (like MARC) managed on one side, with specific (multiple) identifiers (even modifiable identifiers, in collaborative systems),
- digital repositories on the other side, with specific (stable!) identifiers for digital objects and their component files,
- and, in the middle, digital metadata (like METS) that guarantee the connections (regardless of the physical file storage).

[...Sorry if this observation is too obvious...]

> BDR is supposed to provide link between records and objects(document).
> In METS (if I understand correctly), they are used only to describe the
> internal structure of objects.

I'm not sure I'm following your arguments correctly, but here I have to suggest just the opposite: METS does provide the link between records and objects (documents). And the internal structure of objects. And the technical description of the digital object...

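As a rough illustration of the two points above (METS linking a record to an object, and logical pointers staying independent of physical storage), here is a minimal Python sketch using only the standard library. The METS element and attribute names (`dmdSec`, `mdRef`, `MDTYPE`, `fileSec`, `FLocat`, `structMap`, `div`, `fptr`) are from the METS schema; the record URL, the `urn:example:` identifier, the `STORAGE` table, the storage paths and the `resolve` helper are all hypothetical and do not describe how Invenio or any real resolver works.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS}}}{tag}"


def build_mets(record_url, file_urn):
    """Build a tiny METS document tying one record to one file."""
    mets = ET.Element(q("mets"))
    # Descriptive layer: only a pointer to the external MARC record.
    dmd = ET.SubElement(mets, q("dmdSec"), {"ID": "dmd1"})
    ET.SubElement(
        dmd,
        q("mdRef"),
        {"LOCTYPE": "URL", "MDTYPE": "MARC", f"{{{XLINK}}}href": record_url},
    )
    # File layer: a logical (URN) pointer, never a physical path.
    file_sec = ET.SubElement(mets, q("fileSec"))
    group = ET.SubElement(file_sec, q("fileGrp"), {"USE": "master"})
    file_el = ET.SubElement(group, q("file"), {"ID": "file1"})
    ET.SubElement(
        file_el, q("FLocat"), {"LOCTYPE": "URN", f"{{{XLINK}}}href": file_urn}
    )
    # Structural layer: the <div> connects the record (DMDID) to the file.
    struct_map = ET.SubElement(mets, q("structMap"))
    div = ET.SubElement(struct_map, q("div"), {"TYPE": "figure", "DMDID": "dmd1"})
    ET.SubElement(div, q("fptr"), {"FILEID": "file1"})
    return mets


# Hypothetical resolver table: the only place that knows physical paths.
STORAGE = {"urn:example:fig1": "/storage/vol1/fig1.tiff"}


def resolve(urn):
    """Map a logical identifier to its current physical location."""
    return STORAGE[urn]


mets = build_mets("http://example.org/record/123", "urn:example:fig1")
print(resolve("urn:example:fig1"))

# Migrating the storage touches only the resolver table, not the METS document.
STORAGE["urn:example:fig1"] = "/storage/vol2/fig1.tiff"
print(resolve("urn:example:fig1"))
```

The technical metadata section (`amdSec`) is left out of this sketch to keep it short.
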
METS is a container that can include or point to many different layers with their respective schemas, including the (MARC) descriptive metadata.

> >> <BibDocRelation bibdoc1="tmp:NewFigure1" version1="1" bibdoc2="12456"
> >> version2="2" type="extracted_from"/>
> >
> > I really like the idea of creating links between specific versions.
> > Unfortunately METS is not aware of versions :-(
> >> Example:
>
> Versions are crucial for us exactly for the reason You noted in last message

METS does support multiple file versions! And precisely in the way you mentioned as an example :-)

(You have probably already read it, but) let me repeat a quote from the METS tutorial (http://www.loc.gov/standards/mets/METSOverview.v2.html#filegrp):

*The file section (<fileSec>) contains one or more <fileGrp> elements used to group together related files. A <fileGrp> lists all of the files which comprise a single electronic version of the digital library object. For example, there might be separate <fileGrp> elements for the thumbnails, the master archival images, the pdf versions, the TEI encoded text versions, etc.*
[...]
*<fileGrp> becomes much more useful for objects consisting of large numbers of scanned page images, or indeed any case where a single version of the object consists of a large number of files. In those cases, being able to separate <file> elements into <fileGrp>s makes identifying the files belonging to a particular version of the document a simple task.*

I can say for sure that I used MAG (the Italian, "mappable" version of METS) to manage multiple images per book, with multiple versions per image (uncompressed master TIFF plus compressed JPG... indeed, multi-resolution JPG file versions, but this is another topic).

> > Do you mean by "Main" the identifier of the BibDocRelation? Is the
> > "number of a document to reference" the docid of an existing BibDoc?
>
> I just mean the same thing as Main means right now -> that a particular
> document is main for a given record (ie a fulltext can be main doc and
> extracted from it figures are non-main)

UNFORTUNATELY, METS supports this too:

*"METS pointers specify separate METS documents as containing the relevant file information for the <div> containing them. This can be useful when encoding large collections of material (e.g., an entire journal run) [...] File pointers specify files ... within the current METS document's <fileSec> section that correspond to the portion in the hierarchy represented by the current <div>"*

where:

*"Possible <div> TYPE attribute values include: chapter, article, page, track, segment, section etc. METS places no constraints on the possible TYPE values. Suggestions for controlled vocabularies for TYPE may be found on the METS website."*

But, giving my two cents: this should be the business of the descriptive metadata layer! If you have a multi-level MARC description (say, one parent record for a journal, with many child records for its issues), you could then produce the METS records for the digitized parts.

> > Moreover I really would dream if integrating METS in your second file
> > format would be possible.
>
> This rises a great dilemma :) as You noted, this would really endanger the
> timeframe which is rather crucial for me as I should concentrate on other
> things.
> On the other things, it is obviously beneficial for Invenio to support
> standard rather than invent one (unless this is significantly different)
> We would though have to extend METS.

I chose this quote from your mail to make an admission: METS does risk being too flexible and dispersive, and supporting it requires some decisions. But it really doesn't lack existing extended profiles: please take a look at the currently registered profiles at http://www.loc.gov/standards/mets/mets-registered-profiles.html.

> What level of support for METS did you have in mind ?
> Exporting data. importing in full format, importing in some subset of the format ?
>
> Maybe METS should not be native to Invenio but we should start with
> supporting the possibility to export data in this format ?

Obviously I am not proposing answers to these organizational questions. I will only tell you what I generally see in commercial solutions for digitization in Italian document centers.
- Almost always, the digital metadata come from a procedure that is asynchronous with respect to the population of the descriptive metadata. (For example: books that are already catalogued are digitized afterwards.)
- The technical data to be generated automatically are not so numerous (md5, file size, etc.), so it could be interesting to develop an internal procedure in Invenio :-)
- Still, in this direction, I have often seen the need for a semi-automatic procedure: a human (web) interface is needed for entering information such as page name, image label, etc.
- Or, another semi-automatic procedure can be the external creation of half-processed digital objects, with multimedia files and a textual information file (about the structure), organized in a conventional way. The digital library then needs a procedure that, on the basis of a "guide file", imports the half-processed objects and creates the technical data. (This is the way I worked recently.)
- Or, I have seen digital metadata created entirely in an external system and imported into the service provider (on the basis of a standard!), but in this case some decisions became critical, such as:
  - the assignment of IDs to multimedia files by an external system
  - and, consequently, the integration with the system for physical storage
  - and the integration with the online access interface

Thanks for reading this mail (it was only meant as encouragement on this very interesting topic).

Cheers,
Cristian Bacchi
