Hi Sam and Piotr,
Your exchange is extremely interesting!
Would you like some feedback from a different field of application?
I am not commenting on the work for the deadlines you mention, only on
possible compliance with METS.
So, picking out some important points from your mails...

On Sat, Jun 11, 2011 at 2:03 PM, Piotr Praczyk <[email protected]> wrote:

> > What in the end is the use case of standalone documents? How
> > can you later search for them? I guess they make sense only if they are
> > at the same time referenced by at least another document or by a record,
> > isn't it?
>
> not necessarily. the use cases I know (probably there are more about which
> Salvatore and Suenje have knowledge) are :
> 1) (the most obvious for me) - the case of standalone plots showing an
> important phenomenon. The access to them shall be provided by the figures
> search.
>

I propose document digitization as a benchmark use case. Consider:
- one document with its metadata (say a MARC or EAD record)
- with many image files (or video/audio files...) for the different pages
(or streaming parts...), and many file versions per page (for example:
the uncompressed master version for archiving; the high-quality compressed
version for local access; the lower-quality compressed version for online
access; the thumbnail; etc.).

For that kind of use case, METS is the most widely adopted standard for
digital metadata (you probably don't need me to tell you that;
anyhow, I found an old but respected study that confirms it: *Implementing
Preservation Repositories for Digital Materials: Current Practices and
Emerging Trends in the Cultural Heritage Community. OCLC/RLG PREMIS Working
Group, 2004.* www.oclc.org/research/projects/pmwg/surveyreport.pdf).

Please also consider (as confirmation of how widely this choice is adopted)
that the Italian Ministry standards for digitization metadata have been
mapped to METS both for libraries (
http://www.iccu.sbn.it/opencms/opencms/it/main/standard/pagina_372.html) and
for museums (
http://www.culturaitalia.it/pico/museiditalia/profiles/mets/MuseiDItalia_METS_profile.html
).


> 2) The case of data preservation towards which Inspire is turning - useful
> files of experimental data (usually they will be attached to papers but not
> necessarily)


On preservation: METS not only covers administrative and technical
metadata (which already support storage management), but is also
compatible with the PREMIS standard, which focuses specifically on
preservation strategies (http://www.loc.gov/standards/premis/louis-2-0.xml).

>>      FFT should be left for fulltext upload where it serves the purpose
> >>      perfectly and should be understood as syntactic sugar providing
> abbreviated
> >>     form of a more general upload.
>
> >It also serves well the case of many documents in many formats attached
> >to the same records. I hope all these use cases will still be supported
> >through FFTs
>
> The idea was to provide new mechanism not modifying the existing
> capabilities of Bibupload... just stopping to use FFT for objects as
> Figures.



> >      * provide a web handler to access bibdocfiles regardless of them
> >        being owned by a record (as the
> >        current /record/123/files/foo.pdf will no longer work for non
> >        fulltext) (BTW what about restriction/authorization? What if
> >        bibdoc is referenced both by a public and a restricted record?
> >        Should we go for the strongest restriction mode?)
>
> It can still work, but in a slightly more distant future we might want to
> provide /object/123 along with /record/12
>

Here (with this ad-hoc selection of quotes) I'd like to pass on a major
concern of the system analysts I worked with on digital libraries:
keeping the storage file system completely independent of the file pointers
held in the metadata.
Digital metadata like METS is often used as a layer (in addition to
descriptive metadata like MARC) to store file pointers managed by some
resolver (based on algorithms like yours). That way, when we are dealing with
many thousands of files and many terabytes, ANY reorganization,
replacement, or refresh of the storage system is possible
without worrying about the logical (even permanent) pointers to the files.
(From the METS tutorial: *The LOCTYPE attribute specifies the type of locator
contained in body of the element; valid values for LOCTYPE include 'URN,'
'URL,' 'PURL,' 'HANDLE,' 'DOI,' and 'OTHER.'*)
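
To make the resolver idea concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `StorageResolver` class, the `urn:example:` identifier, the `/storage/v1` and `/storage/v2` roots); it is not Invenio or METS API, only an illustration of how the logical pointers stored in the metadata can stay fixed while the physical storage changes.

```python
from dataclasses import dataclass, field


@dataclass
class StorageResolver:
    """Maps stable logical identifiers to current physical locations.

    Hypothetical helper, for illustration only.
    """
    # logical identifier (as stored in the metadata) -> path relative to root
    paths: dict = field(default_factory=dict)
    # root of the current storage system; this is what changes on migration
    root: str = "/storage/v1"

    def register(self, logical_id: str, relative_path: str) -> None:
        self.paths[logical_id] = relative_path

    def resolve(self, logical_id: str) -> str:
        """Return the current physical location for a logical identifier."""
        return f"{self.root}/{self.paths[logical_id]}"

    def migrate(self, new_root: str) -> None:
        """Switch to a new storage system; logical identifiers are untouched."""
        self.root = new_root


resolver = StorageResolver()
resolver.register("urn:example:book42:page0001:master", "book42/master/0001.tif")

before = resolver.resolve("urn:example:book42:page0001:master")
resolver.migrate("/storage/v2")
after = resolver.resolve("urn:example:book42:page0001:master")

print(before)  # /storage/v1/book42/master/0001.tif
print(after)   # /storage/v2/book42/master/0001.tif
```

The metadata only ever stores the `urn:example:...` value, so a storage refresh touches the resolver and nothing else.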

In concrete terms: in my experience I have almost always seen
- descriptive metadata (like MARC) managed on one side, with its own
(multiple) identifiers (which may even be modifiable in collaborative
systems),
- digital repositories on the other side, with their own (stable!)
identifiers for digital objects and their component files,
- and, in the middle, digital metadata (like METS) that guarantees the
connections (regardless of the physical file storage).

[...Sorry if this observation was too obvious...]

> BDR is supposed to provide link between records and objects(document).
> In METS (if I understand correctly), they are used only to describe the
> internal structure of objects.
>

I'm not sure I'm following your argument correctly, but here I have to
suggest just the contrary: METS does provide the link between records and
objects (documents), as well as the internal structure of objects and the
technical description of the digital object.
METS is a container that can include or point to many different layers,
each with its own schema, including the (MARC) descriptive metadata.
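
As a rough illustration of that container role, here is a small Python sketch that builds a skeletal METS document with the standard library. The element and attribute names (`dmdSec`, `mdRef`, `fileSec`, `structMap`, `DMDID`, `fptr`) are METS ones; the `build_mets` function, the IDs and the example.org URLs are my own assumptions, and the result is not validated against the schema or any profile.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def q(ns: str, tag: str) -> str:
    """Return a namespace-qualified name in ElementTree's {ns}tag form."""
    return f"{{{ns}}}{tag}"


def build_mets(record_url: str, file_url: str) -> ET.Element:
    """Hypothetical helper: one record, one file, one structural div."""
    mets = ET.Element(q(METS, "mets"))

    # Descriptive layer: point to the external MARC record
    dmd = ET.SubElement(mets, q(METS, "dmdSec"), ID="DMD1")
    ET.SubElement(dmd, q(METS, "mdRef"), {
        "LOCTYPE": "URL", "MDTYPE": "MARC", q(XLINK, "href"): record_url})

    # File layer: the digital object's files
    file_sec = ET.SubElement(mets, q(METS, "fileSec"))
    grp = ET.SubElement(file_sec, q(METS, "fileGrp"), USE="fulltext")
    f = ET.SubElement(grp, q(METS, "file"), ID="FILE1",
                      MIMETYPE="application/pdf")
    ET.SubElement(f, q(METS, "FLocat"), {
        "LOCTYPE": "URL", q(XLINK, "href"): file_url})

    # Structure: the div ties the record (DMDID) to the file (fptr)
    smap = ET.SubElement(mets, q(METS, "structMap"))
    div = ET.SubElement(smap, q(METS, "div"), TYPE="article", DMDID="DMD1")
    ET.SubElement(div, q(METS, "fptr"), FILEID="FILE1")
    return mets


doc = build_mets("http://example.org/record/123",
                 "http://example.org/record/123/files/foo.pdf")
print(ET.tostring(doc, encoding="unicode"))
```

The point is only that the record-to-object link lives in the `<div>`: `DMDID` refers to the descriptive section and `<fptr>` to the file.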


> >> <BibDocRelation bibdoc1="tmp:NewFigure1" version1="1" bibdoc2="12456"
> version2="2" type="extracted_from"/>
>
> >I really like the idea of creating links between specific versions.
> >Unfortunately METS is not aware of versions :-(
> >> Example:
> Versions are crucial for us exactly for the reason You noted in last message
>

METS does support multiple file versions! And in precisely the way you
gave as an example :-)
(You have probably read it already, but) let me quote the METS tutorial (
http://www.loc.gov/standards/mets/METSOverview.v2.html#filegrp):
*The file section (<fileSec>) contains one or more <fileGrp> elements used
to group together related files. A <fileGrp> lists all of the files which
comprise a single electronic version of the digital library object. For
example, there might be separate <fileGrp> elements for the thumbnails, the
master archival images, the pdf versions, the TEI encoded text versions,
etc.*
[…]
*<fileGrp> becomes much more useful for objects consisting of large numbers
of scanned page images, or indeed any case where a single version of the
object consists of a large number of files. In those cases, being able to
separate <file> elements into <fileGrp>s makes identifying the files
belonging to a particular version of the document a simple task.*
I can say from experience that I used MAG (the Italian standard, which can
be mapped to METS) to manage multiple images per book, with multiple
versions per image (uncompressed master TIFF, plus compressed JPG... in
fact a multi-resolution JPG version, but that is another topic).
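
For what it is worth, the grouping described in the tutorial can be sketched in a few lines of Python. The `build_file_sec` function, the `VERSIONS` table, the IDs and the relative paths are all hypothetical choices of mine; only the METS element and attribute names come from the standard.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"

# Hypothetical versions: USE label -> (file extension, MIME type)
VERSIONS = {
    "master": ("tif", "image/tiff"),
    "access": ("jpg", "image/jpeg"),
    "thumbnail": ("jpg", "image/jpeg"),
}


def build_file_sec(book_id: str, page_count: int) -> ET.Element:
    """Build a <fileSec> with one <fileGrp> per version of the object."""
    file_sec = ET.Element(METS + "fileSec")
    for use, (ext, mime) in VERSIONS.items():
        grp = ET.SubElement(file_sec, METS + "fileGrp", USE=use)
        # One <file> per page inside each version group
        for page in range(1, page_count + 1):
            f = ET.SubElement(grp, METS + "file",
                              ID=f"{use}-{page:04d}", MIMETYPE=mime)
            ET.SubElement(f, METS + "FLocat", {
                "LOCTYPE": "URL",
                XLINK + "href": f"{book_id}/{use}/{page:04d}.{ext}",
            })
    return file_sec


file_sec = build_file_sec("book42", page_count=3)
groups = file_sec.findall(METS + "fileGrp")
print([g.get("USE") for g in groups])         # ['master', 'access', 'thumbnail']
print(len(groups[0].findall(METS + "file")))  # 3
```

Finding all the files of one version is then a matter of selecting the `<fileGrp>` with the right `USE` value.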


> >Do you mean by "Main" the identifier of the BibDocRelation? Is the the
> >"number of a document to reference" the docid of an existing BibDoc?
>
> I just mean the same thing as Main means right now -> that a particular
> document is main for a given record (i.e. a fulltext can be main doc and
> extracted from it figures are non-main)
>

UNFORTUNATELY, METS also supports this: *“METS pointers specify separate
METS documents as containing the relevant file information for the <div>
containing them. This can be useful when encoding large collections of
material (e.g., an entire journal run) […] File pointers specify files ...
within the current METS document's <fileSec> section that correspond to the
portion in the hierarchy represented by the current <div>*”
Where: “*Possible <div> TYPE attribute values include: chapter, article,
page, track, segment, section etc. METS places no constraints on the
possible TYPE values. Suggestions for controlled vocabularies for TYPE may
be found on the METS website.*”

But, to give my two cents: this should be the business of the
descriptive-metadata layer! If you have a multi-level MARC description
(say, one parent record for a journal, with many child records for its
issues), you could then produce the METS records for the digitized parts.


> >Moreover I really would dream if integrating METS in your second file
> >format would be possible.
>
> This raises a great dilemma :) as You noted, this would really endanger the
> timeframe which is rather crucial for me as I should concentrate on other
> things.
> On the other hand, it is obviously beneficial for Invenio to support
> standard rather than invent one (unless this is significantly different)
> We would though have to extend METS.
>

I chose this quote from your mail to make an admission: METS does risk
being too flexible and diffuse, and supporting it requires some decisions.
But there is really no lack of existing extended profiles:
please take a look at the currently registered profiles
http://www.loc.gov/standards/mets/mets-registered-profiles.html.



> What level of support for METS did you have in mind ? Exporting data.
> importing in full format, importing in some subset of the format ?
>


>   Maybe METS should not be native to Invenio but we should start with
> supporting the possibility to export data in this format ?
>

Obviously I am not proposing answers to these organizational questions.
I can only tell you what I generally see in commercial digitization
solutions in Italian documentation centres.
- Almost always, the digital metadata is created asynchronously from the
descriptive metadata. (For example: books that are already catalogued are
digitized afterwards.)
- The technical data to be generated automatically is not extensive (MD5,
file size, etc.), so it could be interesting to develop an
internal procedure in Invenio :-)
- Even so, I have often seen the need for a semi-automatic
procedure: a human (web) interface is needed to enter information
such as page name, image label, etc.
- Alternatively, another semi-automatic procedure is the external creation
of half-processed digital objects, each with its multimedia files and a text
file describing the structure, organized according to a convention. The
digital library then needs a procedure that, based on a “guide file”,
imports the half-processed objects and creates the technical data. (This is
how I have worked recently.)
- Or, I have seen digital metadata created entirely in an external system
and imported into the service provider (on the basis of a standard!), but in
this case some decisions became critical, such as:
  - the assignment of IDs to multimedia files by an external system
  - and, consequently, the integration with the system for the physical
  storage
  - and the integration with the online-access interface
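
To show how small the automatic part is, the MD5 and file size mentioned in the list above could be computed with something like the following Python sketch. The `technical_metadata` function and its dictionary keys are hypothetical names of mine, not Invenio code; the values would map naturally onto the CHECKSUM, CHECKSUMTYPE and SIZE attributes of a METS `<file>` element.

```python
import hashlib
import os
import tempfile


def technical_metadata(path: str, chunk_size: int = 1 << 20) -> dict:
    """Return the MD5 checksum and the size in bytes of a file."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large master images need not fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "checksum": md5.hexdigest(),
        "checksum_type": "MD5",
        "size": os.path.getsize(path),
    }


# Usage on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
info = technical_metadata(tmp.name)
os.remove(tmp.name)
print(info)
```

Page names and image labels, by contrast, cannot be derived this way, which is why the human interface keeps coming up.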

Thanks for reading this mail (it was only meant as encouragement on this
very interesting topic).

Cheers,
Cristian Bacchi
