Hello!

I had never heard of METS before, so today I invested some time in learning about it (which means I have some, but rather limited, knowledge).

From what I have read, my impression is that the model described by METS is in some areas much more detailed than the one we are using (with extensions). In some other areas, it is missing concepts we need.

What level of support for METS did you have in mind? Exporting data, importing the full format, or importing some subset of the format? Do you think we could easily implement something that would deal with the concepts present there?

- metadata of objects encoded in different formats (it looks like a more general concept of MoreInfo)
- relations between different parts of different files (in the examples, they showed how to divide monolithic files into chapters and link them to each other (audio format with transcript format))

Implementing all this sounds like a rather big project to me... especially since we would have to extend the standard (which would again make us not completely compatible). What do you think about this? Maybe METS should not be native to Invenio, and we should start by supporting export of data in this format?

> What in the end is the use case of standalone documents? How
> can you later search for them? I guess they make sense only if they are
> at the same time referenced by at least another document or by a record,
> isn't it?

Not necessarily. The use cases I know of (there are probably more that Salvatore and Suenje know about) are:

1) (the most obvious for me) standalone plots showing an important phenomenon. Access to them will be provided by the figures search.
2) data preservation, towards which Inspire is turning: useful files of experimental data (usually they will be attached to papers, but not necessarily).

>> Of course, it is possible to encode anything in MARC, but it will
>> quickly become unreadable and the code implementing encoding/decoding
>> of the data will be more error-prone.
>> FFT should be left for fulltext upload where it serves the purpose
>> perfectly and should be understood as syntactic sugar providing
>> abbreviated form of a more general upload.
>
> It also serves well the case of many documents in many formats attached
> to the same records.
> I hope all these use cases will still be supported
> through FFTs

The idea was to provide a new mechanism without modifying the existing capabilities of BibUpload... just to stop using FFT for objects such as figures.

>> + Uploading of the documents
>>
>> Current mechanism for uploading documents to Invenio is very much
>> oriented towards managing fulltexts that can belong to only one record.
>> It is difficult to extend BibUpload to allow attachments of the same
>> BibDoc to many records or to create BibDocs not related to any record
>> using the FFT syntax.
>
> This makes me think we should probably not use BibUpload to manage
> BibDocs (besides keeping on with the FFT thing). What we have done
> up-to-now, for very complex manipulation of BibDocs (e.g. in case of
> WebSubmit), was to do everything with the API, and then send an FFT with
> a FIX-MARC to synchronize the 8564_ fields so that they reflect the last
> state of the documents.
> So you would use it only for documents that are attached to at least one
> record, isn't it? Otherwise you should really think of implementing a
> tool separated from BibUpload that can act irrespectively of records.

Indeed, I was thinking about having one tool only because of the temporary identifiers described later. Having one tool makes their semantics simpler: we do not have to implement complicated scenarios for detecting whether a given temporary id is still needed. Do you see any elegant solution to this problem with separate tasks?

> In particular its "File Section"
> <http://www.loc.gov/standards/mets/METSOverview.v2.html#filegrp> would
> match the current FFT
> and the "Structural Map"
> <http://www.loc.gov/standards/mets/METSOverview.v2.html#structmap>
> and the "Structural Links"
> <http://www.loc.gov/standards/mets/METSOverview.v2.html#structlink>
> do really sound to me like your BDR proposal.

BDR is supposed to provide a link between records and objects (documents).
In METS (if I understand correctly), the structural map and structural links are used only to describe the internal structure of objects.

>> In addition to extending the syntax of BibUpload, the significance of
>> internal BibDoc identifiers should be increased. It should be assured
>> that the same identifier can not be reused after deletion of a BibDoc.
>
> Are you talking about docnames?

I was rather thinking about BibDocId (represented in the database and used internally by some parts of the API).

>> +++ <BibDocRelation>
>>
>> This markup element enables uploading links between BibDocs being uploaded
>> to Invenio or already existing
>>
>> Example:
>>
>> <BibDocRelation bibdoc1="tmp:NewFigure1" version1="1" bibdoc2="12456"
>>                 version2="2" type="extracted_from"/>
>
> I really like the idea of creating links between specific versions.
> Unfortunately METS is not aware of versions :-(

Versions are crucial for us, exactly for the reason you noted in your last message.

>> +++ MoreInfo
>
> Mmh... to store in dynamic tables rather than blobs seems more
> complex than useful. It's a good usecase for using MongoDB and indexing
> the JSON representations of MoreInfo :-)

If MoreInfo grows, using blobs will become increasingly inefficient. In fact, some time ago I was talking with Roman about using some key-value store (he also needed something)... and the namespaces in MoreInfo were somewhat inspired by the column families present in HBase.

>> <record>
>>   <specialfield tag="001">234</specialfield>
>>   <datafield tag="BDR">
>>     <subfield code="a">12</subfield> <!--the identifier of BibDoc -->
>>     <subfield code="r">number of a document to reference</subfield>
>>     <subfield code="t">Main</subfield> <!--the identifier of BibDoc -->
>>     <!-- other subfields characteristic to the relation -->
>>   </datafield>
>> </record>
>
> Do you mean by "Main" the identifier of the BibDocRelation? Is the
> "number of a document to reference" the docid of an existing BibDoc?
I just mean the same thing as Main means right now: that a particular document is main for a given record (i.e. a fulltext can be the main doc, and the figures extracted from it are non-main). And yes, it is the number of an existing BibDoc, or a temporary id.

>> Example of linking to a document being uploaded in parallel:
>>
>> <record>
>>   <specialfield tag="001">234</specialfield>
>>   <datafield tag="BDR">
>>     <subfield code="a">tmp:NewDocument</subfield> <!--the identifier of BibDoc -->
>>     <subfield code="r">number of a document to reference</subfield>
>>     <subfield code="t">Main</subfield> <!--the identifier of BibDoc -->
>>     <!-- other subfields characteristic to the relation -->
>>   </datafield>
>> </record>
>>
>> ??? Should we always attach a document or only its particular version ?
>> (or marking that all versions? )
>
> As I mentioned before I really think that a link can only be made across
> specific versions of bibdoc.

Yes! This is exactly the use case that inspired relations between versions... But BDR is a link between a record and a document. BibDocRelation is a different thing: it does not involve records, whereas BDR does. BibDocRelation obviously has to be versioned; whether BDR should be was my question ;)

> As a side track, as this is needed also in the context of BibEdit, we
> were thinking of decoupling the semantic of an FFT tag from the
> --insert/correct/append/delete/replace mode being used in BibUpload. In
> the end these modes have a meaning WRT metadata but are a bit confusing
> WRT what to do with fulltext. For this reason it might be nice to
> officialize a subfield in the FFT to put the actual "command" to perform
> (i.e. append/revise/delete) a bit like today is done with the $t.

Makes sense, but it requires a bit of conceptual work: this would involve many special cases and situations whose semantics are not obvious.

>> +++ A larger example - Uploading of two new BibDoc and their attachment
>
> Your copyright example makes me think even more about METS! (see the
> Administrative Metadata section).
> <http://www.loc.gov/standards/mets/METSOverview.v2.html#admMD>

Indeed... I just wanted to illustrate that other modules (wasn't someone working on exactly this for Invenio?) fit into this scheme... Moreover, I wanted to have a piece of MoreInfo that would have to be attached to a particular file.

>> (...)
>
> So if I am well understanding, you are really proposing to specify two
> files at the same time with BibUpload. One with MARC and if this one
> contains BDR tags rather than FFT, the second file is consulted. Is this
> correct?

Almost. The second file should be provided regardless of whether BDR is used... in particular, we might upload things that are not linked to records, or only update existing objects.

> Maybe it might be the case to even put it in a wiki?

As you might have noticed, I was trying to format it wiki-like, but decided to post it here first, revise it after getting comments, and then put it into the wiki.

> In the end, on a side track, it would be really nice to refactor these
> class structure not to only represent bibdocs on the filesystem, but
> e.g. to be able to offer the same interface for URLs referenced in the
> MARC (in 8564 tags), so that it could become transparent to manage
> resources attached to records, regardless of them being on filesystem or
> remote. Similar side track would be to be able to have a class for
> transient bibdocs (e.g. wrapping temporary files on disk), that are not
> archived in the final structure of the filesystem. Indeed this can be
> done in general by assuming a bibdoc is not necessarily attached to a
> record.

What is the use case that would benefit from this? I cannot see it.

>> - Automatic transformation of MoreInfo into dynamic database tables.
>
> If really needed...
I would feel better with a key-value store ;)

> Overall, if I well understood everything there is a *lot* to change and
> improve and extend in the bibdocfile framework:
>  * move back the identifiers of bibdocs from docnames back to
>    docids which are guaranteed to be unique WRT the whole
>    installation

We can use the existing bibdocids and just make them more visible. Names are OK, but they are local.

>  * provide a web handler to access bibdocfiles regardless of them
>    being owned by a record (as the
>    current /record/123/files/foo.pdf will no longer work for non
>    fulltext) (BTW what about restriction/authorization? What if
>    bibdoc is referenced both by a public and a restricted record?
>    Should we go for the strongest restriction mode?)

It can still work, but in a slightly more distant future we might want to provide /object/123 along with /record/12.

>  * add moreinfo everywhere correctly

Yes.

>  * assure bibdocfile CLI tools still work

Yes, but it is based on FFT, which is not supposed to be broken, so this part would involve at most extending the bibdocfile CLI.

>  * add support for BDR and new file to BibUpload

Yes.

> Moreover I really would dream if integrating METS in your second file
> format would be possible.

This raises a great dilemma :) As you noted, this would really endanger the timeframe, which is rather crucial for me as I should concentrate on other things. On the other hand, it is obviously beneficial for Invenio to support a standard rather than invent one (unless it is significantly different). We would, though, have to extend METS.

> (...)

I will have a look at these tickets.

BTW... what do you think about JSON in XML? Do you think we should go for some encoding of everything in XML, or rather do it like here?

Cheers
Piotr

> (I am struggling with weird problems with regression tests... was
> BibRecDocsTest.test_BibRecDocs ever passing? For my taste the test
> requests incorrect file sizes ... and indeed it fails on my machine)

Yes! It was always passing.
Indeed the bibdocfile tests need to be refactored, as they are too monolithic (they were done in a month by a child of staff), and a failure at the beginning of the test will cause several tens of other small tests to fail.

--
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>
