RE: [inspire-dev] Limitations with having standalone BibDocs

Samuele Kaplun Fri, 10 Jun 2011 11:42:40 +0200

Hi Piotr!

On Thu, 09/06/2011 at 17:21 +0200, Piotr Praczyk wrote:
> As I mentioned, I have a branch containing implementation of some new
> features of BibDoc - mostly extension of the MoreInfo notion and
> introduction of relations between records.


Can you make this branch public? I am really interested in seeing how
it evolves.

>   - Using BibUpload "MARC" is not a very clean solution for uploading
>     non-record data such as standalone documents.

I agree. In the end, what is the use case for standalone documents? How
can you search for them later? I guess they only make sense if they are
also referenced by at least one other document or by a record,
don't they?

>     In order to be compliant with the current BibUpload format, we would
>     have to include an FFT tag inside a record tag.
>     This would have to be interpreted by BibUpload as NOT modifying or
>     uploading any record.

It's true that FFT fields are not metadata but are instructions to
manipulate documents. In the end, we might simply extend BibUpload not
to create a new revision of the metadata if nothing changes. The thing
is that in BibUpload there is an algorithm that will always keep in sync
any change to documents and the corresponding 8564_ fields. In practice
these might change only in a few situations (e.g. document format added or
deleted, comment changed, description changed, icon changed).

>    - Packing data of a more complicated structure into "MARC" is
>      non-intuitive.

This is a recurring issue. We should move Invenio to use JSON
everywhere! But this will be the subject of another thread :-)

>      Of course, it is possible to encode anything in MARC, but it will
>      quickly become unreadable and the code implementing encoding/decoding
>      of the data will be more error-prone.
>      FFT should be left for fulltext upload where it serves the purpose
>      perfectly and should be understood as syntactic sugar providing an
>      abbreviated form of a more general upload.

It also serves well the case of many documents in many formats attached
to the same record. I hope all these use cases will still be supported
through FFTs.

>    - FFT stands for Fulltext File Transfer. Using it for non-fulltext
>      documents leads to confusion.

Let's rename it to Fast File Transfer, then :-)

>    - The BibUpload "MARC" obfuscates the way of thinking about documents.
>      The internal structure of documents and relations among them (and
>      relations between documents and records) is not reflected in the
>      structure of the FFT field.

Right!

>      Portions of information from different subfields land in completely
>      different database entities.

Right!

> + Uploading of the documents
> 
> Current mechanism for uploading documents to Invenio is very much
> oriented towards managing fulltexts that can belong to only one record.
> It is difficult to extend BibUpload to allow attachments of the same
> BibDoc to many records or to create BibDocs not related to any record
> using the FFT syntax. 

This makes me think we should probably not use BibUpload to manage
BibDocs (apart from keeping the FFT mechanism). What we have done
up to now for very complex manipulation of BibDocs (e.g. in the case of
WebSubmit) was to do everything with the API, and then send an FFT with
a FIX-MARC to synchronize the 8564_ fields so that they reflect the latest
state of the documents.

> It is also difficult to provide uploading of
> relations between existing BibDocs. This is because MARC provides
> a method of encoding tree-structured data with a maximum nesting depth
> of 2 (1 in the case of special fields).

Indeed :-(

> Data structures that need to be uploaded to Invenio are graphs.
> BibUpload is the only gateway for uploading data to Invenio. For the
> sake of uniformity, it should remain this way.

Well it's the only gateway for metadata, but as said before, it is
already happening today that documents are manipulated asynchronously
and FFT FIX-MARC is called afterwards to commit the changes into MARC.

> I believe the easiest and most efficient way of adapting BibUpload
> to extended usage of BibDocs and file attachments is providing a
> non-MARC XML based additional input that could be processed by
> BibUpload. The existing FFT (Fulltext File Transfer) field should be
> preserved and utilised ONLY when uploading fulltext documents. FFT
> should be understood as a convenient abbreviated syntax allowing a
> limited functionality of new XML syntax. In addition to FFT, a new
> "MARC" field could be introduced - BDR (BibDoc Reference). The purpose
> of this field would be to introduce a link between uploaded (or
> modified) record and an existing BibDoc. This would enable multiple
> relationship between records and documents which is currently
> explicitly blocked in the code of BibDoc class but theoretically
> allowed in the database structure.

So you would use it only for documents that are attached to at least one
record, wouldn't you? Otherwise you should really think of implementing a
tool separate from BibUpload that can act independently of records.

BTW, the BDR suggestion really sounds like METS to me.

<http://www.loc.gov/standards/mets/>

If you are going that far (and it makes perfect sense to me), then it
might be worth going one step further (of course this will clash with
the timing of the Inspire week) and supporting METS as input (at least
for some part of it). METS is an XML-based standard of the Library of
Congress for representing digital objects.

In particular, its "File Section"
<http://www.loc.gov/standards/mets/METSOverview.v2.html#filegrp> would
match the current FFT,

and the "Structural Map"
<http://www.loc.gov/standards/mets/METSOverview.v2.html#structmap>

and the "Structural Links"
<http://www.loc.gov/standards/mets/METSOverview.v2.html#structlink>

really sound to me like your BDR proposal.

> In addition to extending the syntax of BibUpload, the significance of
> internal BibDoc identifiers should be increased. It should be assured
> that the same identifier can not be reused after deletion of a BibDoc.

Are you talking about docnames? Currently these are supposed to be
unique WRT a record. And of course this works with fulltexts. The day
we have the same document referenced by more than one record, this
restriction will become an issue.

In fact this restriction was added in order to have a well-formed URL
for retrieving the document:
http://example.org/record/123/files/the-docname.pdf

However, if your document is referenced by several records, it will no
longer be feasible to ensure that the docname is unique among all the
records that reference it. Rather, the docname should become a
property of the link between a record and a document. A document by
itself will always have a unique ID in the form of its docid (and
docids are not going to be reused in case of deletion).
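
To make the idea concrete, here is a minimal sketch of a docname living
on the record-to-document link rather than on the document. The class
and method names are hypothetical and not part of the bibdocfile API;
only the URL shape is taken from the example above.

```python
class RecordDocumentLinks:
    """Hypothetical registry of links between records and documents."""

    def __init__(self):
        self._links = {}  # (recid, docname) -> docid

    def attach(self, recid, docid, docname):
        """Link docid to recid under a docname unique within that record."""
        key = (recid, docname)
        if self._links.get(key, docid) != docid:
            raise ValueError(
                "docname %r already used in record %d" % (docname, recid))
        self._links[key] = docid

    def resolve(self, recid, docname):
        """Return the docid behind /record/<recid>/files/<docname>."""
        return self._links[(recid, docname)]

    def url(self, recid, docname, fmt):
        """Build the download path, failing if the link does not exist."""
        self.resolve(recid, docname)
        return "/record/%d/files/%s%s" % (recid, docname, fmt)


links = RecordDocumentLinks()
links.attach(123, 42, "the-docname")    # docid 42 as seen from record 123
links.attach(456, 42, "figure-source")  # same document, other record and name
```

The docid stays the only global identifier, while each record keeps its
own readable name for the document.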

> ++ Syntax of the input encoding new elements
> 
> The additional input of BibUpload should be provided as an additional
> file containing following tags:

What do you mean by additional file? If this is not in the input
MARCXML, then we are really in the case in which it would be a good idea
to have a new bibtask to manipulate files.

> <BibDoc/>
> 
> <BibDocRelation/>
> 
> +++ <BibDoc>
> This tag allows uploading of a document that will be managed by the
> installation of Invenio.
> 
> The main subfield of BibDoc is File, which allows attaching a file in a
> particular format representing the BibDoc.
> 
>    <File format=".jpg" path="/tmp/some_figure.jpg" />

It really looks more and more like METS :-)

> If the format is not specified, it is guessed based on the file extension.
> 
> Before upload phase, the identifier that will be assigned to the
> document by Invenio is not known. 

In principle, if all the file manipulation were synchronous these could
have been pre-computed. But it's much nicer to assume, as you propose,
that this identifier will be decided by BibUpload/Whatever as a
black-box.

> Input passed to BibUpload should be
> able to include relations between uploaded documents and links from
> documents to records. This can be achieved by specifying temporary
> document identifiers. The BibDoc XML tag may specify the id property. Its
> value can be either an Invenio-assigned identifier (in which case the
> corresponding BibDoc can be updated) or a temporary identifier
> (prefixed with the "tmp:" string). The temporary identifier is
> recognised by BibUpload at least within the same BibUpload session and
> allows to reference a particular BibDoc from XML elements describing
> different entities. During the upload process, temporary identifiers
> are replaced with newly assigned Invenio identifiers.

Nice idea!
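
A minimal sketch of how such a resolution pass could work, assuming the
input has already been parsed into plain strings and tuples. The
function name and the allocate_docid callable are hypothetical; only
the "tmp:" prefix comes from the proposal.

```python
from itertools import count

TMP_PREFIX = "tmp:"


def resolve_ids(bibdoc_ids, relations, allocate_docid):
    """Replace temporary BibDoc identifiers with newly assigned ones.

    bibdoc_ids: ids of the uploaded BibDocs ("tmp:Name" or "123").
    relations: (bibdoc1, bibdoc2, type) tuples using the same ids.
    allocate_docid: hypothetical callable returning a fresh docid.
    """
    mapping = {}
    for doc_id in bibdoc_ids:
        if doc_id.startswith(TMP_PREFIX):
            # Each temporary name gets exactly one real id per session.
            if doc_id not in mapping:
                mapping[doc_id] = allocate_docid()
        else:
            mapping[doc_id] = int(doc_id)

    def lookup(doc_id):
        if doc_id in mapping:
            return mapping[doc_id]
        if doc_id.startswith(TMP_PREFIX):
            raise KeyError("unknown temporary identifier: %s" % doc_id)
        return int(doc_id)  # reference to an already existing BibDoc

    resolved = [(lookup(a), lookup(b), kind) for a, b, kind in relations]
    return mapping, resolved


fresh = count(1000)  # stand-in for the real docid allocator
mapping, resolved = resolve_ids(
    ["tmp:NewFigure1", "tmp:NewFigure2"],
    [("tmp:NewFigure1", "576", "extracted_from"),
     ("tmp:NewFigure2", "tmp:NewFigure1", "is_subfigure_of")],
    lambda: next(fresh),
)
```

A temporary identifier that was never declared raises an error instead
of silently creating a document.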

> +++ <BibDocRelation>
> 
> This markup element enables uploading links between BibDocs being uploaded to
> Invenio or already existing.
> 
> 
> Example:
> 
> <BibDocRelation bibdoc1="tmp:NewFigure1" version1="1" bibdoc2="12456" 
> version2="2" type="extracted_from"/>

I really like the idea of creating links between specific versions.
Unfortunately METS is not aware of versions :-(

> +++ MoreInfo
> 
> Each of these elements (BibDoc, File, BibDocRelation) can contain a
> definition of MoreInfo, which holds additional pieces of information
> divided into namespaces and having a key, value format. Namespaces (an
> additional level of dictionary allowing to group similar key, value
> pairs) are intended to minimize the possibility of conflicts between
> different modules utilising the same MoreInfo infrastructure. It will
> also be useful when adapting MoreInfo to store data in separate
> database tables rather than in a blob. (Should we proceed with this
> soon?)

Mmh... storing in dynamic tables rather than blobs seems more complex
than useful. It's a good use case for using MongoDB and indexing
the JSON representations of MoreInfo :-)

> <MoreInfo>
>   <element category="plots" key="references" encoding="JSON">
>     <![CDATA[
>       [
>         {
>           "text": "In Figure 1 we can see the difference between (...)",
>           "position": 1123
>         },
>         {
>           "text": "(...) The results of the experiment are illustrated in Figure 1 (...)",
>           "position": 256
>         }
>       ]
>     ]]>
>   </element>
>   <element category="plots" key="x">10</element>
>   <element category="plots" key="y">20</element>
>   <element category="plots" key="width">600</element>
>   <element category="plots" key="height">400</element>
>   <element category="plots" key="caption">This is a caption of the figure</element>
>   <!-- and some other properties assigned by a different module
>      - for instance the access control or general use flags -->
>   <element category="general" key="flags">abcds</element>
>   <element category="general" key="visibility">HIDDEN</element>
> </MoreInfo>
> 
> Elements of MoreInfo (addressed by category and key) can be either
> strings or more complicated JSON-encoded values. Usage of JSON is
> slightly clumsy in the context of XML, which itself provides data
> encoding, but seems to be the simplest solution. We are using JSON in
> many places already and it seems natural for representation of data.
> Another solution would be to replace JSON with some type of XML
> encoding (we would have to encode, for example, lists) or to replace the
> additional BibUpload input entirely by JSON.
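
For what it is worth, decoding such a mixed XML/JSON MoreInfo block is
cheap with the standard library. The sketch below uses a shortened,
made-up sample and a hypothetical parse_moreinfo helper; note that a
CDATA section is terminated by "]]>".

```python
import json
import xml.etree.ElementTree as ET

# Shortened sample in the proposed syntax (texts abbreviated for brevity).
SAMPLE = """<MoreInfo>
  <element category="plots" key="references" encoding="JSON"><![CDATA[
    [{"text": "In Figure 1 we can see (...)", "position": 1123}]
  ]]></element>
  <element category="plots" key="width">600</element>
  <element category="general" key="visibility">HIDDEN</element>
</MoreInfo>"""


def parse_moreinfo(xml_text):
    """Return {category: {key: value}}, decoding JSON-encoded elements."""
    info = {}
    for element in ET.fromstring(xml_text).findall("element"):
        raw = (element.text or "").strip()
        if element.get("encoding") == "JSON":
            value = json.loads(raw)
        else:
            value = raw  # plain values stay strings
        info.setdefault(element.get("category"), {})[element.get("key")] = value
    return info


info = parse_moreinfo(SAMPLE)
```

Plain elements come back as strings (e.g. "600"), so any typing beyond
that would have to be agreed per category.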

> +++ Attaching documents to records
> 
> The FFT (Fulltext File Transfer) "MARC" tag allowing to upload
> documents and attach them to the publication is not flexible enough to
> allow attaching the same document to many records or to allow upload of
> relations between documents. It is, though, a very convenient manner of
> uploading documents that are full texts, so by nature are attached
> (at least initially) to only one record. This syntax should be
> preserved but its usage should be limited to fulltexts. The semantics
> of FFT should be understood as an abbreviated form of uploading
> particular type of BibDocs.

Makes sense.

> Besides FFT, we should provide one more special "MARC" tag BDR (or
> other name) - BibDoc Reference which could create link between
> modified/uploaded record and a BibDoc. Subfields of the FFT tag should
> contain all pieces of information characteristic to the link between
> record and BibDoc. Such information include for example type of BibDoc
> (one BibDoc may be the Main document of one record while only a figure
> in another).
> 
> Example of linking to an existing BibDoc:
>   <record>
>     <specialfield tag="001">234</specialfield>
>     <datafield tag="BDR">
>       <subfield code="a">12</subfield> <!--the identifier of BibDoc -->
>       <subfield code="r">number of a document to reference</subfield>
>       <subfield code="t">Main</subfield> <!--the identifier of BibDoc -->
>       <!-- other subfields characteristic to the relation -->
>     </datafield>
>   </record>

By "Main", do you mean the identifier of the BibDocRelation? Is the
"number of a document to reference" the docid of an existing BibDoc?

> Example of linking to a document being uploaded in parallel:
> 
>   <record>
>     <specialfield tag="001">234</specialfield>
>     <datafield tag="BDR">
>       <subfield code="a">tmp:NewDocument</subfield> <!--the identifier of BibDoc -->
>       <subfield code="r">number of a document to reference</subfield>
>       <subfield code="t">Main</subfield> <!--the identifier of BibDoc -->
>       <!-- other subfields characteristic to the relation -->
>     </datafield>
>   </record>
> 
> ??? Should we always attach a document or only its particular version ?
> (or marking that all versions? )

As I mentioned before, I really think that a link can only be made
between specific versions of bibdocs. Imagine the case where you have
fulltextA;version1 from which you extract figureA;version1 and
figureB;version1.

If you then revise fulltextA twice, (for any reason), and then you
re-extract the figures, you will end up having a relation between
fulltextA;version3 and figureA;version2 and figureB;version2.

This follows the same philosophy as the current BibDocFile framework,
where different formats of the same document are aligned by version.
This is very visible in the /files panel: if you have fulltext.pdf;1 and
fulltext.doc;1, and you revise it with fulltext.doc;2, then
fulltext.pdf;1 becomes somewhat hidden, as it would no longer correctly
represent the latest revision of the document.
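
A minimal sketch of this alignment rule, assuming files are plain
(docname, format, version) tuples rather than real BibDocFile objects.
The function name is hypothetical.

```python
def visible_files(files):
    """Keep only the files belonging to the latest version of a document.

    files: iterable of (docname, fmt, version) tuples for ONE document.
    Formats that were not re-uploaded with the latest version are hidden.
    """
    files = list(files)
    if not files:
        return []
    latest = max(version for _, _, version in files)
    return [f for f in files if f[2] == latest]


files = [
    ("fulltext", ".pdf", 1),
    ("fulltext", ".doc", 1),
    ("fulltext", ".doc", 2),  # revision that only provides a .doc
]
current = visible_files(files)
```

Here only fulltext.doc;2 remains visible; the version 1 files are still
stored but no longer represent the current state of the document.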

> The behaviour of all proposed extensions should be uniform with current
> behaviour of BibUpload when working in insert, update, append and correct
> modes.

As a side track, as this is also needed in the context of BibEdit, we
were thinking of decoupling the semantics of an FFT tag from the
--insert/correct/append/delete/replace mode being used in BibUpload. In
the end these modes have a meaning WRT metadata but are a bit confusing
WRT what to do with fulltext. For this reason it might be nice to
formalize a subfield in the FFT holding the actual "command" to perform
(i.e. append/revise/delete), a bit like what is done today with the $t.

> +++ A larger example - Uploading of two new BibDoc and their attachment
> to two existing records and marking that they are extracted from a
> fulltext document of the given record.
> 
> In this example we assume that 576 is the identifier of the fulltext bibdoc
> corresponding to the updated record.
> 
> The additional BibUpload input file:
> 
> <BibDoc id="tmp:NewFigure1">
>     <File format=".png" path="/tmp/figure.png"/>
>     <File format=".jpg" path="/tmp/figure.jpg"/>
>     <File format=".svg" path="/tmp/figure.svg">
>       <MoreInfo>
>         <!-- here for example information, encoded by a different module, that
>              this file can not be published because of copyright problems (just
>              an example) -->
>       </MoreInfo>
>     </File>
> 
>     <MoreInfo>
>       <!-- in this example we upload only the text present inside a figure as
>            an example of metadata fitting at this place -->
>       <element category="plots" key="internal_text">\tau neutrino NCGS axis (...)</element>
>     </MoreInfo>
> </BibDoc>

Your copyright example makes me think even more about METS! (see the
Administrative Metadata section).
<http://www.loc.gov/standards/mets/METSOverview.v2.html#admMD>


> <BibDoc id="tmp:NewFigure2">
>     <File format=".png" path="/tmp/figure2.png"></File>
>     <File format=".jpg" path="/tmp/figure2.jpg"></File>
>     <!-- additional MoreInfo descriptions and other pieces of MetaData-->
> </BibDoc>
> 
> <!-- the description of the relation between new BibDoc describing figure and
>      the existing FullText document saved in a BibDoc 576, version 1.
>      This relation does not depend on format. -->
> 
> <BibDocRelation bibdoc1="tmp:NewFigure1" version1="tmp:NewFigure1:lastver" 
> bibdoc2="576" version2="1" type="extracted_from">
>   <MoreInfo>
>     <element category="plots" key="references" encoding="JSON">
>       <![CDATA[
>         [
>           {
>             "text": "In Figure 1 we can see the difference between (...)",
>             "position": 1123
>           },
>           {
>             "text": "(...) The results of the experiment are illustrated in Figure 1 (...)",
>             "position": 256
>           }
>         ]
>       ]]>
>     </element>
>     <element category="plots" key="page">2</element>
>     <element category="plots" key="x">10</element>
>     <element category="plots" key="y">20</element>
>     <element category="plots" key="width">600</element>
>     <element category="plots" key="height">400</element>
>     <element category="plots" key="caption">This is a caption of the figure</element>
>     <!-- and some other properties assigned by a different module
>      - for instance the access control or general use flags -->
>     <element category="general" key="flags">abcds</element>
>     <element category="general" key="visibility">HIDDEN</element>
>   </MoreInfo>
> </BibDocRelation>
> 
> <BibDocRelation bibdoc1="tmp:NewFigure2" version1="tmp:NewFigure1:lastver" 
> bibdoc2="576" version2="1" type="extracted_from">
>    Here additional properties similarly to the 1st example
> </BibDocRelation>
> 
> <BibDocRelation bibdoc1="tmp:NewFigure2" version1="tmp:NewFigure2:lastver" 
> bibdoc2="tmp:NewFigure1" version2="tmp:NewFigure2:lastver" 
> type="is_subfigure_of">
>   <MoreInfo>
>     <!-- some data here -->
>   </MoreInfo>
> </BibDocRelation>
> 
> The "MARC" file:
> 
>   <record>
>     <specialfield tag="001">234</specialfield>
>     <datafield tag="BDR">
>       <subfield code="a">tmp:NewFigure1</subfield> <!--the identifier of 
> BibDoc -->
>       <subfield code="t">Figure</subfield> <!--the identifier of BibDoc -->
>       <!-- other subfields characteristic to the relation -->
>     </datafield>
> 
>     <datafield tag="BDR">
>       <subfield code="a">tmp:NewFigure2</subfield> <!--the identifier of 
> BibDoc -->
>       <subfield code="t">Figure</subfield> <!--the identifier of BibDoc -->
>       <!-- other subfields characteristic to the relation -->
>     </datafield>
>   </record>

So if I understand correctly, you are really proposing to pass two
files to BibUpload at the same time. One contains MARC, and if it
contains BDR tags rather than FFT, the second file is consulted. Is this
correct?

> Thank You for reading this rather long e-mail. There are still some
> issues that have not been tackled here, but their solution is not as
> burning as this one. Here I provide just a short list of them.

Thank you for writing it. I didn't expect it to be so long :-) Maybe it
would even be worth putting it in a wiki? But it's true that by keeping
it on the mailing list, the other developers and fellows might
contribute more easily.

> - uniformity of data models on different levels. (This is not an error
> but leads to more complicated code). We have three different points
> of view on BibDocs/BibDocFiles/versions. One is implemented in the
> Python API (API layer), the second one in the database and file-system
> (storage layer) and a completely different one in the presentation
> layer (/files pages of a record) - this might be confusing and probably
> should be unified a little.

>  - A little more explicit version management. Maybe I am wrong, but it
> feels a little uncomfortable to have a database column referring to the
> entity that is encoded only in the file name stored in the file system
> (bibdoc version).
> 
Indeed you might think of files as represented as:

id/version/format although in the filesystem they are then stored as:

id/docname.format;version
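
The mapping between the two views can be sketched as follows. This
follows the simplified layout written above, not necessarily the exact
on-disk layout of a real installation, and it assumes (my assumption)
that a docname never contains a dot, so that everything after the first
dot is the format. Both function names are hypothetical.

```python
def storage_path(docid, docname, fmt, version):
    """Build 'id/docname.format;version' from the logical coordinates."""
    return "%d/%s%s;%d" % (docid, docname, fmt, version)


def parse_storage_path(path):
    """Recover (docid, docname, fmt, version) from a stored path.

    Assumes the docname contains no dot, so multi-part formats such as
    '.tar.gz' are attributed entirely to the format.
    """
    directory, _, filename = path.partition("/")
    stem, _, version = filename.rpartition(";")
    docname, dot, extension = stem.partition(".")
    return int(directory), docname, dot + extension, int(version)


path = storage_path(123, "the-docname", ".pdf", 2)
parsed = parse_storage_path(path)
```

The round trip shows that the version is only recoverable from the file
name, which is the discomfort raised in the quoted point.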

Today the docname plays an important role in identifying the document
(and it's even enforced in the filesystem). But nothing prevents us from
making it a property of the link between the record and the document,
just for the sake of having meaningful and nice URLs to download them
(rather than the arid
</getfile?id=123&format=.pdf&version=2>).

The version is used in the MoreInfo blob only, but of course, as soon as
you put the MoreInfo at every level of the framework, it makes
perfect sense to have new tables:

<docid, moreinfo>
<docid, version, moreinfo>
<docid, version, format, moreinfo>
<docid, docid, moreinfo>
<recid, docid, moreinfo>

In practice, any combination is worth having.
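
As a rough in-memory sketch of the same idea (the class name, the level
tags and the API are all hypothetical), each level can be addressed by
a tagged tuple, with the namespace and key from the proposal below it:

```python
class MoreInfoStore:
    """Hypothetical namespaced key/value store attached to several levels.

    An entity is a tagged tuple, for example:
      ("doc", docid)
      ("version", docid, version)
      ("format", docid, version, fmt)
      ("doc-doc", docid1, docid2)
      ("rec-doc", recid, docid)
    The tag is needed because several levels are pairs of integers.
    """

    def __init__(self):
        self._data = {}  # entity -> namespace -> key -> value

    def set(self, entity, namespace, key, value):
        self._data.setdefault(entity, {}).setdefault(namespace, {})[key] = value

    def get(self, entity, namespace, key, default=None):
        return self._data.get(entity, {}).get(namespace, {}).get(key, default)


store = MoreInfoStore()
store.set(("doc", 576), "general", "visibility", "HIDDEN")
store.set(("doc-doc", 1000, 576), "plots", "page", 2)
store.set(("rec-doc", 234, 1000), "general", "type", "Figure")
```

Whether this ends up as one table per level or a single table with a
level column is an open choice; the addressing scheme stays the same.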

> - BibDoc Python class in fact reflects needs of FullTexts. It should
> probably be stripped from functions that are typical to FT treatment.
> They should be moved to a subclass. (ie functions extracting fulltext)

Well, it represents a document. Could you be more specific about which
functions are only related to fulltext?

In the end, on a side track, it would be really nice to refactor this
class structure so that it does not only represent bibdocs on the
filesystem, but can e.g. offer the same interface for URLs referenced in
the MARC (in 8564 tags). Managing resources attached to records would
then become transparent, regardless of them being on the filesystem or
remote. A similar side track would be to have a class for
transient bibdocs (e.g. wrapping temporary files on disk) that are not
archived in the final structure of the filesystem. Indeed this can be
done in general by assuming a bibdoc is not necessarily attached to a
record.

> - Automatic transformation of MoreInfo into dynamic database tables.

If really needed...

> - Behaviour of bibdocs upon update - currently there is a new version
> every time we change something. Maybe there should be a new version
> only if we change the file and not meta-data ?

? This is not true. What version are you talking about? Of course if you
are modifying a description, this must be reflected in the MARC (in the
8564_ tag) and hence the MARC should be updated with the new
description. Revisions to files are only added if you want to add a new
physical file. If you use e.g. FFT or the bibdocfile API just to change
a property (in the case of FFT, you can simply not specify a $a),
then no new version will be added.

Overall, if I understood everything correctly, there is a *lot* to
change, improve and extend in the bibdocfile framework:

      * move the identifiers of bibdocs from docnames back to
        docids which are guaranteed to be unique WRT the whole
        installation
      * provide a web handler to access bibdocfiles regardless of them
        being owned by a record (as the
        current /record/123/files/foo.pdf will no longer work for non
        fulltext) (BTW what about restriction/authorization? What if
        bibdoc is referenced both by a public and a restricted record?
        Should we go for the strongest restriction mode?)
      * add moreinfo everywhere correctly
      * assure bibdocfile CLI tools still work
      * add support for BDR and new file to BibUpload

Moreover, I would really love it if METS could be integrated into your
second file format. It would be a real pity to go so far in
extending, but not far enough to support METS. But this will surely not
be possible for your quick prototype for early June that, I fear, like
every prototype, will stay forever :-)

Overall, I think it would be great if, while you develop, you could take
these tickets into consideration (at least to be ready for a future
implementation):

#655: enhancement: BibDocFile-level restriction (new)
#605: enhancement: Permalink-like support for URLs referenced in record (new)

Cheers!
        Sam

P.s. 

> (I am struggling with weird problems with regression tests... was
> BibRecDocsTest.test_BibRecDocs ever passing? For my taste the test
> requests incorrect file sizes ... and indeed it fails on my machine)

Yes! It was always passing. Indeed the bibdocfile tests need to be
refactored as they are too monolithic (they were done in a month by a
child of staff), and a failure at the beginning of the test will cause
several tens of other small tests to fail.


-- 
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>
