> For those models with corrupt metadata you probably want to regenerate
> the RDF from the original model that was(is?) in CVS (I have an
> archive of those raw files if they are 'lost').
> 
> Can you identify what is actually corrupt and write a small proposal
> to fix it (which may involve the above?). I see this as being more
> critical than anything since we don't particularly want people
> downloading and trying to update essentially corrupt data.
> 

Okay, here it is.

*Caused by the Repository and 4Suite, but repairable by a script I wrote:*
- 4Suite's way of naming blank nodes: I doubt it should assign an http
  URI to them.
  - fixed by replacing the URI prefix 'http://4suite.org/rdf/anonymous/'
    with 'rdf:#'
- 4Suite uses rdf:ID for rdf:Seq identifiers where rdf:about should be
  used.
  - can be repaired by search/replace of rdf:ID="http://4suite.org/...."
    with rdf:about="http://4suite.org/...."
  - or rdf:ID="rdf:#" with rdf:about="rdf:#"
  - or patch 4Suite.
  - this gets really fun when PCEnv tries to use that ID with the xml:base
    and then pushes it back into 4Suite: you start getting really long
    URIs that don't make sense.
- cmeta:id renaming was not done properly, resulting in id mismatch
  between the graph and the CellML model object (thus proper querying
  becomes impossible).
  - undo the renaming of cmeta:id, where possible.
  - It really shouldn't be renamed in the first place.
  - can be impossible to fix, see next section.
- cmeta:id renaming means recreation of many RDF nodes, and resource
  nodes were recreated as literal nodes, resulting in broken graphs.
  - fixed by regex replace of '>rdf:#....</ns:tag>' with
    ' rdf:resource="rdf:#...."/>'
- rdf:type values are resources, not literals, but were written out as
  literals
  - fixed by regex replace
- xml:base mishandling
  - RDF library issues, can be worked around
  - the CellML metadata library drops them when possible during
    serialization.
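
As a rough illustration of the search/replace repairs listed above, here is
a minimal sketch in Python. It is not the actual repairRdf / repair4Rdf code
in CMLmetadata.py; the function names and the sample 'ns:tag' element are
made up for this example, and it assumes the literal-valued elements carry
no attributes of their own.

```python
import re

# Prefixes taken from the list above.
ANON_PREFIX = "http://4suite.org/rdf/anonymous/"
BLANK_PREFIX = "rdf:#"


def rename_blank_nodes(xml_text):
    """Replace 4Suite's http-style blank node prefix with 'rdf:#'."""
    return xml_text.replace(ANON_PREFIX, BLANK_PREFIX)


def id_to_about(xml_text):
    """Turn rdf:ID into rdf:about, but only for generated node names."""
    pattern = re.compile(
        r'rdf:ID="((?:%s|%s)[^"]*)"'
        % (re.escape(ANON_PREFIX), re.escape(BLANK_PREFIX))
    )
    return pattern.sub(r'rdf:about="\1"', xml_text)


def literal_to_resource(xml_text):
    """Rewrite <ns:tag>rdf:#x</ns:tag> as <ns:tag rdf:resource="rdf:#x"/>.

    Assumption: the element has no attributes, so '>' directly follows
    the qualified name.
    """
    pattern = re.compile(r'<([\w.-]+:[\w.-]+)>(rdf:#[^<\s]+)</\1>')
    return pattern.sub(r'<\1 rdf:resource="\2"/>', xml_text)


def repair(xml_text):
    """Apply the three textual repairs in order."""
    xml_text = rename_blank_nodes(xml_text)
    xml_text = id_to_about(xml_text)
    return literal_to_resource(xml_text)


if __name__ == "__main__":
    sample = (
        '<rdf:Seq rdf:ID="http://4suite.org/rdf/anonymous/abc"/>'
        '<ns:tag>http://4suite.org/rdf/anonymous/def</ns:tag>'
    )
    print(repair(sample))
```

The order matters in this sketch: the blank node prefix is renamed first, so
the later two steps only ever need to look for 'rdf:#'. A real rdf:ID that
was written by hand (one that does not start with either prefix) is left
alone.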

The code that addresses the above issues currently resides in the
CellMLMetadata Library:

https://svn.physiomeproject.org/svn/physiome/CellMLMetadata/trunk/CMLmetadata.py

Search for 'def repairRdf' and 'def repair4Rdf' for the logic I use.

Feel free to critique the code, there are various shortcomings (which
may or may not be documented in that file).


*Definitely impossible to fix, as data are truly missing:*
- Incorrectly coded RDF
  - the RDF was originally written to be easier for humans to read, and
    so certain nodes that required the attribute rdf:parseType="Resource"
    did not have it, resulting in graphs not being parsed properly.
  - Thus those nodes were never generated in the graph.
  - Only solution is to fix them and reupload them, or add in the missing
    nodes.
- Programs that truncate data due to bugs or poor programming
  - PCEnv at one point did not save rdf:Seq properly due to a bug in
    Mozilla's RDF library.
  - Model Repository did not handle multiple JournalArticles properly,
    and it collapsed all lists of authors from all citations into one in
    an undefined manner (whatever the Python interpreter felt like doing
    at that particular time of day).
- Original cmeta:id is definitely lost in almost all cases.
  - Model Repository used to rename/replace them.

Lost data is lost.

It is possible to go through each model one by one, fix what is broken,
and grab the data from CVS if it is truly missing. However, the only way
to guarantee a clean slate is to upload all the models from CVS into the
repository all over again, and that is a drastic measure.

For now, I have mostly been doing damage control, and I have got the
repository to the point where a model with a valid RDF graph will stay
valid, and newly entered data will also result in a valid RDF graph
(as defined by the metadata specification; the resulting graphs have
been run through the W3C's RDF validation service).  I will be working on
more formal test cases for my RDF metadata library after I get the
modification history editor onto the model repository, so that James can
start using it to log the changes he made to the models in the place
where they should be going.

I am interested in your ideas on how to tackle this, if any.

Kind Regards,
Tommy.
