> For those models with corrupt metadata you probably want to regenerate
> the RDF from the original model that was(is?) in CVS (I have an
> archive of those raw files if they are 'lost').
>
> Can you identify what is actually corrupt and write a small proposal
> to fix it (which may involve the above?). I see this as being more
> critical than anything since we don't particularly want people
> downloading and trying to update essentially corrupt data.
>
Okay, here it is.

*Caused by Repository and 4Suite, but repairable by a script I wrote:*

- 4Suite's way of naming blank nodes; I doubt it should assign an http URI for them.
  - fixed by replacing the URL 'http://4suite.org/rdf/anonymous/' with 'rdf:#'
- 4Suite uses rdf:ID for rdf:Seq identifiers where rdf:about should be used.
  - can be repaired by search/replace of rdf:ID="http://4suite.org/...." with rdf:about="http://4suite.org/...."
    - or rdf:ID="rdf:#" with rdf:about="rdf:#"
  - or patch 4Suite.
  - this gets real fun when PCEnv tries to use that ID with the xml:base and then pushes it back into 4Suite: you start getting really long URIs that don't make sense.
- cmeta:id renaming was not done properly, resulting in an id mismatch between the graph and the CellML model object (thus proper querying becomes impossible).
  - undo the renaming of cmeta:id, where possible.
  - It really shouldn't have been renamed in the first place.
  - can be impossible to fix, see next section.
- cmeta:id renaming meant recreating many RDF nodes, and resource nodes were recreated as literal nodes, resulting in broken graphs.
  - regex replace of '>rdf:#....</ns:tag>' with ' rdf:resource="rdf:#..."/>'
  - rdf:type values are resources and not literals
    - regex replace
- xml:base mishandling
  - RDF library issues, can be worked around
  - the CellML metadata library drops them when possible during serialization.

The code that addresses the above issues currently resides in the CellMLMetadata Library:

https://svn.physiomeproject.org/svn/physiome/CellMLMetadata/trunk/CMLmetadata.py

Search for 'def repairRdf' and 'def repair4Rdf' for the logic I use. Feel free to critique the code; there are various shortcomings (which may or may not be documented in that file).

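As a rough illustration of the search/replace repairs listed above, here is a minimal sketch. It is not the code in CMLmetadata.py: the function name repair_sketch and the exact regular expressions are assumptions made for illustration, it works on the serialized RDF/XML as plain text, and it only handles reference elements that carry no attributes. The rdf:type repair is left out because the shape of those literals is not spelled out here.

```python
import re

# Both prefixes are taken from the list above.
ANON_PREFIX = "http://4suite.org/rdf/anonymous/"  # 4Suite's blank node naming
BLANK_PREFIX = "rdf:#"                            # replacement prefix

# Hypothetical pattern: an attribute-less element such as <ns:tag> whose
# only content is a blank node name written as literal text.
LITERAL_REF = re.compile(r'<(\w[\w.\-]*:\w[\w.\-]*)>\s*(rdf:#[^<\s]+)\s*</\1>')


def repair_sketch(xml_text):
    """Apply three textual repairs to a serialized RDF/XML string."""
    # 1. Rename 4Suite's http-style blank node names.
    text = xml_text.replace(ANON_PREFIX, BLANK_PREFIX)
    # 2. Use rdf:about instead of rdf:ID for the renamed identifiers.
    text = re.sub(r'rdf:ID="(rdf:#[^"]*)"', r'rdf:about="\1"', text)
    # 3. Turn node references stored as literals back into resources.
    text = LITERAL_REF.sub(r'<\1 rdf:resource="\2"/>', text)
    return text


if __name__ == "__main__":
    broken = (
        '<rdf:Seq rdf:ID="http://4suite.org/rdf/anonymous/abc"/>'
        '<bqs:author>http://4suite.org/rdf/anonymous/abc</bqs:author>'
    )
    print(repair_sketch(broken))
```

Doing this on the text rather than on the parsed graph is deliberate in the sketch: the broken files may not round-trip through an RDF parser without picking up further damage.
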
*Definitely impossible to fix, as data are truly missing:*

- Incorrectly coded RDF
  - the RDF was originally written in a way that was easier for humans to read, and so certain nodes that required the attribute rdf:parseType="Resource" did not have it, resulting in graphs not being parsed properly.
  - Thus those nodes are never generated in the graph.
  - Only solution is to fix them and reupload them, or add in the missing nodes.
- Programs that truncate data due to bugs or poor programming
  - PCEnv at one point did not save rdf:Seq properly due to a bug in Mozilla's RDF library.
  - Model Repository did not handle multiple JournalArticles properly, and it collapsed all lists of authors from all citations into one in an undefined manner (whatever the Python interpreter felt like doing at that particular time of day).
- Original cmeta:id is definitely lost in almost all cases.
  - Model Repository used to rename/replace them.

Lost data is lost. It is possible to go through each model one by one and fix what is broken, and grab the data from CVS if it is truly missing. However, the only way to guarantee a clean slate is to upload all the models from CVS into the repository all over again, and that is a drastic measure.

For now, I have mostly been doing damage control, and I have got the repository to the point where a model with a valid RDF graph will stay valid, and new data that gets entered will also result in a valid RDF graph (as defined by the metadata specification; the resulting graph has been run through the W3C's RDF validation service).

I will be working on making more formal test cases for my RDF metadata library after I get the modification history editor onto the model repository, so that James can start using it to log the changes he made to the model where they should be going.

I am interested in your ideas on how to tackle this, if any.

Kind Regards,
Tommy.

_______________________________________________ cellml-discussion mailing list [email protected] http://www.cellml.org/mailman/listinfo/cellml-discussion
