[ 
https://issues.apache.org/jira/browse/TAVERNA-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stian Soiland-Reyes updated TAVERNA-1044:
-----------------------------------------
    Summary: Parsing COMBINE archive from JWSOnline skips metadata.rdf  (was: 
COMBIE parsing of JWSOnline skips metadata.rdf)

> Parsing COMBINE archive from JWSOnline skips metadata.rdf
> ---------------------------------------------------------
>
>                 Key: TAVERNA-1044
>                 URL: https://issues.apache.org/jira/browse/TAVERNA-1044
>             Project: Apache Taverna
>          Issue Type: Bug
>          Components: Taverna Language
>    Affects Versions: language 0.15.1
>            Reporter: Stian Soiland-Reyes
>            Assignee: Stian Soiland-Reyes
>            Priority: Major
>             Fix For: language 0.16.0
>
>
> When parsing a COMBINE archive from [JWS Online|http://jjj.mib.ac.uk/] such 
> as 
> http://jjj.mib.ac.uk/models/experiments/adlung2017_fig2f/export/combinearchive?download=1
>  - then the metadata.rdf does not seem to be parsed. 
> h2. Error trace
> {code}
> stain@biggie:/tmp$ curl -fO --remote-header-name 
> 'http://jjj.mib.ac.uk/models/experiments/adlung2017_fig2f/export/combinearchive?download=1'
> curl: Saved to filename 'adlung2017_fig2f.sedx'
> stain@biggie:/tmp$ java -jar 
> ~/software/taverna-tavlang-tool-0.15.1-incubating.jar convert --robundle 
> adlung2017_fig2f.sedx 
> ..
> May 10, 2018 10:35:43 AM 
> org.apache.taverna.robundle.manifest.combine.CombineManifest findAnnotations
> WARNING: Can't parse /metadata.rdf
> org.apache.jena.riot.RiotException: [line: 6, col: 43] {E202} Expecting XML 
> start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. 
> Maybe there should be an rdf:parseType='Literal' for embedding mixed XML 
> content in RDF. Maybe a striping error.
>       at 
> org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:128)
>       at 
> org.apache.jena.riot.lang.LangRDFXML$ErrorHandlerBridge.error(LangRDFXML.java:246)
>       at 
> org.apache.jena.rdfxml.xmlinput.impl.ARPSaxErrorHandler.error(ARPSaxErrorHandler.java:37)
>       at 
> org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:196)
>       at 
> org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:173)
>       at 
> org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:168)
>       at 
> org.apache.jena.rdfxml.xmlinput.impl.ParserSupport.warning(ParserSupport.java:194)
>       at org.apache.jena.rdfxml.xmlinput.states.Frame.warning(Frame.java:55)
>       at 
> org.apache.jena.rdfxml.xmlinput.states.Frame.characters(Frame.java:164)
>       at 
> org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.characters(XMLHandler.java:137)
>       at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown 
> Source)
>       at org.apache.xerces.impl.XMLNamespaceBinder.characters(Unknown Source)
>       at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown 
> Source)
>       at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>  Source)
>       at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
>       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>       at 
> org.apache.jena.rdfxml.xmlinput.impl.RDFXMLParser.parse(RDFXMLParser.java:150)
>       at org.apache.jena.rdfxml.xmlinput.ARP.load(ARP.java:118)
>       at org.apache.jena.riot.lang.LangRDFXML.parse(LangRDFXML.java:142)
>       at 
> org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:175)
>       at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:905)
>       at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:256)
>       at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:242)
>       at 
> org.apache.taverna.robundle.manifest.combine.CombineManifest.parseRDF(CombineManifest.java:240)
>       at 
> org.apache.taverna.robundle.manifest.combine.CombineManifest.findAnnotations(CombineManifest.java:332)
>       at 
> org.apache.taverna.robundle.manifest.combine.CombineManifest.readCombineArchive(CombineManifest.java:465)
>       at 
> org.apache.taverna.robundle.Bundle.readOrPopulateManifest(Bundle.java:121)
>       at org.apache.taverna.robundle.Bundle.getManifest(Bundle.java:87)
>       at 
> org.apache.taverna.tavlang.tools.convert.ToRobundle.convert(ToRobundle.java:60)
>       at 
> org.apache.taverna.tavlang.tools.convert.ToRobundle.<init>(ToRobundle.java:47)
>       at 
> org.apache.taverna.tavlang.CommandLineTool$CommandConvert.runcommand(CommandLineTool.java:226)
>       at 
> org.apache.taverna.tavlang.CommandLineTool$CommandConvert.execute(CommandLineTool.java:220)
>       at 
> org.apache.taverna.tavlang.CommandLineTool.parse(CommandLineTool.java:71)
>       at 
> org.apache.taverna.tavlang.TavernaCommandline.main(TavernaCommandline.java:26)
> {code}
> h2. Analysis
> This seems to be caused by invalid RDF/XML in the metadata.rdf added by JWS 
> Online:
> {code}
> stain@biggie:/tmp$ unzip adlung2017_fig2f.sedx
> stain@biggie:/tmp$ riot metadata.rdf 
> 10:39:17 ERROR riot                 :: [line: 6, col: 43] {E202} Expecting 
> XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. 
> Maybe there should be an rdf:parseType='Literal' for embedding mixed XML 
> content in RDF. Maybe a striping error.
> 10:39:17 ERROR riot                 :: [line: 43, col: 43] {E202} Expecting 
> XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. 
> Maybe there should be an rdf:parseType='Literal' for embedding mixed XML 
> content in RDF. Maybe a striping error.
> 10:39:17 ERROR riot                 :: [line: 152, col: 43] {E202} Expecting 
> XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. 
> Maybe there should be an rdf:parseType='Literal' for embedding mixed XML 
> content in RDF. Maybe a striping error.
> ...
> <file:///tmp/> <http://purl.org/dc/terms/description> "Built by JWS Online." .
> _:B5145c9a4X2Df8feX2D4a36X2Daba1X2Dacab299dd7d7 
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
> <http://purl.org/dc/terms/W3CDTF> .
> <file:///tmp/> <http://purl.org/dc/terms/created> 
> _:B5145c9a4X2Df8feX2D4a36X2Daba1X2Dacab299dd7d7 .
> <file:///tmp/models/adlung1.sbml> <http://purl.org/dc/terms/description> 
> "Exported by JWS Online from ..."
> {code}
> The broken RDF/XML follows this pattern:
> {code:xml}
>   <rdf:Description rdf:about=".">
>     <dcterms:description>Built by JWS Online.</dcterms:description>
>     <dcterms:created>
>       <dcterms:W3CDTF>2018-05-10T02:38:51Z</dcterms:W3CDTF>
>     </dcterms:created>
>   </rdf:Description>
> {code}
> As Jena points out, this is not valid RDF/XML: here the property 
> dcterms:created points to a new anonymous resource of type dcterms:W3CDTF, 
> but a resource can't directly wrap a literal. The literal would need a 
> nested property such as <rdf:value>.
> This is probably a confusion from 
> http://identifiers.org/combine.specifications/omex.version-1 which in its 
> example, for some reason, uses dcterms:W3CDTF as a property of an untyped 
> anonymous resource under dcterms:created:
> {code:xml}
> <dcterms:created rdf:parseType="Resource">
>   <dcterms:W3CDTF>2014-06-26T10:29:00Z</dcterms:W3CDTF>
> </dcterms:created>
> {code}
> This is semantically wrong as 
> [dcterms:W3CDTF|http://dublincore.org/documents/dcmi-terms/#terms-W3CDTF] is 
> defined as a Datatype (like int), not a Property. Similarly 
> [dcterms:created|http://dublincore.org/documents/dcmi-terms/#terms-created] 
> is defined with a range rdfs:Literal, which would not include a new W3CDTF 
> Resource.
> I believe dcterms:W3CDTF is meant as a grouping of the XSD datatypes like 
> [xsd:dateTime|https://www.w3.org/TR/xmlschema11-2/#dateTime] but is listed in 
> DCTerms for pure XML users. 
> dcterms:created is more commonly used with a typed RDF literal rather than 
> through some kind of anonymous "timestamp" resource. So normal use (outside 
> COMBINE) would be:
> {code:xml}
> <dcterms:created 
> rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-06-26T10:29:00Z</dcterms:created>
> {code}
> Our [CombineManifest 
> code|https://github.com/apache/incubator-taverna-language/blob/0.15.1-incubating/taverna-robundle/src/main/java/org/apache/taverna/robundle/manifest/combine/CombineManifest.java#L366]
>  supports both variants as the {{parseType=Resource}} variant is commonly 
> used by COMBINE producers.
> The example from JWS Online, however, falls in between the two. I have let 
> the authors know and recommended that they use the rdf:value or rdf:datatype 
> variant. The tavlang converter should then also recognize rdf:value.
> While it seems Jena's "riot" on the command line can ignore this syntactic 
> error and parse the other triples, loading with Jena's RDFDataMgr.read() 
> seems to bail out on the first error, meaning we also lose the 
> dcterms:creator statements, which are correctly defined in the metadata.rdf.
> This bug is to investigate if it's possible to reduce this error to a 
> warning, as well as add support for the rdf:value variant that we can 
> recommend to JWSOnline instead of the semantically broken 
> parseType="Resource" pattern.
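
To make the variants discussed above concrete, here is a minimal sketch of a tolerant lookup for the dcterms:created timestamp. It is not the CombineManifest code and does not use Jena: it assumes a plain XML-level pass with the JDK's DOM parser, which only requires well-formed XML and so is not stopped by the RDF/XML striping error. The class name CreatedTimestampSketch and the helpers wrap and findCreated are hypothetical, and the nested rdf:value form in the last variant is one reading of the suggestion in the description, not something JWS Online is known to emit.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of taverna-robundle.
class CreatedTimestampSketch {
    static final String DCTERMS = "http://purl.org/dc/terms/";

    /** Wraps a dcterms:created fragment in a minimal rdf:RDF document. */
    static String wrap(String createdXml) {
        return "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
             + " xmlns:dcterms=\"" + DCTERMS + "\">"
             + "<rdf:Description rdf:about=\".\">" + createdXml + "</rdf:Description>"
             + "</rdf:RDF>";
    }

    /** Returns the text of the first dcterms:created, whichever variant wraps it. */
    static Optional<String> findCreated(String rdfXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rdfXml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        NodeList created = doc.getElementsByTagNameNS(DCTERMS, "created");
        if (created.getLength() == 0) {
            return Optional.empty();
        }
        // Descend through any wrapper elements (dcterms:W3CDTF, rdf:value, ...)
        // until we reach the innermost element, which holds the literal text.
        Element current = (Element) created.item(0);
        for (Element child = firstChildElement(current); child != null;
                child = firstChildElement(current)) {
            current = child;
        }
        String text = current.getTextContent().trim();
        return text.isEmpty() ? Optional.empty() : Optional.of(text);
    }

    private static Element firstChildElement(Element parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String ts = "2018-05-10T02:38:51Z";
        String[] variants = {
            // Typed literal (common outside COMBINE)
            "<dcterms:created rdf:datatype=\"http://www.w3.org/2001/XMLSchema#dateTime\">"
                + ts + "</dcterms:created>",
            // parseType="Resource" pattern from the OMEX example
            "<dcterms:created rdf:parseType=\"Resource\"><dcterms:W3CDTF>"
                + ts + "</dcterms:W3CDTF></dcterms:created>",
            // In-between pattern produced by JWS Online (invalid RDF/XML)
            "<dcterms:created><dcterms:W3CDTF>"
                + ts + "</dcterms:W3CDTF></dcterms:created>",
            // Assumed rdf:value form: typed node with a nested rdf:value property
            "<dcterms:created><dcterms:W3CDTF><rdf:value>"
                + ts + "</rdf:value></dcterms:W3CDTF></dcterms:created>",
        };
        for (String variant : variants) {
            System.out.println(findCreated(wrap(variant)).orElse("(none)"));
        }
    }
}
```

Descending to the innermost element is deliberately lax: it treats every wrapper the same and ignores RDF semantics, so it would fit only as a fallback for when the RDF parser rejects metadata.rdf, not as a replacement for proper RDF parsing.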



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
