Stian Soiland-Reyes created TAVERNA-1044:
--------------------------------------------
Summary: COMBINE parsing of JWSOnline skips metadata.rdf
Key: TAVERNA-1044
URL: https://issues.apache.org/jira/browse/TAVERNA-1044
Project: Apache Taverna
Issue Type: Bug
Components: Taverna Language
Affects Versions: language 0.15.1
Reporter: Stian Soiland-Reyes
Assignee: Stian Soiland-Reyes
Fix For: language 0.16.0
The metadata.rdf does not seem to be parsed when converting a COMBINE archive
from [JWS Online|http://jjj.mib.ac.uk/], such as
http://jjj.mib.ac.uk/models/experiments/adlung2017_fig2f/export/combinearchive?download=1
h2. Error trace
{code}
stain@biggie:/tmp$ curl -fO --remote-header-name 'http://jjj.mib.ac.uk/models/experiments/adlung2017_fig2f/export/combinearchive?download=1'
curl: Saved to filename 'adlung2017_fig2f.sedx'
stain@biggie:/tmp$ java -jar ~/software/taverna-tavlang-tool-0.15.1-incubating.jar convert --robundle adlung2017_fig2f.sedx
..
May 10, 2018 10:35:43 AM org.apache.taverna.robundle.manifest.combine.CombineManifest findAnnotations
WARNING: Can't parse /metadata.rdf
org.apache.jena.riot.RiotException: [line: 6, col: 43] {E202} Expecting XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:128)
    at org.apache.jena.riot.lang.LangRDFXML$ErrorHandlerBridge.error(LangRDFXML.java:246)
    at org.apache.jena.rdfxml.xmlinput.impl.ARPSaxErrorHandler.error(ARPSaxErrorHandler.java:37)
    at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:196)
    at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:173)
    at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:168)
    at org.apache.jena.rdfxml.xmlinput.impl.ParserSupport.warning(ParserSupport.java:194)
    at org.apache.jena.rdfxml.xmlinput.states.Frame.warning(Frame.java:55)
    at org.apache.jena.rdfxml.xmlinput.states.Frame.characters(Frame.java:164)
    at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.characters(XMLHandler.java:137)
    at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
    at org.apache.xerces.impl.XMLNamespaceBinder.characters(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.jena.rdfxml.xmlinput.impl.RDFXMLParser.parse(RDFXMLParser.java:150)
    at org.apache.jena.rdfxml.xmlinput.ARP.load(ARP.java:118)
    at org.apache.jena.riot.lang.LangRDFXML.parse(LangRDFXML.java:142)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:175)
    at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:905)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:256)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:242)
    at org.apache.taverna.robundle.manifest.combine.CombineManifest.parseRDF(CombineManifest.java:240)
    at org.apache.taverna.robundle.manifest.combine.CombineManifest.findAnnotations(CombineManifest.java:332)
    at org.apache.taverna.robundle.manifest.combine.CombineManifest.readCombineArchive(CombineManifest.java:465)
    at org.apache.taverna.robundle.Bundle.readOrPopulateManifest(Bundle.java:121)
    at org.apache.taverna.robundle.Bundle.getManifest(Bundle.java:87)
    at org.apache.taverna.tavlang.tools.convert.ToRobundle.convert(ToRobundle.java:60)
    at org.apache.taverna.tavlang.tools.convert.ToRobundle.<init>(ToRobundle.java:47)
    at org.apache.taverna.tavlang.CommandLineTool$CommandConvert.runcommand(CommandLineTool.java:226)
    at org.apache.taverna.tavlang.CommandLineTool$CommandConvert.execute(CommandLineTool.java:220)
    at org.apache.taverna.tavlang.CommandLineTool.parse(CommandLineTool.java:71)
    at org.apache.taverna.tavlang.TavernaCommandline.main(TavernaCommandline.java:26)
{code}
h2. Analysis
This seems to be caused by invalid RDF/XML in the metadata.rdf added by JWS
Online:
{code}
stain@biggie:/tmp$ unzip adlung2017_fig2f.sedx
stain@biggie:/tmp$ riot metadata.rdf
10:39:17 ERROR riot :: [line: 6, col: 43] {E202} Expecting XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.
10:39:17 ERROR riot :: [line: 43, col: 43] {E202} Expecting XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.
10:39:17 ERROR riot :: [line: 152, col: 43] {E202} Expecting XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.
...
<file:///tmp/> <http://purl.org/dc/terms/description> "Built by JWS Online." .
_:B5145c9a4X2Df8feX2D4a36X2Daba1X2Dacab299dd7d7 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/terms/W3CDTF> .
<file:///tmp/> <http://purl.org/dc/terms/created> _:B5145c9a4X2Df8feX2D4a36X2Daba1X2Dacab299dd7d7 .
<file:///tmp/models/adlung1.sbml> <http://purl.org/dc/terms/description> "Exported by JWS Online from ..."
{code}
The broken RDF/XML follows this pattern:
{code:xml}
<rdf:Description rdf:about=".">
  <dcterms:description>Built by JWS Online.</dcterms:description>
  <dcterms:created>
    <dcterms:W3CDTF>2018-05-10T02:38:51Z</dcterms:W3CDTF>
  </dcterms:created>
</rdf:Description>
{code}
As Jena points out, this is not valid RDF/XML. It states that the property
dcterms:created points to a new anonymous resource of type dcterms:W3CDTF, but
a resource can't directly wrap a literal. The literal would need a new nested
property such as <rdf:value>.
This confusion probably stems from
http://identifiers.org/combine.specifications/omex.version-1 which in its
example, for some reason, uses dcterms:W3CDTF as a property of an untyped
anonymous resource under dcterms:created:
{code:xml}
<dcterms:created rdf:parseType="Resource">
  <dcterms:W3CDTF>2014-06-26T10:29:00Z</dcterms:W3CDTF>
</dcterms:created>
{code}
This is semantically wrong as
[dcterms:W3CDTF|http://dublincore.org/documents/dcmi-terms/#terms-W3CDTF] is
defined as a Datatype (like int), not a Property. Similarly
[dcterms:created|http://dublincore.org/documents/dcmi-terms/#terms-created] is
defined with a range rdfs:Literal, which would not include a new W3CDTF
Resource.
I believe dcterms:W3CDTF is meant as a grouping of the XSD datatypes like
[xsd:dateTime|https://www.w3.org/TR/xmlschema11-2/#dateTime] but is listed in
DCTerms for pure XML users.
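To illustrate that reading, here is a minimal sketch that maps a W3CDTF-style string to the XSD datatype with the same lexical form. The mapping and the helper name {{xsd_datatype_for}} are assumptions made for illustration; DCMI does not define such a mapping, and Python is used only because it keeps the example short.

```python
import re

# Time zone designator: "Z" or a +hh:mm / -hh:mm offset.
_TZD = r"(?:Z|[+-]\d{2}:\d{2})"

# W3CDTF granularities paired with the XSD datatype sharing that lexical form.
# Assumption: this pairing is illustrative, not taken from DCMI Terms.
# W3CDTF also allows hh:mm without seconds, which xsd:dateTime does not
# accept, so that form is deliberately left unmapped.
W3CDTF_FORMS = [
    (re.compile(r"\d{4}"), "xsd:gYear"),
    (re.compile(r"\d{4}-\d{2}"), "xsd:gYearMonth"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "xsd:date"),
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?" + _TZD),
     "xsd:dateTime"),
]


def xsd_datatype_for(value):
    """Return the XSD datatype name whose lexical form matches, or None.

    Only the shape is checked; ranges (month 13, hour 25) are not validated.
    """
    for pattern, datatype in W3CDTF_FORMS:
        if pattern.fullmatch(value):
            return datatype
    return None


if __name__ == "__main__":
    for sample in ("2018-05-10T02:38:51Z", "2014-06-26", "2014"):
        print(sample, "->", xsd_datatype_for(sample))
```

Both timestamps quoted in this issue are full date-times with a Z suffix, so under this assumed mapping they would be typed as xsd:dateTime.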
dcterms:created is more commonly used with a typed RDF literal rather than
through some kind of anonymous "timestamp" resource. So normal use (outside
COMBINE) would be:
{code:xml}
<dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-06-26T10:29:00Z</dcterms:created>
{code}
Our [CombineManifest
code|https://github.com/apache/incubator-taverna-language/blob/0.15.1-incubating/taverna-robundle/src/main/java/org/apache/taverna/robundle/manifest/combine/CombineManifest.java#L366]
supports both variants as the {{parseType=Resource}} variant is commonly used
by COMBINE producers.
The example from JWS Online, however, falls in between. I have let the authors
know and recommended that they use the rdf:value or rdf:datatype variant. The
tavlang converter should then also recognize rdf:value.
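As a rough sketch of what a tolerant reading of dcterms:created could look like, the following works at the XML level and accepts the typed-literal, parseType=Resource, JWS Online and rdf:value shapes. It is not the CombineManifest implementation: the function names and the VARIANTS table are hypothetical, the rdf:value shape is one reading of the recommendation above, and Python's standard library stands in for Jena to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"


def created_timestamp(description):
    """Return the dcterms:created text of an rdf:Description element, or None."""
    created = description.find(f"{{{DCTERMS}}}created")
    if created is None:
        return None
    # Nested shapes first: rdf:value, then dcterms:W3CDTF used either as a
    # property (parseType="Resource") or as a typed node (JWS Online).
    for tag in (f"{{{RDF}}}value", f"{{{DCTERMS}}}W3CDTF"):
        for nested in created.iter(tag):
            if nested.text and nested.text.strip():
                return nested.text.strip()
    # Otherwise treat the element content as a plain or typed literal.
    text = (created.text or "").strip()
    return text or None


def parse_created(fragment):
    """Wrap a dcterms:created fragment in rdf:RDF and extract its timestamp."""
    xml = (f'<rdf:RDF xmlns:rdf="{RDF}" xmlns:dcterms="{DCTERMS}">'
           f'<rdf:Description rdf:about=".">{fragment}</rdf:Description>'
           f'</rdf:RDF>')
    root = ET.fromstring(xml)
    return created_timestamp(root.find(f"{{{RDF}}}Description"))


# Hypothetical test table; the first three fragments are copied from this issue.
VARIANTS = {
    "typed literal":
        '<dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">'
        '2014-06-26T10:29:00Z</dcterms:created>',
    "parseType=Resource":
        '<dcterms:created rdf:parseType="Resource">'
        '<dcterms:W3CDTF>2014-06-26T10:29:00Z</dcterms:W3CDTF></dcterms:created>',
    "JWS Online":
        '<dcterms:created>'
        '<dcterms:W3CDTF>2018-05-10T02:38:51Z</dcterms:W3CDTF></dcterms:created>',
    "rdf:value":
        '<dcterms:created><dcterms:W3CDTF>'
        '<rdf:value>2018-05-10T02:38:51Z</rdf:value>'
        '</dcterms:W3CDTF></dcterms:created>',
}

if __name__ == "__main__":
    for name, fragment in VARIANTS.items():
        print(name, "->", parse_created(fragment))
```

Because this bypasses RDF parsing entirely, it only shows which shapes need recognizing. In the Java code the same cases would presumably be handled on the parsed Jena model.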
While Jena's "riot" on the command line seems able to ignore this syntactic
error and parse the other triples, loading with Jena's RDFDataMgr.read() seems
to bail out on the first error, meaning we also lose the dcterms:creator
statements, which are correctly defined in the metadata.rdf.
This issue is to investigate whether this error can be reduced to a warning,
and to add support for the rdf:value variant, which we can recommend to
JWS Online instead of the semantically broken parseType="Resource" pattern.