Hi again Arthur,

Lots of good points to address, so my responses are inline (also note that I wrote my previous, short reply to your later message in the middle of writing this longer one, so there may be a bit of overlap).

[Disclaimer: I realize I'm talking a lot about Jena in this message. That's because it's what I know. If there are other RDF toolkits in use, their behaviour is certainly relevant, too, and it would be good to hear from people with knowledge of them.]

Arthur Ryman/Toronto/IBM wrote on 08/25/2011 04:27:40 PM:

>
> 1. XML Namespaces.
>
> The spec doesn't say "for XHTML, you need to insert an xmlns attribute for
> http://www.w3.org/1999/xhtml" because that is part of the XHTML
> standard, i.e. it's not XHTML unless the elements are in the XHTML namespace.

I agree that XHTML does mean using the XHTML namespace, but I also believe it would have been helpful to underline that fact in the OSLC Core spec. Also, I still can't find a single example in the spec that actually shows markup in the value of an XMLLiteral-typed literal, which would have been helpful as well. I notice that RTC doesn't use a namespace, so I suppose it wasn't obvious to them.

> 2. Jena
>
> I loaded the sample RDF/XML into Fuseki which uses Jena and it
> produced the correct result. I assume the Jena API lets you get an
> XML DOM from the literal value.

Thanks for this! I didn't know this, and it's reassuring. I was able to replicate this behaviour with pure Jena, simply by loading an RDF/XML resource containing your input and printing the type and lexical form of the literal:

  Datatype: http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
  Lexical form: 12345: <s xmlns="http://www.w3.org/1999/xhtml">Null pointer exception during startup</s>

Unfortunately, though, your assumption is not correct: Jena provides no access to an underlying DOM. It appears that during parsing it computes the exclusive canonical form of the XML subset, as the RDF specs prescribe, and simply records that as the literal's lexical value. But Jena's built-in XMLLiteral type support just uses java.lang.String as its value type. I was surprised when I first discovered that, too, but you can see for yourself in XMLLiteralType.parse().

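Since the toolkit hands back only the string, a consumer that wants a DOM has to parse the lexical form itself. Here's a minimal sketch of that step, written in Python with the standard library rather than Jena, purely for illustration. The helper name xml_literal_to_dom and the throwaway <wrapper> root element are my own assumptions; the wrapper is there because the lexical form above is mixed content (text followed by an element), not a single-rooted document.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"


def xml_literal_to_dom(lexical_form):
    """Parse an XMLLiteral lexical form (a plain string) into a DOM element.

    The lexical form may have no single root element, so it is wrapped in
    a throwaway <wrapper> element (a hypothetical name) before parsing.
    The returned element is that wrapper; its children are the literal's
    actual content.
    """
    doc = minidom.parseString("<wrapper>" + lexical_form + "</wrapper>")
    return doc.documentElement


# The lexical form printed in the Jena experiment above.
lexical = (
    '12345: <s xmlns="http://www.w3.org/1999/xhtml">'
    "Null pointer exception during startup</s>"
)

root = xml_literal_to_dom(lexical)

# Look the element up by namespace, which only works because the
# markup was declared in the XHTML namespace.
strike = root.getElementsByTagNameNS(XHTML_NS, "s")[0]

print(root.firstChild.data)    # leading text node of the literal
print(strike.namespaceURI)     # namespace of the <s> element
print(strike.firstChild.data)  # text inside the <s> element
```

The same lookup against markup with no xmlns declaration would find nothing, which is one practical consequence of leaving the namespace out.
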
> 3. XHTML versus HTML
>
> The primary reason is that RDF supports XHTML via the XMLLiteral
> datatype. There is no parsing support for HTML built into RDF.

RDF doesn't require parsing support to use other datatypes, at least not in the abstract sense. Abstractly, a typed literal is just a datatype (identified by a URI) and a lexical form. Of course, it's very helpful in an implementation if you can automatically convert that to a more useful typed value (e.g. an Integer, Boolean, etc.), but that's purely a toolkit concern. Jena will support any datatype via its BaseDatatype.TypedValue wrapper type, but it's pluggable, so you can add specific parsing support for any type you wish.

> Another strong reason is that the syntax of HTML is very irregular
> and hard to parse correctly - that is one of the reasons XML was
> invented. This is very important from a security viewpoint. To guard
> against script injection attacks, you really should parse the input
> and remove any <script> elements or Javascript attributes. Doing
> that correctly for HTML requires a full HTML parser. On the other
> hand, the XHTML is given to you as a DOM which you can easily
> traverse or process using XSLT or XPATH.

But as you said, HTML Tidy can be used to parse HTML. You could even plug it into Jena to do the parsing automatically for a defined datatype, which would be more convenient than the built-in XMLLiteral support.

Also, I really think that whether or not you need to parse and cleanse depends upon what you're doing with the data. It may not be necessary for an OSLC adapter that's merely passing along data from a trusted source, since it's incumbent upon a security-conscious client to do that itself.

> 4. Datatypes
>
> The specs do specify the datatypes for some properties. Look at the
> Value-Type column of the tables, e.g. [1]. You need to include the
> datatype explicitly for ints, dates, XML. etc. You specify that
> using rdf:datatype in RDF/XML, or using ^^ in Turtle.

Sorry if I wasn't clear. What I meant was that the specs don't appear to say whether consumers and producers should use/expect typed literals or plain literals. I do see that the tables prescribe datatypes for the various literal-typed properties, but they don't say anywhere whether that means that typed literals should actually be used to specify those types, or that plain literals should be used (since those prescribed types are already presumed to be known).

> 5. Inventing new Datatypes
>
> The RDF spec defines the XSD datatypes and the XMLLiteral datatype.
> RDF parsers know how to parse those. If someone introduces a new
> datatype URI, it could break parsers since they won't know how to
> parse the contents. There is no standard way to define new datatypes.
>
> Try it with the RDF Validation service [2]

I had no problem parsing the following RDF/XML document (in which I've used my own HTML type) with the service:

  <rdf:RDF
      xmlns:dcterms="http://purl.org/dc/terms/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:oslc_cm="http://open-services.net/ns/cm#" >
    <rdf:Description rdf:about="http://example.com/bugs/2314">
      <dcterms:title rdf:datatype="http://open-services.net/ns/core/types#HTML">12345: <s>Null pointer exception during startup</s></dcterms:title>
      <rdf:type rdf:resource="http://open-services.net/ns/cm#ChangeRequest"/>
    </rdf:Description>
  </rdf:RDF>

Cheers,
Dave

--
Dave Steinberg
IBM Rational Software
[email protected]
