Re: [oslc-core] OSLC Compact representation, titles with markup

Dave Steinberg Fri, 26 Aug 2011 16:38:39 -0400

Hi again Arthur,

Lots of good points to address, so my responses are inline (also note that
I wrote my previous, short reply to your later message in the middle of
writing this longer one, so there may be a bit of overlap).


[Disclaimer: I realize I'm talking a lot about Jena in this message. That's
because it's what I know. If there are other RDF toolkits in use, their
behaviour is certainly relevant, too, and it would be good to hear from
people with knowledge of them.]

Arthur Ryman/Toronto/IBM wrote on 08/25/2011 04:27:40 PM:
>
> 1. XML Namespaces.
>
> The spec doesn't say "for XHTML, you need to insert an xmlns attribute
> for http://www.w3.org/1999/xhtml" because that is part of the XHTML
> standard, i.e. it's not XHTML unless the elements are in the XHTML
> namespace.

I agree that XHTML does mean using the XHTML namespace, but I also believe
it would have been helpful to underline that fact in the OSLC Core spec.
Also, I still can't find a single example in the spec that actually shows
markup in the value of an XMLLiteral-typed literal, which also would have
been helpful. I notice that RTC doesn't use a namespace, so I suppose it
wasn't obvious to them.

> 2. Jena
>
> I loaded the sample RDF/XML  into Fuseki which uses Jena and it
> produced the correct result. I assume the Jena API lets you get an
> XML DOM from the literal value.

Thanks for this! I didn't know this, and it's reassuring. I was able to
replicate this behaviour with pure Jena, simply by loading an RDF/XML
resource containing your input and printing the type and lexical form of
the literal:

Datatype: http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
Lexical form:  12345: <s xmlns="http://www.w3.org/1999/xhtml">Null pointer
exception during startup</s>

Unfortunately, though, your assumption is not correct: Jena provides no
access to an underlying DOM. It appears that during parsing it computes the
exclusive canonical form of the XML subset, as the RDF specs prescribe, and
simply records that as the literal's lexical form. However, Jena's built-in
XMLLiteral type support just uses java.lang.String as its value type. I was
surprised when I first discovered that, too, but you can see for yourself
in XMLLiteralType.parse().
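
Since the literal's value is only a String, a client that wants a DOM has
to build one itself. The following is a minimal sketch of one way to do
that with the JDK's own XML parser, assuming the lexical form has already
been read from the literal (for example with Literal.getLexicalForm()).
The class XmlLiteralDom and its synthetic <wrapper> root are my own
illustrative assumptions, not part of Jena. The wrapper is needed because
the lexical form above is mixed content ("12345: " followed by an element),
which is not a well-formed document on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class XmlLiteralDom {

    /**
     * Hypothetical helper (not a Jena API): parses the lexical form of an
     * XMLLiteral into a DOM. The content is wrapped in a synthetic root
     * element so that leading text and multiple top-level elements parse.
     */
    public static Element parseLexicalForm(String lexicalForm) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to see the XHTML namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        String wrapped = "<wrapper>" + lexicalForm + "</wrapper>";
        return builder.parse(new InputSource(new StringReader(wrapped)))
                      .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String lexical = "12345: <s xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "Null pointer exception during startup</s>";
        Element root = parseLexicalForm(lexical);
        Element s = (Element) root
                .getElementsByTagNameNS("http://www.w3.org/1999/xhtml", "s")
                .item(0);
        System.out.println("Element:   " + s.getLocalName());
        System.out.println("Namespace: " + s.getNamespaceURI());
        System.out.println("Text:      " + root.getTextContent());
    }
}
```

The children of the returned wrapper element are the literal's actual
content; the wrapper itself should be ignored by callers.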

> 3.  XHTML versus HTML
>
> The primary reason is that RDF supports XHTML via the XMLLiteral
> datatype. There is no parsing support for HTML built into RDF.

RDF doesn't require parsing support to use other datatypes, at least not in
the abstract sense. Abstractly, a typed literal is just a datatype
(identified by a URI) and a lexical form. Of course, it's very helpful in
an implementation if you can automatically convert that to a more useful
typed value (i.e. an Integer, Boolean, etc.), but that's purely a toolkit
concern. Jena will support any datatype via its BaseDatatype.TypedValue
wrapper type, but it's pluggable, so you can add specific parsing support
for any type you wish.
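
To make the abstract view concrete, here is a minimal Java sketch of a
typed literal as nothing more than a datatype URI plus a lexical form, with
an optional, pluggable parser per datatype. This is deliberately not
Jena's API: TypedLiteralSketch, register() and valueOf() are names I made
up for illustration. The point is only that an unknown datatype need not
break anything, because the lexical form can be handed back as-is.

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class TypedLiteralSketch {

    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Registry of parsers keyed by datatype URI (illustrative only).
    private static final Map<String, Function<String, Object>> PARSERS = new HashMap<>();

    static {
        PARSERS.put(XSD + "integer", s -> new BigInteger(s));
        // xsd:boolean allows the lexical forms true, false, 1 and 0.
        PARSERS.put(XSD + "boolean", s -> s.equals("true") || s.equals("1"));
    }

    /** Plugs in parsing support for an additional datatype. */
    public static void register(String datatypeUri, Function<String, Object> parser) {
        PARSERS.put(datatypeUri, parser);
    }

    /**
     * Converts a (datatype URI, lexical form) pair to a Java value.
     * Unknown datatypes fall back to the unparsed lexical form.
     */
    public static Object valueOf(String datatypeUri, String lexicalForm) {
        Function<String, Object> parser = PARSERS.get(datatypeUri);
        return parser == null ? lexicalForm : parser.apply(lexicalForm);
    }

    public static void main(String[] args) {
        String html = "http://open-services.net/ns/core/types#HTML";
        // Known type: converted to a BigInteger.
        System.out.println(valueOf(XSD + "integer", "12345").getClass().getName());
        // Unknown type: the lexical form comes back untouched.
        System.out.println(valueOf(html, "<s>Null pointer exception</s>"));
    }
}
```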

> Another strong reason is that the syntax of HTML is very irregular
> and hard to parse correctly - that is one of the reasons XML was
> invented. This is very important from a security viewpoint. To guard
> against script injection attacks, you really should parse the input
> and remove any <script> elements or Javascript attributes. Doing
> that correctly for HTML requires a full HTML parser. On the other
> hand, the XHTML is given to you as a DOM which you can easily
> traverse or process using XSLT or XPATH.

But as you said, HTML Tidy can be used to parse HTML. You could even plug
it into Jena to do the parsing automatically for a defined datatype, which
would be more convenient than the built-in XMLLiteral support. Also, I
really think that whether or not you need to parse and cleanse depends upon
what you're doing with the data. It may not be necessary for an OSLC
adapter that's merely passing along data from a trusted source, since it's
incumbent upon a security-conscious client to do that itself.
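
For a client that does decide to cleanse, the traversal you describe might
look roughly like the sketch below, which works on a DOM however it was
obtained (from XHTML directly, or from HTML via a tidying parser). The
class MarkupCleanser is a hypothetical name of mine, and the rules are
only the two you mention: drop <script> elements and drop attributes whose
names start with "on". It is an assumption-laden illustration, not a
complete sanitizer; for instance, it does nothing about javascript: URLs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MarkupCleanser {

    /** Parses a well-formed XML string into a namespace-aware DOM. */
    public static Element parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Local name if the node has one, otherwise the plain node name.
    private static String localName(Node node) {
        return node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
    }

    /** Recursively removes script elements and on* attributes below element. */
    public static void cleanse(Element element) {
        // Collect names first: removing while iterating would shift indexes.
        NamedNodeMap attributes = element.getAttributes();
        List<String> handlers = new ArrayList<>();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            if (localName(attribute).toLowerCase(Locale.ROOT).startsWith("on")) {
                handlers.add(attribute.getNodeName());
            }
        }
        for (String name : handlers) {
            element.removeAttribute(name);
        }

        // Snapshot the children, since the NodeList is live.
        NodeList nodes = element.getChildNodes();
        List<Node> children = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            children.add(nodes.item(i));
        }
        for (Node child : children) {
            if (child instanceof Element) {
                if ("script".equalsIgnoreCase(localName(child))) {
                    element.removeChild(child);
                } else {
                    cleanse((Element) child);
                }
            }
        }
    }
}
```

Note that cleanse() only inspects descendants for <script>; the element
passed in is kept even if it is itself a script element.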

> 4. Datatypes
>
> The specs do specify the datatypes for some properties. Look at the
> Value-Type column of the tables, e.g. [1]. You need to include the
> datatype explicitly for ints, dates, XML. etc. You specify that
> using rdf:datatype in RDF/XML, or using ^^ in Turtle.

Sorry if I wasn't clear. What I meant was that the specs don't appear to
say whether consumers and producers should use/expect typed literals or
plain literals. I do see that the tables prescribe datatypes for the
various literal-typed properties, but they don't say anywhere whether that
means that typed literals should actually be used to specify those types,
or that plain literals should be used (since those prescribed types are
already presumed to be known).

> 5. Inventing new Datatypes
>
> The RDF spec defines the XSD datatypes and the XMLLiteral datatype.
> RDF parsers know how to parse those. If someone introduces a new
> datatype URI, it could break parsers since they won't know how to
> parse the contents. There is no standard way to define new datatypes.
>
> Try it with the RDF Validation service [2]

I had no problem parsing the following RDF/XML document (in which I've used
my own HTML type) with the service:

<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:oslc_cm="http://open-services.net/ns/cm#" >
  <rdf:Description rdf:about="http://example.com/bugs/2314">
    <dcterms:title
rdf:datatype="http://open-services.net/ns/core/types#HTML";>12345:
&lt;s&gt;Null pointer exception during startup&lt;/s&gt;</dcterms:title>
    <rdf:type rdf:resource="http://open-services.net/ns/cm#ChangeRequest"/>
  </rdf:Description>
</rdf:RDF>


Cheers,
Dave

--
Dave Steinberg
IBM Rational Software
[email protected]
