Hi all,

I've been following this thread with interest, as it touches on some of the more general confusion/discomfort I've been developing over the past several weeks or months about the use of XMLLiteral with rdf:parseType="Literal" for HTML content.
Adam's comments below are particularly interesting. In general, it's not clear to me who benefits from the use of the unescaped literal representation, or in what scenario. And that approach, then, requires the use of the XMLLiteral type, which I also wonder about (as I'll explain further). If there is some benefit that I don't know about, perhaps it derails this whole line of thought. But if there isn't, could this be a case of the concrete representation tail wagging the abstract syntax dog?

One thing that always struck me as odd was that rdf:parseType="Literal" examples were the only ones I could find anywhere in OSLC that use typed literals (the XMLLiteral type is implicit with this special RDF/XML syntax). Moreover, I couldn't find any guidance in the specs about the use of plain vs. typed literals at all. From the perspective of a client, anyway, it would seem a very nice thing if a particular provider would use a typed literal to tell you that a title, for example, should be treated as a simple string or as HTML content. And that's the very thing that typed literals do. It could be that the presence of an XMLLiteral type is supposed to signal the use of XHTML content, and the absence of any type is supposed to signal plain text. But I couldn't find that spelled out anywhere -- if it is, perhaps it's hard to find, or perhaps I just did a poor job of looking -- and I'd argue it would be better to include types in both cases. [1]

It's this line of thinking that leads me to question the use of XMLLiteral in the first place. I saw in some old discussions that the intention in OSLC was not for XMLLiteral to imply XHTML necessarily. Using it for other XML languages was considered and endorsed, in principle. But where does that leave XHTML? With a type that doesn't really say what it is or what you can do with it.
We have specs that communicate the XHTML intent in words, but we also have a mechanism built into RDF that could communicate this, and we're not really using it fully. Thus, I think it would be preferable to define and use a type that specifically represents HTML. And note, I suggest HTML, not XHTML, since using any type other than XMLLiteral eliminates the "benefit" of the special rdf:parseType="Literal" syntax. And without that, I don't see a particular benefit in the stricter XHTML syntax.

One other possibility that I've considered, which Arthur suggested previously, is using a namespace to identify that the XML is XHTML, in particular, instead of doing it directly in the literal type. And I believe that, strictly, the XHTML namespace is required for the elements to be valid XHTML. But I found no hint of this in the spec or any examples, and certainly RTC doesn't do this (I haven't checked other providers). Moreover, I believe it's also a worse approach, since there's no guarantee that your RDF runtime of choice will give you access to namespaces declared on the property element (I don't believe Jena does), and detecting a namespace inside the element content would require actually parsing the value as XML. If all you want to do is pass markup along for display in a browser, it would be unfortunate to have to actually parse the content to determine that it's XHTML.

And this is where I close the loop on my thinking, by coming back to how a consumer might actually want to make use of HTML content. Even outside of the compact rendering scenario, ultimately it's probably going to get displayed by a browser, whether as part of a larger Web page or in a browser-backed widget in a rich client. And for that, HTML is probably just as good as, if not better than, XHTML. Rather than worrying about whether the content is well-formed XML, it's probably sufficient to just give it to the browser and see what it can do with it.
I would assert that "something a browser can render" has been the working definition of HTML for a good number of years now, while XHTML has largely faded in importance.

Going the other way, the appeal of HTML really shows. If a provider natively deals with HTML (without concern for XML well-formedness), it would be attractive to not have to convert that into XHTML to expose it via OSLC. Likewise, a consumer may use a rich text control that yields HTML. Generalized parsing of HTML for conversion to XHTML is non-trivial, and it seems unfortunate to impose that conversion task onto everyone, just so that we can use rdf:parseType="Literal" in RDF/XML and avoid applying normal XML encoding to markup content (of course, some encoding will likely be required for other RDF syntaxes anyway).

So, those are my thoughts on this (admittedly enlarged) topic. Even if they all do make perfect sense (and I'm not necessarily claiming they do), I realize we may be well past the point of being able to act on them. Still, I thought I'd put them out there and see what others make of them.

Cheers,
Dave

[1] In fact, I think that the consistent use of typed literals in general would be beneficial. You could even imagine exploiting them as a compatibility measure, if it was decided that the type of a property needed to change. This is a related, but separate, topic, which I'd be thrilled to discuss further, but I don't want to open too many cans of worms at once.
[2] Or, perhaps, a less kind way of putting that is that the XHTML namespace is required for the elements to

--
Dave Steinberg
IBM Rational Software
[email protected]

From: Adam Archer/Toronto/IBM@IBMCA
To: Arthur Ryman/Toronto/IBM@IBMCA
Cc: "[email protected]" <[email protected]>, Randy Hudson <[email protected]>, [email protected]
Date: 08/22/2011 06:20 PM
Subject: Re: [oslc-core] OSLC Compact representation, titles with markup
Sent by: [email protected]

The big concern to me is not the ability to process the RDF/XML with XPath, it's the ability to do so in a browser environment. Currently all implementations of all rich hovers in all Jazz-based products encode any HTML tags in their dcterms:title attributes (and doubly encode special characters). For the consumer on the browser side, this means simply taking the content of the attribute, decoding it (which browsers are very good at) and slapping the result into the DOM (which browsers are also very good at).

The alternative would be a total consumability nightmare from the point of view of a browser (which is the most important consumer of this entire spec). If the tags are actually child nodes in the XML representation, we will have child elements in the document that we get back from the XMLHttpRequest. That means we will have to traverse a DOM tree and recreate a structure which could easily be represented as an escaped string, like everyone is doing today.

I realize that implementation is not supposed to lead the spec, but I don't even think that would be the case here. The OSLC compact spec grew organically out of the old Jazz compact rendering spec, which can be found here:

https://jazz.net/wiki/bin/view/Sandbox/CompactRenderingV1P1

If we look at the semantic description of dc:title and jp:abbreviation, it states explicitly that the content MUST be escaped:

> The HTML markup MUST be escaped; for example, "<b>" as "&lt;b&gt;".
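
A minimal sketch of the decode step described above, written in Python rather than browser JavaScript so it can be run on its own. The sample titles are adapted from the examples elsewhere in this thread, and the helper names title_markup and displayed_text are hypothetical, not taken from any Jazz product or from the spec. It only illustrates that an XML parser undoes one level of escaping, leaving an HTML string that a consumer can hand to a browser as-is.

```python
import html
import xml.etree.ElementTree as ET

# Namespace declaration for Dublin Core terms (standard URI).
DCTERMS_NS = 'xmlns:dcterms="http://purl.org/dc/terms/"'

# Title whose HTML markup is escaped, as the Jazz compact rendering spec requires.
ESCAPED_TITLE = (
    f"<dcterms:title {DCTERMS_NS}>"
    "12345: &lt;s&gt;Null pointer exception during startup&lt;/s&gt;"
    "</dcterms:title>"
)

# Title containing a literal ">" character, which ends up double encoded.
DOUBLE_ENCODED_TITLE = (
    f"<dcterms:title {DCTERMS_NS}>"
    "12345: Values &amp;gt; 1000 incorrectly calculated"
    "</dcterms:title>"
)


def title_markup(xml_text):
    """Return the element's text; the XML parser removes one level of escaping."""
    return ET.fromstring(xml_text).text or ""


def displayed_text(markup):
    """Approximate what a browser shows for tag-free markup by decoding HTML entities."""
    return html.unescape(markup)


if __name__ == "__main__":
    print(title_markup(ESCAPED_TITLE))
    print(title_markup(DOUBLE_ENCODED_TITLE))
    print(displayed_text(title_markup(DOUBLE_ENCODED_TITLE)))
```

In a browser, the string returned by title_markup is what would be assigned into the DOM; no tree walking is needed on the consumer side.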
This decision was made consciously for very well defined technical reasons (discussed above) in the original spec. If that decision was reversed when the OSLC compact spec was created, then I believe that was a huge mistake, and I would like to see the spec fixed rather than have all providers change how their compact documents are served and all consumers go to the trouble of walking the DOM to determine what the provider is actually trying to show.

Adam Archer
Jazz Developer
IBM Toronto Lab

From: Arthur Ryman/Toronto/IBM
To: Samuel Padgett <[email protected]>
Cc: Adam Archer/Toronto/IBM@IBMCA, Randy Hudson <[email protected]>, "[email protected]" <[email protected]>, [email protected]
Date: 08/22/2011 04:40 PM
Subject: Re: [oslc-core] OSLC Compact representation, titles with markup

Sam,

You wrote:

> It's very difficult to parse the former using XPath. For instance, the expression "/oslc:Compact/dcterms:title" takes out the "<s>" and "</s>"

I don't think problems using XPath are a valid reason to encode markup, since RDF/XML itself is very difficult to process using XPath. At one point we tried to define an OSLC variant of RDF/XML that looked like "normal" XML. However, we abandoned that and now require support for generic RDF/XML. There are many equivalent ways to represent a given set of triples in RDF/XML. It would therefore be very problematic to use XPath, XSLT, or XQuery to process RDF/XML. The safe way to process RDF/XML is to use an RDF toolkit like Jena.
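
To illustrate the point about equivalent serializations, here is a rough sketch using only Python's standard XML library (not an RDF toolkit). The two documents below are hypothetical; they describe the same two triples about an example resource, once with a typed node element and once with rdf:Description plus an explicit rdf:type. The resource URI, the rdf:RDF wrapper, and the helper name find_title are assumptions made for the example. A fixed element path finds the title in one form and misses it in the other.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "oslc": "http://open-services.net/ns/core#",
}

# Namespace declarations shared by both sample documents.
XMLNS = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())

# Form 1: typed node element (the shape an XPath author usually expects).
TYPED_NODE = f"""<rdf:RDF {XMLNS}>
  <oslc:Compact rdf:about="http://example.com/bugs/12345">
    <dcterms:title>12345: Null pointer exception during startup</dcterms:title>
  </oslc:Compact>
</rdf:RDF>"""

# Form 2: same triples, written with rdf:Description and an explicit rdf:type.
DESCRIPTION = f"""<rdf:RDF {XMLNS}>
  <rdf:Description rdf:about="http://example.com/bugs/12345">
    <rdf:type rdf:resource="http://open-services.net/ns/core#Compact"/>
    <dcterms:title>12345: Null pointer exception during startup</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""


def find_title(document):
    """Look up the title with a fixed element path; return None if the path misses."""
    node = ET.fromstring(document).find("oslc:Compact/dcterms:title", NS)
    return None if node is None else node.text


if __name__ == "__main__":
    print(find_title(TYPED_NODE))   # path matches the typed node form
    print(find_title(DESCRIPTION))  # same triples, but the path finds nothing
```

An RDF toolkit would parse both documents to the same graph, which is why querying the triples is more robust than querying the XML tree.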
Regards,
___________________________________________________________________________
Arthur Ryman
DE, PPM Chief Architect
IBM Software, Rational
Toronto Lab | +1-905-413-3077
Twitter | Facebook | YouTube

From: Samuel Padgett <[email protected]>
To: "[email protected]" <[email protected]>
Cc: Adam Archer/Toronto/IBM@IBMCA, Randy Hudson <[email protected]>
Date: 08/07/2011 01:01 PM
Subject: [oslc-core] OSLC Compact representation, titles with markup
Sent by: [email protected]

I believe the spec is a bit confusing when it comes to titles with markup for UI Preview. The Compact representation has a dcterms:title property. It's defined as an XML Literal that can contain XHTML markup [1]. My understanding of XML Literals as discussed in the RDF Primer [2] means a title with markup would look like this,

<dcterms:title>12345: <s>Null pointer exception during startup</s></dcterms:title>

The example [3] of this resource has a title like this, however,

<dcterms:title>
12345: &lt;s&gt;Null pointer exception during startup&lt;/s&gt;
</dcterms:title>

The example doesn't seem to fit with the description.

It's very difficult to parse the former using XPath. For instance, the expression "/oslc:Compact/dcterms:title" takes out the "<s>" and "</s>"

Most implementations I'm aware of also follow the example where markup is encoded. It means special characters need to be "double encoded." For instance, "12345: Values > 1000 incorrectly calculated" would be,

<dcterms:title>12345: Values &amp;gt; 1000 incorrectly calculated</dcterms:title>

I think we should add more clarity to the spec here, as getting this wrong can open up consumers to cross-site scripting attacks. I'd also suggest we say that providers MUST NOT use any markup with a <script> tag and consumers MUST NOT display any markup with a <script> tag to guard against this problem.
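
As a rough illustration of the XPath observation above, the following sketch uses Python's standard ElementTree module instead of a full XPath engine. The sample document is the unescaped title wrapped in a hypothetical oslc:Compact root, and the variable names are made up for the example. Taking the text value of the title element loses the markup, and getting it back means re-serializing the child elements.

```python
import xml.etree.ElementTree as ET

NS = {
    "dcterms": "http://purl.org/dc/terms/",
    "oslc": "http://open-services.net/ns/core#",
}

# Unescaped form: the <s> tags are real child elements of dcterms:title.
UNESCAPED = (
    '<oslc:Compact xmlns:oslc="http://open-services.net/ns/core#" '
    'xmlns:dcterms="http://purl.org/dc/terms/">'
    "<dcterms:title>12345: <s>Null pointer exception during startup</s></dcterms:title>"
    "</oslc:Compact>"
)

# Relative to the oslc:Compact root, this is the /oslc:Compact/dcterms:title lookup.
title = ET.fromstring(UNESCAPED).find("dcterms:title", NS)

# Text before the first child element only.
leading_text = title.text

# Concatenation of all text nodes, which is what an XPath string value gives:
# the words survive but the <s> and </s> tags are gone.
string_value = "".join(title.itertext())

# Recovering the markup requires walking the children and serializing each one.
markup = (title.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in title
)

if __name__ == "__main__":
    print(repr(leading_text))
    print(repr(string_value))
    print(repr(markup))
```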
Best Regards,
Sam

[1] http://open-services.net/bin/view/Main/OslcCoreUiPreview?sortcol=table;up=#Representation_Compact
[2] http://www.w3.org/TR/rdf-syntax/#xmlliterals
[3] http://open-services.net/bin/view/Main/OslcCoreUiPreview?sortcol=table;up=#XML_Representation_Format

_______________________________________________
Oslc-Core mailing list
[email protected]
http://open-services.net/mailman/listinfo/oslc-core_open-services.net
